Analysis and Project Directories

Analysis directories

The top-level analysis directory is created by the setup command, and typically will have a name based on the sequencing run name with the suffix _analysis, for example:

180817_M00123_0001_000000000-BV1X2_analysis
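The naming convention above can be sketched in a few lines of Python. This is only an illustration and not the code used by the setup command: the function name analysis_dir_name and the /data/runs location are hypothetical.

```python
from pathlib import PurePosixPath


def analysis_dir_name(run_dir: str, suffix: str = "_analysis") -> str:
    """Derive an analysis directory name from a sequencing run directory."""
    # PurePosixPath ignores a trailing slash when taking the last component
    run_name = PurePosixPath(run_dir).name
    return run_name + suffix


# Hypothetical location of the run directory
print(analysis_dir_name("/data/runs/180817_M00123_0001_000000000-BV1X2/"))
```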

The analysis directory will contain the following files and directories produced by the auto_process commands:

File or Directory           Description and contents                                                     Stage
auto_process.info           Parameter file used for processing                                           setup
metadata.info               Metadata for the run                                                         setup
logs/                       Log files from each processing command                                       setup
ScriptCode/                 Directory for custom user scripts                                            setup
SampleSheet.orig.csv        Copy of the original sample sheet file                                       setup
custom_SampleSheet.csv      Updated version of sample sheet used for processing                          setup
primary_data/               Raw sequencing data                                                          make_fastqs
processing_qc.html          Processing QC report                                                         make_fastqs
per_lane_stats.info         Per-lane statistics                                                          make_fastqs
per_lane_sample_stats.info  Per-lane statistics for samples                                              make_fastqs
statistics_full.info        Per-Fastq statistics                                                         make_fastqs
statistics.info             Per-Fastq statistics                                                         make_fastqs
projects.info               Metadata for all projects                                                    make_fastqs
<bcl2fastq>/                Output from bcl2fastq (can be set explicitly using the --output-dir option)  make_fastqs
<PROJECT>/                  Project directory (one for each project defined in projects.info)            setup_analysis_dirs
undetermined/               Project directory for undetermined reads                                     setup_analysis_dirs
README.txt                  Text file with user notes on the run (e.g. unusual processing steps)         readme

Analysis directory metadata

Each analysis has additional data items associated with it which are stored in the metadata.info file.

The most commonly used metadata items are listed in the table below:

Item                Description
run_number          Facility-assigned identifier which can differ from the instrument run number
source              Source of the sequencing data, for example the name of the facility, institution or service that provided it
platform            The sequencing platform (e.g. miseq)
bcl2fastq_software  Location and version of the package used to perform the Fastq generation

The full set of metadata items and values can be viewed using the metadata command:

auto_process.py metadata [ANALYSIS_DIR]

This metadata is not required for processing, but should be set before the QC is published and the analysis is completed. Items can be set or updated using the --set option of the metadata command, for example:

auto_process.py metadata --set run_number=88
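For scripts that need these values directly, the following is a minimal sketch of reading such a file in Python. It assumes that metadata.info stores one item per line as a key and a value separated by a tab; that layout is an assumption and is not stated above, so check an actual file first. The function name parse_info_lines is hypothetical and is not part of auto_process.py.

```python
from typing import Dict, Iterable, Optional


def parse_info_lines(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Parse 'key<TAB>value' lines into a dictionary (assumed layout)."""
    items: Dict[str, Optional[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, _, value = line.partition("\t")
        # An empty value is reported as None, i.e. "not set"
        items[key] = value if value != "" else None
    return items


# Made-up example content, not taken from a real run
example = ["run_number\t88\n", "platform\tmiseq\n", "source\t\n"]
print(parse_info_lines(example))
```

In practice the lines would come from an open file handle on metadata.info rather than from a list.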

Run reference ID

Each analysis directory has a “run reference ID” which is generated automatically from the associated metadata (specifically the platform, run datestamp, instrument run number and facility run number).

The general form of the reference ID is:

PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER

The instrument run number is included if it differs from the facility run number (or if the facility run number is not supplied).

For example:

HISEQ4000_181029/88#72

is a run from a HiSeq 4000 instrument with datestamp 181029 and instrument run number 88; the facility assigned run ID 72 to the run as its local identifier.

MISEQ_180912#3

is a run from a MiSeq instrument with datestamp 180912, where both the instrument and facility run numbers are 3.
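The rule for building the reference ID can be sketched as a small Python function. This is an illustration of the format described above, not the implementation used by auto_process.py. The function name run_reference_id is hypothetical, and the sketch assumes that the #FACILITY_RUN_NUMBER part is simply left out when no facility run number has been set.

```python
from typing import Optional


def run_reference_id(platform: str, datestamp: str,
                     instrument_run_number: Optional[int] = None,
                     facility_run_number: Optional[int] = None) -> str:
    """Build an ID of the form PLATFORM_DATESTAMP[/INSTRUMENT]#FACILITY."""
    ref_id = f"{platform.upper()}_{datestamp}"
    # Include the instrument run number only when it differs from the
    # facility run number (or when there is no facility run number)
    if (instrument_run_number is not None
            and instrument_run_number != facility_run_number):
        ref_id += f"/{instrument_run_number}"
    # Assumption: the '#...' part is dropped if no facility number is set
    if facility_run_number is not None:
        ref_id += f"#{facility_run_number}"
    return ref_id


print(run_reference_id("hiseq4000", "181029", 88, 72))  # HISEQ4000_181029/88#72
print(run_reference_id("miseq", "180912", 3, 3))        # MISEQ_180912#3
```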

Project directories

Project directories are created within the analysis directory by the setup_analysis_dirs command, based on the contents of the projects.info file.

Each project directory will contain the following files and directories:

File or Directory          Description and contents
README.info                Project metadata
fastqs/                    Fastq files
ScriptCode/                Directory for custom user scripts
qc/                        QC pipeline outputs
qc_report.html             QC report
qc_report.PROJECT.RUN.zip  ZIP file containing all QC outputs and reports
multiqc_report.html        multiqc outputs
multiqc_report_data/       Data associated with multiqc
cellranger_count/          Full outputs from cellranger count single library analyses (10xGenomics projects only)

Project directory metadata

Each analysis project has additional data items associated with it which are stored in the project’s README.info file.

The most commonly used metadata items are listed in the table below:

Item                  Description
Name                  Project name
Run                   Parent run name
Platform              Sequencing platform (e.g. novaseq6000)
Sequencer model       Model of sequencer used (e.g. NovaSeq 6000)
User                  Name of the user(s)
PI                    Name of PI(s)
Organism              Organism name(s)
Library type          The type of experiment (e.g. RNA-seq)
Single cell platform  Single cell platform, if applicable
Number of cells       Number of cells (single cell only)
ICELL8 well list      Well list file (ICELL8 only)
Paired_end            Whether the data are single- or paired-end
Primary fastqs        Subdirectory holding the ‘primary’ set of Fastq files for the project
Samples               Number of physical samples and list of their names
Biological samples    Subset of the physical samples that contain biological data (if not set, all samples are assumed to contain biological data)
Multiplexed samples   List of multiplexed sample names for 10x Genomics CellPlex and Flex data and Parse Evercode data (if not set then there are no multiplexed samples; if set to ? then there are multiplexed samples but the names are not known)
Comments              Any additional comments about the project

Typically most of the values are populated at setup time from the contents of the projects.info file (see Setting up analysis directories), with the others being set automatically (for example after running single cell analyses).

Multiple Fastq sets within projects

Normally each project will only have one set of Fastq files associated with it, and these will be in the fastqs subdirectory of the project directory.

However, some analyses may have more than one set of associated Fastqs, and in these cases there will be multiple subdirectories, each containing one of these sets.

For example, ICELL8 single cell projects typically have two or three sets of Fastqs:

  • fastqs.samples are the Fastqs after filtering and QC, with the reads assigned to samples according to the well list file

  • fastqs.barcodes are the Fastqs after filtering and QC, with the reads assigned to barcodes according to the well list file

  • (if present) fastqs are the original Fastq files produced by the BCL to Fastq conversion, without any additional filtering or QC

The project metadata file includes the item Primary fastqs which indicates which of the Fastq sets is the principal one.
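As an illustration, the sketch below lists the Fastq sets in a project directory and resolves the primary one. It assumes that Fastq sets are subdirectories named either fastqs or fastqs.<something>, which is inferred from the examples above rather than a documented rule. The function names are hypothetical, and the value of Primary fastqs is passed in by the caller rather than read from README.info.

```python
from pathlib import Path
from typing import List


def fastq_sets(project_dir: Path) -> List[str]:
    """Return the names of subdirectories that look like Fastq sets."""
    # Assumed naming convention: 'fastqs' or 'fastqs.<label>'
    return sorted(
        d.name for d in project_dir.iterdir()
        if d.is_dir() and (d.name == "fastqs" or d.name.startswith("fastqs."))
    )


def primary_fastqs_dir(project_dir: Path, primary: str = "fastqs") -> Path:
    """Resolve the primary Fastq set, checking that it exists."""
    path = project_dir / primary
    if not path.is_dir():
        raise FileNotFoundError(f"no Fastq set '{primary}' in {project_dir}")
    return path
```

For an ICELL8 project, calling primary_fastqs_dir(project_dir, "fastqs.samples") would return the path of that subdirectory if fastqs.samples were recorded as the primary set.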

undetermined project directory

This is a special project that is created for storage and QC of the reads which could not be assigned to any sample during Fastq generation.