Overview
The core utility auto_process.py implements pipeline stages
as subcommands. For a standard sequencing run the workflow looks
like:
auto_process.py setup initialises a new analysis directory for processing a sequencing run.
auto_process.py make_fastqs performs the demultiplexing of the raw sequencer output (BCL files) into Fastq files. It is a wrapper around Illumina’s bcl2fastq (for 10xGenomics single cell data it wraps cellranger) with additional functionality to fetch data, detect errors, analyse barcodes, generate statistics and handle special cases.
auto_process.py setup_analysis_dirs sets up individual project directories for each of the projects within the sequencing run.
auto_process.py run_qc runs the QC pipeline (including fastqc and fastq_screen) on each of the project directories and generates a report for each one.
auto_process.py publish_qc publishes the QC reports from make_fastqs and run_qc to a web-accessible location to be viewed by core facility staff and bioinformaticians.
auto_process.py archive copies the analysis directory and contents from the working area to the archive storage area, for subsequent bioinformatics analysis.
auto_process.py report generates reports on analysis directories in various formats.
Additional helper commands are available:
auto_process.py samplesheet checks the supplied samplesheet file for errors and allows it to be edited.
auto_process.py readme creates a README text file for notes.
auto_process.py metadata allows the metadata associated with an analysis directory to be viewed and updated.
auto_process.py analyse_barcodes (re)analyses the barcode sequences in the Fastqs produced by make_fastqs.
auto_process.py merge_fastq_dirs merges together data outputs from multiple make_fastqs commands.
auto_process.py update_fastq_stats regenerates the statistics and processing reports.
Terminology
This documentation refers to various concepts such as analysis directories, analysis projects, sequencing runs and so on. We define these as follows:
(Sequencing) run: the primary sequencing data (i.e. BCL files) produced by a run of a sequencer
Analysis directory: the top-level directory which holds the outputs from auto_process when processing data from a sequencing run
Analysis project: a subdirectory of an analysis directory which holds the Fastqs and QC outputs for a set of samples (aka a project) from the sequencing run; there can be multiple analysis projects within a single analysis directory
Additionally the terms data source, working area, QC server and archive storage refer to components of the compute infrastructure where the processing takes place:
Data source: location where the data from the sequencing run are located
Working area: location where the processing is performed, and which holds the analysis directory and contents during the processing
QC server: location where QC reports are published to, and which can be accessed via a web server
Archive storage: location where the final outputs (i.e. the analysis directory and projects) are copied to and stored once processing is completed, for subsequent analysis by the bioinformaticians
These can all be on the same filesystem on a single machine;
or one or more parts can be NFS filesystems, or even
filesystems mounted on other machines which are accessed
using ssh.
Supported sequencer platforms
The pipeline is currently used for output from the following Illumina sequencers:
NovaSeq 6000
HiSeq 4000
MiSeq
NextSeq
MiniSeq
iSeq
Earlier versions have been used on GAIIx and HiSeq 2000/2500.
Supported single cell & spatial platforms
To varying degrees the pipeline supports handling data from a number of single cell platforms:
The pipeline also has limited support for the following spatial platforms:
Run and Fastq naming conventions
Sequencing runs
Illumina sequencers produce output directories with the following naming structure:
<DATESTAMP>_<INSTRUMENT_ID>_<INSTRUMENT_RUN_NUMBER>_<FLOWCELL_ID>
For example:
181026_NB100234_0021_ABCDYHBGX7
181026 is the datestamp (i.e. 26th October 2018). Some sequencers may use a four-digit year in the datestamp (e.g. 20181026)
NB100234 is the instrument ID, which uniquely identifies the sequencing instrument which produced the data
0021 is the instrument run number
ABCDYHBGX7 is the ID of the flowcell used in the run
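The naming structure above can be illustrated with a short Python sketch that splits a run name into its components. This is not taken from the auto_process code: the names RUN_NAME_PATTERN and split_run_name are hypothetical, and the sketch assumes that the instrument ID never contains an underscore and that the datestamp is either six digits (YYMMDD) or eight digits (YYYYMMDD).

```python
import re
from datetime import datetime

# Hypothetical pattern for
# <DATESTAMP>_<INSTRUMENT_ID>_<INSTRUMENT_RUN_NUMBER>_<FLOWCELL_ID>
# Assumes a 6-digit (YYMMDD) or 8-digit (YYYYMMDD) datestamp and an
# instrument ID that contains no underscores
RUN_NAME_PATTERN = re.compile(
    r"^(?P<datestamp>\d{6}|\d{8})_"
    r"(?P<instrument_id>[^_]+)_"
    r"(?P<run_number>\d+)_"
    r"(?P<flowcell_id>.+)$"
)

def split_run_name(run_name):
    """Split a run directory name into its components.

    Returns a dictionary of components, or None if the name
    doesn't follow the expected structure.
    """
    match = RUN_NAME_PATTERN.match(run_name)
    if match is None:
        return None
    parts = match.groupdict()
    # Two-digit years use %y, four-digit years use %Y
    fmt = "%y%m%d" if len(parts["datestamp"]) == 6 else "%Y%m%d"
    try:
        parts["date"] = datetime.strptime(parts["datestamp"], fmt).date()
    except ValueError:
        # Digits were present but don't form a valid calendar date
        return None
    # Drop leading zeros from the instrument run number (e.g. 0021 -> 21)
    parts["run_number"] = int(parts["run_number"])
    return parts
```

With the example above, split_run_name("181026_NB100234_0021_ABCDYHBGX7") gives the instrument ID NB100234, run number 21 and flowcell ID ABCDYHBGX7. A name that doesn't follow the structure returns None rather than raising an error.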
Fastq files generated by bcl2fastq have the following naming
structure:
<SAMPLE_NAME>_<SAMPLE_INDEX>_<LANE_ID>_<READ_NUMBER>_001.fastq.gz
For example:
SK1-control_S11_L003_R1_001.fastq.gz
SK1-control is the sample name
S11 is the sample index; it’s always of the form S<NUMBER> and is unique to each sample
L003 is the lane ID; it’s always of the form L<NUMBER> and identifies the lane that the reads in the Fastq came from
R1 is the read number; paired-end runs will have a pair of R1 and R2 Fastqs. Read numbers of the form I1 are index reads
If the Fastq was generated without lane-splitting then the lane ID component will be missing from the name and the file will contain reads from all lanes the sample was run in; for example:
SK1-control_S11_R1_001.fastq.gz
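As an illustration of the Fastq naming structure, the following Python sketch splits a Fastq file name into its components and treats the lane ID as optional. It is not the parser used by auto_process: FASTQ_NAME_PATTERN and split_fastq_name are hypothetical names, and the sketch assumes that read numbers always start with R or I and that the file name ends with _001.fastq.gz.

```python
import re

# Hypothetical pattern for
# <SAMPLE_NAME>_<SAMPLE_INDEX>_<LANE_ID>_<READ_NUMBER>_001.fastq.gz
# The lane ID group is optional to cover Fastqs generated without
# lane-splitting
FASTQ_NAME_PATTERN = re.compile(
    r"^(?P<sample_name>.+)_"
    r"(?P<sample_index>S\d+)_"
    r"(?:(?P<lane_id>L\d+)_)?"
    r"(?P<read_number>[RI]\d+)_"
    r"001\.fastq\.gz$"
)

def split_fastq_name(fastq_name):
    """Split a bcl2fastq-style Fastq file name into its components.

    Returns a dictionary of components (lane_id is None when the
    name has no lane ID), or None if the name doesn't match.
    """
    match = FASTQ_NAME_PATTERN.match(fastq_name)
    return match.groupdict() if match else None
```

For SK1-control_S11_L003_R1_001.fastq.gz this gives the sample name SK1-control, sample index S11, lane ID L003 and read number R1; for SK1-control_S11_R1_001.fastq.gz the lane ID is None.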
Run IDs and run reference IDs
Within the auto_process package runs can be identified by
automatically generated run IDs of the general form:
PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER[.ANALYSIS_NUMBER]
where:
PLATFORM identifies the sequencer platform and is always uppercased (e.g. NOVASEQ6000, MISEQ, etc)
DATESTAMP is the YYMMDD datestamp from the run name (e.g. 140701)
INSTRUMENT_RUN_NUMBER is the run number that forms part of the run name directory (e.g. for 140701_SN0123_0045_000000000-A1BCD it would be 45)
FACILITY_RUN_NUMBER is the run number that has been assigned by the facility
ANALYSIS_NUMBER is an optional arbitrary number that can be assigned to different analyses of the same run
For example:
NOVASEQ6000_230419/73#22
is a NovaSeq 6000 sequencer run with datestamp 230419,
instrument run number 73 and facility run number 22.
Typically the instrument run number for a run is the same as the number assigned by the facility; in these cases conventionally it is omitted and only the facility run number is used, for example:
NOVASEQ6000_230419#22
The special cases are handled as follows:
If the platform isn’t recognised or supplied then the instrument name is used instead (e.g. SN0123_140701/242#22)
If the run name can’t be split into components then the general form will be [PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER] depending on whether platform and/or facility run number have been supplied (e.g. for a run called rag_05_2017 the run ID might look like rag_05_2017#90 or MISEQ_rag_05_2017#90)
If an analysis number is assigned then the example run ID will look like:
NOVASEQ6000_230419#22.2
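The general form of the run ID can be sketched in Python as below. This is an illustration only and not the implementation in the auto_process package: build_run_id and its parameters are hypothetical names, and the sketch covers only the general form (it does not handle the special cases of an unrecognised platform or a run name that can't be split into components).

```python
def build_run_id(platform, datestamp, facility_run_number,
                 instrument_run_number=None, analysis_number=None):
    """Assemble a run ID of the general form:

    PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER[.ANALYSIS_NUMBER]
    """
    # The platform is always uppercased
    run_id = f"{platform.upper()}_{datestamp}"
    # By convention the instrument run number is omitted when it is
    # the same as the facility run number
    if (instrument_run_number is not None
            and instrument_run_number != facility_run_number):
        run_id += f"/{instrument_run_number}"
    run_id += f"#{facility_run_number}"
    # The analysis number is optional
    if analysis_number is not None:
        run_id += f".{analysis_number}"
    return run_id
```

For instance, build_run_id("novaseq6000", "230419", 22, instrument_run_number=73) reproduces NOVASEQ6000_230419/73#22, and build_run_id("novaseq6000", "230419", 22, analysis_number=2) reproduces NOVASEQ6000_230419#22.2.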
Run reference IDs are based on the run ID with additional arbitrary elements appended, i.e.:
RUNID[_EXTRAINFO]
Currently the following additional elements may appear if defined for the run:
Flow cell mode
For example:
NOVASEQ6000_230419#73_SP