********
Overview
********

The core utility ``auto_process.py`` implements pipeline stages as
subcommands. For a standard sequencing run the workflow looks like:

* :doc:`auto_process.py setup ` initialises a new analysis directory
  for processing a sequencing run.
* :doc:`auto_process.py make_fastqs ` demultiplexes the raw sequencer
  output (BCL files) into Fastq files. It is a wrapper around Illumina's
  ``bcl2fastq`` (for 10xGenomics single cell data it wraps
  ``cellranger``) with additional functionality to fetch data, detect
  errors, analyse barcodes, generate statistics, and handle special
  cases.
* :doc:`auto_process.py setup_analysis_dirs ` sets up individual
  project directories for each of the projects within the sequencing
  run.
* :doc:`auto_process.py run_qc ` runs the QC pipeline (including
  ``fastqc`` and ``fastq_screen``) on each of the project directories
  and generates a report for each one.
* :doc:`auto_process.py publish_qc ` publishes the QC reports from
  ``make_fastqs`` and ``run_qc`` to a web-accessible location to be
  viewed by core facility staff and bioinformaticians.
* :doc:`auto_process.py archive ` copies the analysis directory and
  contents from the working area to the archive storage area, for
  subsequent bioinformatics analysis.
* :doc:`auto_process.py report ` generates reports on analysis
  directories in various formats.

Additional helper commands are available:

* :ref:`auto_process.py samplesheet ` checks the supplied samplesheet
  file for errors and allows it to be edited.
* :ref:`auto_process.py readme ` creates a ``README`` text file for
  notes.
* :ref:`auto_process.py metadata ` allows the metadata associated with
  an analysis directory to be viewed and updated.
* :ref:`auto_process.py analyse_barcodes ` (re)analyses the barcode
  sequences in the Fastqs produced by ``make_fastqs``.
* :ref:`auto_process.py merge_fastq_dirs ` merges together data outputs
  from multiple ``make_fastqs`` commands.
* :ref:`auto_process.py update_fastq_stats ` regenerates the statistics
  and processing reports.

===========
Terminology
===========

This documentation refers to various concepts such as *analysis
directories*, *analysis projects*, *sequencing runs* and so on. We
define these as follows:

* **(Sequencing) run**: the primary sequencing data (i.e. BCL files)
  produced by a run of a sequencer
* **Analysis directory**: the top-level directory which holds the
  outputs from ``auto_process`` when processing data from a sequencing
  run
* **Analysis project**: a subdirectory of an analysis directory which
  holds the Fastqs and QC outputs for a set of samples (aka a
  *project*) from the sequencing run; there can be multiple analysis
  projects within a single analysis directory

Additionally the terms *data source*, *working area*, *QC server* and
*archive storage* refer to components of the compute infrastructure
where the processing takes place:

* **Data source**: location where the data from the sequencing run are
  located
* **Working area**: location where the processing is performed, and
  which holds the analysis directory and contents during the processing
* **QC server**: location where QC reports are published to, and which
  can be accessed via a web server
* **Archive storage**: location where the final outputs (i.e. the
  analysis directory and projects) are copied to and stored once
  processing is completed, for subsequent analysis by the
  bioinformaticians

These can all be on the same filesystem on a single machine; or one or
more parts can be NFS filesystems, or even filesystems mounted on other
machines which are accessed using ``ssh``.

=============================
Supported sequencer platforms
=============================

The pipeline is currently used for output from the following Illumina
sequencers:

* NovaSeq 6000
* HISeq 4000
* MISeq
* NextSeq
* MiniSeq
* iSeq

Earlier versions have been used on GAIIx and HISeq 2000/2500.


=========================================
Supported single cell & spatial platforms
=========================================

To varying degrees the pipeline supports handling data from a number of
single cell platforms:

* :doc:`Handling 10x Genomics single cell data `
* :doc:`Handling Parse single cell data `
* :doc:`Handling BioRad single cell data `

The pipeline also has limited support for the following spatial
platforms:

* :doc:`Handling 10x Genomics Visium data `

.. _run_and_fastq_naming_conventions:

================================
Run and Fastq naming conventions
================================

Sequencing runs on Illumina sequencers produce output directories with
the following naming structure:

::

    <DATESTAMP>_<INSTRUMENT_ID>_<RUN_NUMBER>_<FLOWCELL_ID>

For example:

::

    181026_NB100234_0021_ABCDYHBGX7

* ``181026`` is the datestamp (i.e. 26th October 2018). Some sequencers
  may use a four-digit year in the datestamp (e.g. ``20181026``)
* ``NB100234`` is the instrument ID, which uniquely identifies the
  sequencing instrument which produced the data
* ``0021`` is the instrument run number
* ``ABCDYHBGX7`` is the ID of the flowcell used in the run

Fastq files generated by ``bcl2fastq`` have the following naming
structure:

::

    <SAMPLE_NAME>_<SAMPLE_INDEX>_<LANE_ID>_<READ_NUMBER>_001.fastq.gz

For example:

::

    SK1-control_S11_L003_R1_001.fastq.gz

* ``SK1-control`` is the sample name
* ``S11`` is the sample index; it's always of the form ``S`` followed
  by a number, and is unique to each sample
* ``L003`` is the lane ID; it's always of the form ``L`` followed by a
  number, and identifies the lane that the reads in the Fastq came from
* ``R1`` is the read number; paired-end runs will have a pair of ``R1``
  and ``R2`` Fastqs.
  Read numbers of the form ``I1`` are index reads.

If the Fastq was generated without lane-splitting then the lane ID
component will be missing from the name and the file will contain reads
from all lanes the sample was run in; for example:

::

    SK1-control_S11_R1_001.fastq.gz

=============================
Run IDs and run reference IDs
=============================

Within the ``auto_process`` package runs can be identified by
automatically generated **run IDs** of the general form:

::

    PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER[.ANALYSIS_NUMBER]

where:

* ``PLATFORM`` identifies the sequencer platform and is always
  uppercased (e.g. ``NOVASEQ6000``, ``MISEQ``, etc)
* ``DATESTAMP`` is the ``YYMMDD`` datestamp from the run name (e.g.
  ``140701``)
* ``INSTRUMENT_RUN_NUMBER`` is the run number that forms part of the
  run name directory (e.g. for ``140701_SN0123_0045_000000000-A1BCD``
  it would be ``45``)
* ``FACILITY_RUN_NUMBER`` is the run number that has been assigned by
  the facility
* ``ANALYSIS_NUMBER`` is an optional arbitrary number that can be
  assigned to different analyses of the same run

For example:

::

    NOVASEQ6000_230419/73#22

is a NovaSeq 6000 sequencer run with datestamp ``230419``, instrument
run number ``73`` and facility run number ``22``.

Typically the instrument run number for a run is the same as the number
assigned by the facility; in these cases it is conventionally omitted
and only the facility run number is used, for example:

::

    NOVASEQ6000_230419#22

Special cases are handled as follows:

* If the platform isn't recognised or hasn't been supplied then the
  instrument name is used instead (e.g. ``SN0123_140701/242#22``)
* If the run name can't be split into components then the general form
  will be ``[PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]`` depending on
  whether platform and/or facility run number have been supplied (e.g.
  for a run called ``rag_05_2017`` the run ID might look like
  ``rag_05_2017#90`` or ``MISEQ_rag_05_2017#90``)

If an analysis number is assigned then the example run ID will look
like:

::

    NOVASEQ6000_230419#22.2

**Run reference IDs** are based on the run ID with additional arbitrary
elements appended, i.e.:

::

    RUNID[_EXTRAINFO]

Currently the following additional elements may appear if defined for
the run:

* Flow cell mode

For example:

::

    NOVASEQ6000_230419#73_SP
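
The run ID rules above can be sketched in a few lines of Python. This is
an illustrative sketch only, not the code used by ``auto_process``: the
function names ``split_run_name``, ``make_run_id`` and
``make_run_reference_id`` are hypothetical, and two behaviours are
assumptions that the text does not specify (trimming a four-digit year
to ``YYMMDD``, and what happens when no facility run number is given for
a run name that can be split).

```python
import re

# Hypothetical helpers for illustration; not the auto_process package's API.
RUN_NAME_PATTERN = re.compile(
    r"^(?P<datestamp>\d{6}|\d{8})_(?P<instrument>[^_]+)"
    r"_(?P<run_number>\d+)_(?P<flowcell>.+)$"
)


def split_run_name(run_name):
    """Return the components of an Illumina run name, or None if it doesn't fit."""
    match = RUN_NAME_PATTERN.match(run_name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["run_number"] = int(parts["run_number"])  # e.g. "0045" -> 45
    return parts


def make_run_id(run_name, platform=None, facility_run_number=None,
                analysis_number=None):
    """Build a run ID following the rules described in this section."""
    parts = split_run_name(run_name)
    if parts is None:
        # Fallback form: [PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]
        run_id = f"{platform.upper()}_{run_name}" if platform else run_name
        if facility_run_number is not None:
            run_id += f"#{facility_run_number}"
        return run_id
    # Platform is uppercased; the instrument name stands in if it is missing
    prefix = platform.upper() if platform else parts["instrument"]
    # Assumption: a four-digit year is trimmed so the datestamp is YYMMDD
    datestamp = parts["datestamp"][-6:]
    instrument_run_number = parts["run_number"]
    run_id = f"{prefix}_{datestamp}"
    if facility_run_number is None:
        # Assumption: with no facility number only the instrument part is kept
        run_id += f"/{instrument_run_number}"
    elif facility_run_number != instrument_run_number:
        run_id += f"/{instrument_run_number}#{facility_run_number}"
    else:
        # Matching numbers: the instrument run number is omitted
        run_id += f"#{facility_run_number}"
    if analysis_number is not None:
        run_id += f".{analysis_number}"
    return run_id


def make_run_reference_id(run_id, flow_cell_mode=None):
    """Append the optional extra element (flow cell mode) to a run ID."""
    return f"{run_id}_{flow_cell_mode}" if flow_cell_mode else run_id
```

With these definitions, calling ``make_run_id`` on
``140701_SN0123_0045_000000000-A1BCD`` with no platform and a facility
run number of ``22`` gives ``SN0123_140701/45#22``, while the run name
``rag_05_2017`` falls through to the ``[PLATFORM_]RUN_NAME`` form because
it cannot be split into components.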