Generating QC reports using `auto_process run_qc`

Overview

The run_qc command is used to run the QC pipeline on the Fastqs for each project within the analysis directory, and generate a report for each one.

The general invocation of the command is:

auto_process.py run_qc *options* [ANALYSIS_DIR]

There is also a standalone utility called run_qc.py which can be used to run the QC pipeline on an arbitrary subdirectory.

QC protocols

The QC pipeline protocol used for each project will differ slightly depending on the nature of the data within that project:

QC protocol	Description
`standardSE`	Standard single-end data (R1 Fastqs only)
`standardPE`	Standard paired-end data (R1/R2 Fastq pairs)
`10x_scRNAseq`	10xGenomics single cell RNA-seq
`10x_snRNAseq`	10xGenomics single nuclei RNA-seq
`10x_scATAC`	10xGenomics single cell ATAC-seq
`10x_Multiome_GEX`	10xGenomics single cell multiome gene expression data
`10x_Multiome_ATAC`	10xGenomics single cell multiome ATAC-seq data
`10x_CellPlex`	10xGenomics CellPlex cell multiplexing data
`10x_Flex`	10xGenomics fixed RNA profiling (Flex) data
`10x_ImmuneProfiling`	10xGenomics single cell immune profiling data
`10x_Visium`	10xGenomics Visium spatial RNA-seq
`10x_Visium_FFPE`	10xGenomics Visium FFPE spatial RNA-seq/GEX
`10x_Visium_FFPE_PEX`	10xGenomics Visium FFPE spatial PEX
`ParseEvercode`	Parse Biosciences Evercode data
`singlecell`	ICELL8 single cell RNA-seq
`ICELL8_scATAC`	ICELL8 single cell ATAC-seq

The protocol is determined automatically for each project, based on the metadata.

In turn each protocol defines a set of “QC modules” that are run by the QC pipeline, as well as “sequence data reads” (reads that contain biological data) and “index data reads” (reads that contain index data).

QC modules

The following QC modules are available within the QC pipeline:

QC module	Details
`fastqc`	Runs fastqc for general quality metrics
`fastq_screen`	Runs fastq_screen for a set of genome indexes, to verify sequences correspond to the expected organism (and check for contaminants)
`sequence_lengths`	Examines distribution of sequence lengths and checks for padding and masking
`strandedness`	Runs fastq_strand on the appropriate reads to indicate the strand-specificity of the sequence data (requires appropriate `STAR` indexes)
`picard_insert_size_metrics`	Runs picard `CollectInsertSizeMetrics` on paired data to determine insert sizes
`rseqc_genebody_coverage`	Runs rseqc `geneBody_coverage.py` to generate gene body coverage plot for all samples
`qualimap_rnaseq`	Runs qualimap `rnaseq` to generate various metrics including coverage and genomic origin of reads
`cellranger_count`	Single library analysis for each sample using cellranger `count`
`cellranger-atac_count`	Single library analysis for each sample using cellranger_atac `count`
`cellranger-arc_count`	Single cell multiome analysis using cellranger_arc `count` (requires 10x_multiome_libraries.info)
`cellranger_multi`	Cell multiplexing and fixed RNA profiling analyses using cellranger `multi` (requires 10x_multi_config.csv)

Appropriate reference data must be available (for example, STAR indexes or 10x Genomics reference datasets), and certain additional files (noted in the QC module descriptions above) may be required for some of the QC modules to run; typically the modules are skipped when appropriate reference data is not available.

On successful completion of the pipeline an HTML report is generated for each project; these are described in QC reports. By default multiQC is also run as part of the reporting.

If a QC server has been specified in the configuration then the reports can be copied there for sharing using the publish_qc command.

Note

The QC pipeline can be run outside of the auto_process pipeline by using the run_qc.py utility; see the section on running the QC standalone.

Biological versus non-biological data samples

For some types of dataset, not all samples contain biological data. For example, 10x Genomics CellPlex datasets also have “multiplexing capture” samples, which contain feature barcodes.

Biological and non-biological data samples may be identified implicitly (for example, by using the library type information in the 10x_multi_config.csv file for CellPlex datasets). Alternatively samples with biological data can be explicitly defined in the Biological samples field of the README.info metadata file in the analysis project directory, as a comma-separated list of sample names. For example:

Biological samples    SMPL1,SMPL2

Samples in the project which are not in this list are treated as containing non-biological data; if no samples are listed then all samples are assumed to contain biological data.

When biological and non-biological samples are differentiated, the pipeline will only run and report “mapped” metrics (for example, screens, strandedness and gene body coverage) for the biological samples; these metrics will be omitted for non-biological samples (even if they have been specified as part of the QC protocol).
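The rule for the Biological samples field can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function name `split_samples` and its arguments are hypothetical, and it only assumes that the field value is a comma-separated list of sample names, as described above.

```python
def split_samples(all_samples, biological_field):
    """Partition sample names into (biological, non_biological).

    `biological_field` is the raw value of the 'Biological samples'
    field (a comma-separated list of names, possibly empty).
    """
    listed = [s.strip() for s in biological_field.split(",") if s.strip()]
    if not listed:
        # No samples listed: all samples are assumed to be biological
        return list(all_samples), []
    listed_set = set(listed)
    biological = [s for s in all_samples if s in listed_set]
    # Anything not in the list is treated as non-biological
    non_biological = [s for s in all_samples if s not in listed_set]
    return biological, non_biological


# Hypothetical project with three samples, two listed as biological
bio, non_bio = split_samples(["SMPL1", "SMPL2", "SMPL3"], "SMPL1,SMPL2")
print(bio, non_bio)
```

In this sketch only `SMPL1` and `SMPL2` would receive the “mapped” metrics, while `SMPL3` would be treated as a non-biological sample.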

QC metrics using subsequences in reads

Some metrics can be applied to subsequences within reads (rather than the whole sequence), if a range of bases is defined within the protocol.

Specifically, where subsequences are specified for sequence data reads (i.e. reads containing biological data) then the “mapped” QC modules (for example, FastqScreen or strandedness) will only use those subsequences.

(See the QC protocol specification documentation for the subsequence specification syntax.)

Per-lane QC metrics for undetermined Fastqs

By default, when handling Fastqs for the undetermined project, the pipeline runs in a mode whereby it generates copies of the input Fastqs for each lane that appears in the read headers of each Fastq, and then runs the QC on those per-lane Fastqs (rather than the originals, which are not changed).

This results in per-lane QC metrics, which can be useful for diagnostic purposes when handling Fastqs which have been merged across multiple lanes (for example, to determine whether contamination is confined to a single lane).

The behaviour is controlled by the split_undetermined_fastqs setting in the qc section of the configuration file (see QC pipeline configuration).
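The idea of splitting by lane can be sketched in a few lines of Python. This is an illustration only, not the pipeline's implementation: the function name `split_by_lane` is hypothetical, and the sketch assumes standard Illumina read headers of the form `@instrument:run:flowcell:lane:tile:x:y`, where the lane number is the fourth colon-separated field.

```python
from collections import defaultdict


def split_by_lane(fastq_lines):
    """Group 4-line Fastq records by the lane in each read header."""
    per_lane = defaultdict(list)
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Step through complete 4-line records (any trailing partial
    # record is ignored in this sketch)
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        # Header: @instrument:run:flowcell:lane:tile:x:y [comment]
        lane = int(record[0].split(" ")[0].split(":")[3])
        per_lane[lane].append(record)
    return dict(per_lane)


# Two made-up reads from lanes 1 and 2 of the same flow cell
reads = [
    "@M1:1:FC:1:1101:100:200 1:N:0:ACGT", "ACGT", "+", "IIII",
    "@M1:1:FC:2:1101:100:201 1:N:0:ACGT", "TTTT", "+", "IIII",
]
lanes = split_by_lane(reads)
print(sorted(lanes))
```

Each key in the returned dictionary corresponds to one per-lane set of reads, which could then be written out and assessed separately.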

Including external (non-pipeline) outputs

It is possible to include links within the QC reports to additional output files produced outside of the pipeline, by specifying them within an extra_outputs.tsv file.

If this file is present in the QC directory being reported then any additional external files which it specifies will be linked from the “Extra outputs” section in the report. The specified files will also be included in the ZIP archive.

Note

The additional files must also be within the QC directory.

Each line in the TSV file specifies an output file to include, with the minimal specification being:

FILE_PATH     DESCRIPTION

For example:

extra_metrics/metrics.html    Manually generated QC metrics

If the output has additional associated files or directories (for example, the main output is an index file that then links to other files) then these can be included as a comma-separated list via an optional third field in the TSV file:

FILE_PATH     DESCRIPTION     PATH1[,PATH2[,...]]

For example:

more_metrics/index.html    More QC metrics    more_metrics/results

The additional files will then be included in the ZIP archive.

Note

If any additional “files” are actually subdirectories then the contents of those directories will also be included automatically.
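As an illustration of the file layout described above, the following Python sketch reads the lines of an extra_outputs.tsv file into simple records. It is not the pipeline's own parser: the function name `parse_extra_outputs` is hypothetical, and the sketch assumes that fields are separated by single tab characters and that blank lines can be ignored.

```python
def parse_extra_outputs(lines):
    """Parse extra_outputs.tsv lines into a list of dictionaries."""
    outputs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: blank lines carry no information
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected FILE_PATH and DESCRIPTION: %r" % line)
        additional = []
        if len(fields) > 2:
            # Optional third field: comma-separated associated paths
            additional = [p.strip() for p in fields[2].split(",") if p.strip()]
        outputs.append({
            "file_path": fields[0],
            "description": fields[1],
            "additional_files": additional,
        })
    return outputs


# The two example lines from this section, with explicit tabs
example = [
    "extra_metrics/metrics.html\tManually generated QC metrics\n",
    "more_metrics/index.html\tMore QC metrics\tmore_metrics/results\n",
]
for entry in parse_extra_outputs(example):
    print(entry["file_path"], entry["additional_files"])
```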

Additional options

For 10xGenomics datasets, the following options can be used to override the defaults defined in the configuration:

--cellranger: explicitly sets the path to cellranger (or other appropriate 10xGenomics package)
--10x_force_cells: explicitly specify the number of cells, overriding automatic cell detection algorithms (i.e. set the --force-cells option for CellRanger)
--10x_extra_projects: specify additional project directories to fetch Fastqs from when running single library analyses (i.e. add the Fastq directory paths for each project to the --fastqs option for CellRanger)
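A command line using these options could be assembled as in the following Python sketch. The helper `build_run_qc_command` is hypothetical, as are the path and cell count in the example; the sketch also assumes that each option takes its value as the following argument. The --10x_extra_projects option is left out because the format of its value is not described here.

```python
def build_run_qc_command(analysis_dir=None, cellranger=None, force_cells=None):
    """Build an argument list for 'auto_process.py run_qc'."""
    cmd = ["auto_process.py", "run_qc"]
    if cellranger is not None:
        # Override the configured cellranger path
        cmd.extend(["--cellranger", cellranger])
    if force_cells is not None:
        # Override automatic cell detection
        cmd.extend(["--10x_force_cells", str(force_cells)])
    if analysis_dir is not None:
        # ANALYSIS_DIR is optional in the general invocation
        cmd.append(analysis_dir)
    return cmd


# Hypothetical values, for illustration only
cmd = build_run_qc_command(
    analysis_dir="my_analysis_dir",
    cellranger="/opt/cellranger/bin/cellranger",
    force_cells=5000,
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` if the command needs to be launched from a script.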
