Generating QC reports using auto_process run_qc

Overview

The run_qc command is used to run the QC pipeline on the Fastqs for each project within the analysis directory, and generate a report for each one.

The general invocation of the command is:

auto_process.py run_qc [OPTIONS] [ANALYSIS_DIR]

There is also a standalone utility called run_qc.py which can be used to run the QC pipeline on an arbitrary subdirectory.

QC protocols

The QC pipeline protocol used for each project will differ slightly depending on the nature of the data within that project:

QC protocol           Description
standardSE            Standard single-end data (R1 Fastqs only)
standardPE            Standard paired-end data (R1/R2 Fastq pairs)
10x_scRNAseq          10xGenomics single cell RNA-seq
10x_snRNAseq          10xGenomics single nuclei RNA-seq
10x_scATAC            10xGenomics single cell ATAC-seq
10x_Multiome_GEX      10xGenomics single cell multiome gene expression data
10x_Multiome_ATAC     10xGenomics single cell multiome ATAC-seq data
10x_CellPlex          10xGenomics CellPlex cell multiplexing data
10x_Flex              10xGenomics fixed RNA profiling (Flex) data
10x_ImmuneProfiling   10xGenomics single cell immune profiling data
10x_Visium            10xGenomics Visium spatial RNA-seq
10x_Visium_FFPE       10xGenomics Visium FFPE spatial RNA-seq/GEX
10x_Visium_FFPE_PEX   10xGenomics Visium FFPE spatial PEX
ParseEvercode         Parse Biosciences Evercode data
singlecell            ICELL8 single cell RNA-seq
ICELL8_scATAC         ICELL8 single cell ATAC-seq

The protocol is determined automatically for each project, based on the metadata.

In turn each protocol defines a set of “QC modules” that are run by the QC pipeline, as well as “sequence data reads” (reads that contain biological data) and “index data reads” (reads that contain index data).

QC modules

The following QC modules are available within the QC pipeline:

QC module                    Details
fastqc                       Runs fastqc for general quality metrics
fastq_screen                 Runs fastq_screen for a set of genome indexes, to verify sequences correspond to the expected organism (and check for contaminants)
sequence_lengths             Examines distribution of sequence lengths and checks for padding and masking
strandedness                 Runs fastq_strand on the appropriate reads to indicate the strand-specificity of the sequence data (requires appropriate STAR indexes)
picard_insert_size_metrics   Runs picard CollectInsertSizeMetrics on paired data to determine insert sizes
rseqc_genebody_coverage      Runs rseqc geneBody_coverage.py to generate gene body coverage plot for all samples
qualimap_rnaseq              Runs qualimap rnaseq to generate various metrics including coverage and genomic origin of reads
cellranger_count             Single library analysis for each sample using cellranger count
cellranger-atac_count        Single library analysis for each sample using cellranger-atac count
cellranger-arc_count         Single cell multiome analysis using cellranger-arc count (requires 10x_multiome_libraries.info)
cellranger_multi             Cell multiplexing and fixed RNA profiling analyses using cellranger multi (requires 10x_multi_config.csv)

Appropriate reference data must be available (for example, STAR indexes or 10x Genomics reference datasets), and some QC modules also require the additional files noted in the descriptions above. Modules are typically skipped when the appropriate reference data is not available.

On successful completion of the pipeline an HTML report is generated for each project; these are described in QC reports. By default multiQC is also run as part of the reporting.

If a QC server has been specified in the configuration then the reports can be copied there for sharing using the publish_qc command.

Note

The QC pipeline can be run outside of the auto_process pipeline by using the run_qc.py utility; see the section on running the QC standalone.

Biological versus non-biological data samples

For some types of dataset (e.g. 10x Genomics CellPlex data), not all samples in the dataset contain biological data (for example, CellPlex datasets also have “multiplexing capture” samples which contain feature barcodes).

Biological and non-biological data samples may be identified implicitly (for example, by using the library type information in the 10x_multi_config.csv file for CellPlex datasets). Alternatively samples with biological data can be explicitly defined in the Biological samples field of the README.info metadata file in the analysis project directory, as a comma-separated list of sample names. For example:

Biological samples    SMPL1,SMPL2

Samples in the project which are not in this list are treated as containing non-biological data; if no samples are listed then all samples are assumed to contain biological data.
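The rule above can be sketched in Python. This is only an illustration: it assumes that README.info stores each field as a tab-separated key and value on one line, and the function names (parse_biological_samples, classify_samples) and the SMPL3_CML sample name are hypothetical, not part of the pipeline.

```python
def parse_biological_samples(readme_text):
    """Return the names listed in the 'Biological samples' field.

    Assumes one tab-separated 'KEY<TAB>VALUE' pair per line; returns an
    empty list if the field is missing or has no names.
    """
    for line in readme_text.splitlines():
        key, _, value = line.partition("\t")
        if key.strip() == "Biological samples":
            return [name.strip() for name in value.split(",") if name.strip()]
    return []


def classify_samples(all_samples, biological):
    """Split sample names into (biological, non_biological) lists."""
    if not biological:
        # No samples listed: all samples are assumed to be biological
        return list(all_samples), []
    bio = [s for s in all_samples if s in biological]
    non_bio = [s for s in all_samples if s not in biological]
    return bio, non_bio


readme = "Biological samples\tSMPL1,SMPL2\n"
listed = parse_biological_samples(readme)
bio, non_bio = classify_samples(["SMPL1", "SMPL2", "SMPL3_CML"], listed)
print(bio)      # ['SMPL1', 'SMPL2']
print(non_bio)  # ['SMPL3_CML']
```

With an empty or missing field, classify_samples returns every sample in the first list, matching the default described above.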

When biological and non-biological samples are differentiated, the pipeline will only run and report “mapped” metrics (for example screens, strandedness, gene body coverage etc) for the biological samples; these metrics will be omitted for non-biological samples (even if they have been specified as part of the QC protocol).

QC metrics using subsequences in reads

Some metrics can be applied to subsequences within reads (rather than the whole sequence), if a range of bases is defined within the protocol.

Specifically, where subsequences are specified for sequence data reads (i.e. reads containing biological data) then the “mapped” QC modules (for example, FastqScreen or strandedness) will only use those subsequences.

(See the QC protocol specification documentation for the subsequence specification syntax.)

Per-lane QC metrics for undetermined Fastqs

By default, when handling Fastqs for the undetermined project, the pipeline runs in a mode whereby it generates copies of the input Fastqs for each lane that appears in the read headers of each Fastq, and then runs the QC on those per-lane Fastqs (rather than the originals, which are not changed).

This results in per-lane QC metrics, which can be useful for diagnostic purposes when handling Fastqs which have been merged across multiple lanes (for example, to determine whether contamination is confined to a single lane).

The behaviour is controlled by the split_undetermined_fastqs setting in the qc section of the configuration file (see QC pipeline configuration).
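The idea behind the per-lane split can be illustrated with a short Python sketch. It is not the pipeline's implementation: it assumes Illumina-style read headers of the form @INSTRUMENT:RUN:FLOWCELL:LANE:TILE:X:Y, works on an in-memory list of lines rather than files, and the function names and example reads are made up for illustration.

```python
from collections import defaultdict


def lane_from_header(header):
    """Extract the lane number from an Illumina-style read header.

    Assumes the format '@INSTRUMENT:RUN:FLOWCELL:LANE:TILE:X:Y ...',
    where the lane is the fourth colon-separated field.
    """
    read_id = header.split()[0]
    return int(read_id.split(":")[3])


def split_by_lane(fastq_lines):
    """Group 4-line Fastq records by lane, leaving the input unchanged."""
    per_lane = defaultdict(list)
    for i in range(0, len(fastq_lines), 4):
        record = fastq_lines[i:i + 4]
        per_lane[lane_from_header(record[0])].extend(record)
    return dict(per_lane)


# Hypothetical reads from a Fastq merged across lanes 1 and 2
reads = [
    "@M00001:12:000000000-ABCDE:1:1101:15589:1332 1:N:0:NNNNNN", "ACGT", "+", "IIII",
    "@M00001:12:000000000-ABCDE:2:1101:15590:1400 1:N:0:NNNNNN", "TTGA", "+", "IIII",
    "@M00001:12:000000000-ABCDE:1:1101:15591:1500 1:N:0:NNNNNN", "GGCC", "+", "IIII",
]
lanes = split_by_lane(reads)
print(sorted(lanes))  # [1, 2]
```

Each per-lane group can then be analysed separately, which is what makes lane-specific problems visible.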

Including external (non-pipeline) outputs

It is possible to include links within the QC reports to additional output files produced outside of the pipeline, by specifying them within an extra_outputs.tsv file.

If this file is present in the QC directory being reported then any additional external files which it specifies will be linked from the “Extra outputs” section in the report. The specified files will also be included in the ZIP archive.

Note

The additional files must also be within the QC directory.

Each line in the TSV file specifies an output file to include, with the minimal specification being:

FILE_PATH     DESCRIPTION

For example:

extra_metrics/metrics.html    Manually generated QC metrics

If the output has additional associated files or directories (for example, the main output is an index file that then links to other files) then these can be included as a comma-separated list via an optional third field in the TSV file:

FILE_PATH     DESCRIPTION     PATH1[,PATH2[,...]]

For example:

more_metrics/index.html    More QC metrics    more_metrics/results

The additional files will then be included in the ZIP archive.

Note

If any additional “files” are actually subdirectories then the contents of those directories will also be included automatically.
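As an illustration of the file layout described in this section, the following Python sketch reads the two- and three-field forms of a line. It assumes the fields are separated by single tab characters; the ExtraOutput type and parse_extra_outputs function are hypothetical names, not part of the pipeline.

```python
from typing import List, NamedTuple


class ExtraOutput(NamedTuple):
    file_path: str
    description: str
    additional_paths: List[str]


def parse_extra_outputs(text):
    """Parse the contents of an extra_outputs.tsv file.

    Each non-blank line is 'FILE_PATH<TAB>DESCRIPTION', optionally
    followed by a third field with comma-separated additional paths.
    """
    outputs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # ignore lines without a description
        extra = fields[2].split(",") if len(fields) > 2 and fields[2] else []
        outputs.append(ExtraOutput(fields[0], fields[1], extra))
    return outputs


tsv = (
    "extra_metrics/metrics.html\tManually generated QC metrics\n"
    "more_metrics/index.html\tMore QC metrics\tmore_metrics/results\n"
)
for output in parse_extra_outputs(tsv):
    print(output.file_path, output.additional_paths)
```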

Additional options

For 10xGenomics datasets, the following options can be used to override the defaults defined in the configuration:

  • --cellranger: explicitly sets the path to the cellranger executable (or that of another appropriate 10xGenomics package)

  • --10x_force_cells: explicitly specify the number of cells, overriding automatic cell detection algorithms (i.e. set the --force-cells option for CellRanger)

  • --10x_extra_projects: specify additional project directories to fetch Fastqs from when running single library analyses (i.e. add the Fastq directory paths for each project to the --fastqs option for CellRanger)
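When run_qc is launched from a script, the command line can be assembled programmatically. The sketch below only builds and prints the command, and does not run it. The helper name build_run_qc_command, the paths and the cell count are hypothetical, and passing each option value as a separate argument is an assumption about how the options are parsed.

```python
import shlex


def build_run_qc_command(analysis_dir=None, cellranger=None, force_cells=None):
    """Build an argument list for 'auto_process.py run_qc'.

    Only options named in this section are used; anything left as None
    is omitted so that the configured defaults apply.
    """
    cmd = ["auto_process.py", "run_qc"]
    if cellranger is not None:
        cmd.extend(["--cellranger", cellranger])
    if force_cells is not None:
        cmd.extend(["--10x_force_cells", str(force_cells)])
    if analysis_dir is not None:
        cmd.append(analysis_dir)
    return cmd


# Hypothetical paths and cell count, for illustration only
cmd = build_run_qc_command(
    analysis_dir="/data/analysis_dir",
    cellranger="/opt/cellranger/bin/cellranger",
    force_cells=5000,
)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run once the paths have been replaced with real ones.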