QC protocol specification

Overview

Different QC protocol definitions within auto_process are used to run the appropriate metrics for different types of data within the QC pipeline, by specifying which metrics should be used on which reads within a dataset.

Each protocol is defined by the following attributes:

Attribute

Description

Name

Short name for the protocol

Description

Longer free-text description

Sequence data reads

Reads with biological data

Index data reads

Reads with index data

QC modules

List of QC modules and arguments

Protocol specifications

The attributes for a protocol can be specified using a string of the form:

NAME:DESCRIPTION:seq_reads=SEQ_READS:index_reads=INDEX_READS:qc_modules=QC_MODULES

where:

Field

Description

Example

NAME

Protocol name

standardPE

DESCRIPTION

Description text

Standard paired-end data

SEQ_READS

List of reads

[r1,r2]

INDEX_READS

List of reads

[r1]

QC_MODULES

List of QC modules

[fastqc,fastq_screen]

For example:

basicQC:'Basic QC protocol':seq_reads=[r1,r2]:index_reads=[]:qc_modules=[sequence_lengths,fastqc]

The name and description must always be the first two fields of a protocol specification. The seq_reads, index_reads and qc_modules fields can appear in any order and can be omitted if empty. The same read ID should not appear in both seq_reads and index_reads.

Protocol names and descriptions

Names cannot contain spaces or colons; descriptions can contain any characters (including spaces).

Optionally, descriptions can be enclosed in matching quotes (either single or double); these should be used if the description includes special characters such as colons or quotes.

Sequence data and index reads

QC protocols distinguish between “sequence data” and “index” reads as follows:

  • Sequence data reads are those reads containing biologically significant data, and are used for mapped metrics;

  • Index reads contain non-biologically significant data (e.g. feature barcodes). Non-mapped metrics are run on both sequence data and index data reads.

Reads can optionally also specify subsequences to use for QC, for example for 10x Genomics Flex data only the first 50 bases of R1 contains the biologically significant data. Subsequences can be specified by ranges attached to reads, for example in this case seq_reads=[r2:1-50], and tell the pipeline to only use these portions of the the reads when running the mapped metrics.

QC modules

QC modules specific which software will be run and which metrics will be produced.

The following QC modules are available:

QC module

Software

Metrics

fastqc

FastQC

General stats and quality information

fastq_screen

FastqQScreen

Screens against panels of organisms

picard_insert_size_metrics

Picard CollectInsertSizeMetrics

Insert sizes

qualimap_rnaseq

Qualimap ‘rnaseq’

Coverage and genomic origin of reads

rseqc_genebody_coverage

RSeQC geneBody_coverage.py

Coverage

sequence_lengths

Built-in

Sequence length and masking stats

strandedness

fastq_strand.py

Strandedness information

cellranger_count

CellRanger ‘count’

Single library analysis

cellranger-atac_count

CellRanger-ATAC ‘count’

Single library analysis

cellranger-arc_count

Cellranger-ARC ‘count’

Single library analysis

cellranger_multi

CellRanger ‘multi’

Multiplexing analysis

Some QC modules allow additional optional arguments to be specified, to modify either the metric or pipeline behaviour, or how the metric validation is performed.

Arguments are specified using the general syntax:

QC_MODULE(ARG1=VALUE1;ARG2=VALUE2;...)

For example:

cellranger_count(set_cell_count=false;set_metadata=false)

The following options are recognised:

Option

QC modules

Description

chemistry=CHEMISTRY

cellranger[*]_count

Explicitly set chemistry when running CellRanger ‘count’ (e.g. ARC-v1)

library=LIBRARY_TYPE

cellranger[*]_count

Explicitly set library type in pipeline tasks (e.g. snRNA-seq)

cellranger_version=VERSION

cellranger[*]_count

Set expected CellRanger version for validation (default is the version identified by the pipeline; use * to match any version)

cellranger_refdata=REFDATA

cellranger[*]_count

Set expected CellRanger reference dataset for validation (default is the reference identified by the pipeline; use * to match any reference dataset)

set_cell_count=true|false

cellranger[*]_count

Whether to use outputs to set the cell count (default is true)

set_metadata=true|false

cellranger[*]_count

Whether to set metadata from this module (default is true)

cellranger_use_multi_config=true|false

cellranger_count

If true then use CellRanger ‘multi’ config file to identify samples with biological data (default is false)