QC protocol specification
=========================

--------
Overview
--------

Different QC protocol definitions within ``auto_process`` are used to
run the appropriate metrics for different types of data within the QC
pipeline, by specifying which metrics should be used on which reads
within a dataset.

Each protocol is defined by the following attributes:

=================== ================================
Attribute           Description
=================== ================================
Name                Short name for the protocol
Description         Longer free-text description
Sequence data reads Reads with biological data
Index data reads    Reads with index data
QC modules          List of QC modules and arguments
=================== ================================

-----------------------
Protocol specifications
-----------------------

The attributes for a protocol can be specified using a string of the
form:

::

    NAME:DESCRIPTION:seq_reads=SEQ_READS:index_reads=INDEX_READS:qc_modules=QC_MODULES

where:

=============== =========================== ============================
Field           Description                 Example
=============== =========================== ============================
``NAME``        Protocol name               ``standardPE``
``DESCRIPTION`` Description text            ``Standard paired-end data``
``SEQ_READS``   List of sequence data reads ``[r1,r2]``
``INDEX_READS`` List of index reads         ``[r1]``
``QC_MODULES``  List of QC modules          ``[fastqc,fastq_screen]``
=============== =========================== ============================

For example:

::

    basicQC:'Basic QC protocol':seq_reads=[r1,r2]:index_reads=[]:qc_modules=[sequence_lengths,fastqc]

The name and description must always be the first two fields of a
protocol specification, in that order.

The ``seq_reads``, ``index_reads`` and ``qc_modules`` fields can appear
in any order and can be omitted if empty.

The same read ID should not appear in both ``seq_reads`` and
``index_reads``.
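
As an illustration of the rules above, the following is a minimal Python
sketch of how a specification string could be broken into its
attributes. It is not the parser used by ``auto_process``: the function
names (``split_outside``, ``parse_list``, ``parse_protocol_spec``) are
hypothetical, and treating a read ID shared between ``seq_reads`` and
``index_reads`` as an error is an assumption made for the example.
Colons inside quotes or square brackets are not treated as field
separators, so quoted descriptions and read ranges (both described later
in this document) survive the split.

```python
def split_outside(text, sep=":"):
    """Split text on sep, ignoring separators inside quotes or brackets."""
    parts, current, quote, depth = [], [], None, 0
    for ch in text:
        if quote:
            # Inside a quoted description: only the matching quote ends it
            if ch == quote:
                quote = None
            current.append(ch)
        elif ch in "'\"":
            quote = ch
            current.append(ch)
        elif ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            # Track bracket depth so colons in e.g. [r2:1-50] are kept
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_list(value):
    """Convert '[a,b]' into ['a', 'b']; '[]' gives an empty list."""
    value = value.strip()
    if not (value.startswith("[") and value.endswith("]")):
        raise ValueError(f"expected a bracketed list, got {value!r}")
    return [item.strip() for item in value[1:-1].split(",") if item.strip()]


def parse_protocol_spec(spec):
    """Return a dict of protocol attributes (hypothetical helper)."""
    fields = split_outside(spec)
    if len(fields) < 2:
        raise ValueError("specification must start with NAME:DESCRIPTION")
    name, description = fields[0], fields[1]
    # Remove one pair of matching enclosing quotes, if present
    if (len(description) >= 2 and description[0] in "'\""
            and description[0] == description[-1]):
        description = description[1:-1]
    protocol = {"name": name, "description": description,
                "seq_reads": [], "index_reads": [], "qc_modules": []}
    # Remaining fields may appear in any order and may be omitted
    for field in fields[2:]:
        key, _, value = field.partition("=")
        if key not in ("seq_reads", "index_reads", "qc_modules"):
            raise ValueError(f"unrecognised field: {key!r}")
        protocol[key] = parse_list(value)
    # Assumption: a read ID listed in both read lists is an error
    seq_ids = {r.split(":")[0] for r in protocol["seq_reads"]}
    index_ids = {r.split(":")[0] for r in protocol["index_reads"]}
    if seq_ids & index_ids:
        raise ValueError("read ID appears in both seq_reads and index_reads")
    return protocol


example = ("basicQC:'Basic QC protocol':seq_reads=[r1,r2]:"
           "index_reads=[]:qc_modules=[sequence_lengths,fastqc]")
print(parse_protocol_spec(example))
```

Scanning character by character, rather than calling ``str.split(":")``
directly, is what allows a description such as ``'Step 1: QC'`` to
contain a colon without being cut in two.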

-------------------------------
Protocol names and descriptions
-------------------------------

Names cannot contain spaces or colons; descriptions can contain any
characters (including spaces).

Optionally, descriptions can be enclosed in matching quotes (either
single or double); quotes should be used if the description includes
special characters such as colons or quotes.

.. _qc_protocols_seq_and_index_reads:

-----------------------------
Sequence data and index reads
-----------------------------

QC protocols distinguish between "sequence data" and "index" reads
as follows:

* Sequence data reads are those reads containing biologically
  significant data, and are used for mapped metrics;
* Index reads contain non-biologically significant data (e.g.
  feature barcodes).

Non-mapped metrics are run on both sequence data and index data reads.

Reads can optionally also specify subsequences to use for QC. For
example, for 10x Genomics Flex data only the first 50 bases of R2
contain the biologically significant data.

Subsequences are specified by ranges attached to reads (in this case
``seq_reads=[r2:1-50]``), and tell the pipeline to use only these
portions of the reads when running the mapped metrics.

.. _qc_protocols_qc_modules:

----------
QC modules
----------

QC modules specify which software will be run and which metrics
will be produced.

The following QC modules are available:

============================== =============================== =====================================
QC module                      Software                        Metrics
============================== =============================== =====================================
``fastqc``                     FastQC                          General stats and quality information
``fastq_screen``               FastQ Screen                    Screens against panels of organisms
``picard_insert_size_metrics`` Picard CollectInsertSizeMetrics Insert sizes
``qualimap_rnaseq``            Qualimap 'rnaseq'               Coverage and genomic origin of reads
``rseqc_genebody_coverage``    RSeQC geneBody_coverage.py      Coverage
``rseqc_infer_experiment``     RSeQC infer_experiment.py       Strand-specificity
``sequence_lengths``           Built-in                        Sequence length and masking stats
``strandedness``               fastq_strand.py                 Strandedness information
``cellranger_count``           CellRanger 'count'              Single library analysis
``cellranger-atac_count``      CellRanger-ATAC 'count'         Single library analysis
``cellranger-arc_count``       CellRanger-ARC 'count'          Single library analysis
``cellranger_multi``           CellRanger 'multi'              Multiplexing analysis
============================== =============================== =====================================

Some QC modules allow additional optional arguments to be specified,
to modify the metric or pipeline behaviour, or to change how the metric
validation is performed.

Arguments are specified using the general syntax:

::

    QC_MODULE(ARG1=VALUE1;ARG2=VALUE2;...)

For example:

::

    cellranger_count(set_cell_count=false;set_metadata=false)

The following options are recognised:

========================================== =================== ================================================
Option                                     QC modules          Description
========================================== =================== ================================================
``chemistry=CHEMISTRY``                    cellranger[*]_count Explicitly set chemistry when running
                                                               CellRanger 'count' (e.g. ``ARC-v1``)
``library=LIBRARY_TYPE``                   cellranger[*]_count Explicitly set library type in pipeline
                                                               tasks (e.g. ``snRNA-seq``)
``cellranger_version=VERSION``             cellranger[*]_count Set expected CellRanger version for
                                                               validation (default is the version
                                                               identified by the pipeline; use ``*``
                                                               to match any version)
``cellranger_refdata=REFDATA``             cellranger[*]_count Set expected CellRanger reference dataset
                                                               for validation (default is the reference
                                                               identified by the pipeline; use ``*``
                                                               to match any reference dataset)
``set_cell_count=true|false``              cellranger[*]_count Whether to use outputs to set the cell
                                                               count (default is ``true``)
``set_metadata=true|false``                cellranger[*]_count Whether to set metadata from this module
                                                               (default is ``true``)
``cellranger_use_multi_config=true|false`` cellranger_count    If ``true`` then use CellRanger 'multi'
                                                               config file to identify samples with
                                                               biological data (default is ``false``)
========================================== =================== ================================================
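
The argument syntax described above can be illustrated with a short
Python sketch that separates a QC module name from its optional
arguments. The function name ``parse_qc_module`` is hypothetical, and
converting ``true`` and ``false`` into Python booleans is an assumption
made for the example; the actual handling within ``auto_process`` may
differ.

```python
def parse_qc_module(spec):
    """Split 'module(arg=value;...)' into (name, {arg: value})."""
    spec = spec.strip()
    if "(" not in spec:
        # Module given without any arguments, e.g. 'fastqc'
        return spec, {}
    if not spec.endswith(")"):
        raise ValueError(f"missing closing parenthesis in {spec!r}")
    # Drop the trailing ')' and split at the first '('
    name, _, arg_string = spec[:-1].partition("(")
    args = {}
    for item in arg_string.split(";"):
        if not item.strip():
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"argument {item!r} is not of the form ARG=VALUE")
        value = value.strip()
        # Assumption: 'true'/'false' are mapped to Python booleans
        if value.lower() in ("true", "false"):
            value = value.lower() == "true"
        args[key.strip()] = value
    return name.strip(), args


name, args = parse_qc_module(
    "cellranger_count(set_cell_count=false;set_metadata=false)")
print(name, args)
```

Other values, such as ``*`` for ``cellranger_version``, are kept as plain
strings in this sketch so that the caller can decide how to interpret
them.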