QC protocol specification
=========================

--------
Overview
--------

QC protocols in ``auto_process`` tell the QC pipeline which
metrics to run on which reads within a dataset, so that each
type of data is assessed with the appropriate metrics.

Each protocol is defined by the following attributes:

=================== ============================
Attribute           Description
=================== ============================
Name                Short name for the protocol
Description         Longer free-text description
Sequence data reads Reads with biological data
Index data reads    Reads with index data
QC modules          List of QC modules and arguments
=================== ============================

-----------------------
Protocol specifications
-----------------------

The attributes for a protocol can be specified using a
string of the form:

::

   NAME:DESCRIPTION:seq_reads=SEQ_READS:index_reads=INDEX_READS:qc_modules=QC_MODULES

where:

=============== ================== ======================
Field           Description        Example
=============== ================== ======================
``NAME``        Protocol name      ``standardPE``
``DESCRIPTION`` Description text   ``Standard paired-end data``
``SEQ_READS``   List of reads      ``[r1,r2]``
``INDEX_READS`` List of reads      ``[r1]``
``QC_MODULES``  List of QC modules ``[fastqc,fastq_screen]``
=============== ================== ======================

For example:

::

   basicQC:'Basic QC protocol':seq_reads=[r1,r2]:index_reads=[]:qc_modules=[sequence_lengths,fastqc]

The name and description must always be the first two fields
of a protocol specification. The ``seq_reads``, ``index_reads``
and ``qc_modules`` fields can appear in any order and can be
omitted if empty. The same read ID should not appear in both
``seq_reads`` and ``index_reads``.
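
As an illustration of the rules above, the following Python sketch
splits a specification string into its fields. It is not the parser
used by ``auto_process``: the function names ``split_top_level`` and
``parse_protocol_spec`` and the layout of the returned dictionary are
hypothetical, and raising an error when a read ID appears in both
lists is a choice made for this example only.

.. code-block:: python

   def split_top_level(text, sep):
       """Split text on sep, ignoring separators inside quotes, [] or ()."""
       parts, current = [], []
       depth, quote = 0, None
       for ch in text:
           if quote is not None:
               # Inside a quoted description: only the matching quote ends it
               if ch == quote:
                   quote = None
           elif ch in "'\"":
               quote = ch
           elif ch in "[(":
               depth += 1
           elif ch in "])":
               depth -= 1
           elif ch == sep and depth == 0:
               parts.append("".join(current))
               current = []
               continue
           current.append(ch)
       parts.append("".join(current))
       return parts

   def parse_protocol_spec(spec):
       """Parse 'NAME:DESCRIPTION:key=[...]:...' into a dictionary."""
       fields = split_top_level(spec, ":")
       if len(fields) < 2:
           raise ValueError("specification must start with NAME:DESCRIPTION")
       name, description = fields[0], fields[1]
       # Remove one pair of matching enclosing quotes, if present
       if (len(description) >= 2 and description[0] in "'\""
               and description[0] == description[-1]):
           description = description[1:-1]
       protocol = {"name": name, "description": description,
                   "seq_reads": [], "index_reads": [], "qc_modules": []}
       # Remaining fields may appear in any order and may be omitted
       for field in fields[2:]:
           key, _, value = field.partition("=")
           if key not in ("seq_reads", "index_reads", "qc_modules"):
               raise ValueError("unrecognised field: %r" % field)
           if not (value.startswith("[") and value.endswith("]")):
               raise ValueError("expected a [...] list in field: %r" % field)
           inner = value[1:-1]
           protocol[key] = split_top_level(inner, ",") if inner else []
       # Compare read IDs only (ignore any ':START-END' range suffix)
       seq_ids = {r.split(":")[0] for r in protocol["seq_reads"]}
       index_ids = {r.split(":")[0] for r in protocol["index_reads"]}
       if seq_ids & index_ids:
           raise ValueError("read in both seq_reads and index_reads")
       return protocol

Applied to the ``basicQC`` example above, this sketch gives the name
``basicQC``, the description ``Basic QC protocol`` (quotes removed),
two sequence data reads, no index reads and two QC modules.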

-------------------------------
Protocol names and descriptions
-------------------------------

Names cannot contain spaces or colons; descriptions can
contain any characters (including spaces).

Optionally, descriptions can be enclosed in matching quotes
(either single or double); these should be used if the
description includes special characters such as colons or
quotes.

.. _qc_protocols_seq_and_index_reads:

-----------------------------
Sequence data and index reads
-----------------------------

QC protocols distinguish between "sequence data" and "index"
reads as follows:

* Sequence data reads are those reads containing biologically
  significant data, and are used for mapped metrics;
* Index reads contain data that is not biologically significant
  (e.g. feature barcodes).

Non-mapped metrics are run on both sequence data and index
reads.

Reads can optionally specify subsequences to use for QC. For
example, for 10x Genomics Flex data only the first 50 bases of
R2 contain the biologically significant data. Subsequences are
specified by ranges attached to reads (in this case
``seq_reads=[r2:1-50]``), and tell the pipeline to use only
these portions of the reads when running the mapped metrics.
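
The read and range notation can be illustrated with a small Python
sketch. The helper names ``parse_read``, ``trim_sequence`` and
``reads_for_metric`` are hypothetical and not part of
``auto_process``. The sketch assumes that ranges are 1-based and
inclusive (so ``1-50`` means the first 50 bases) and that both ends of
the range are always given.

.. code-block:: python

   def parse_read(read):
       """Split 'r2:1-50' into ('r2', (1, 50)); the range is None if absent."""
       read_id, sep, rng = read.partition(":")
       if not sep:
           return read_id, None
       start, _, end = rng.partition("-")
       start, end = int(start), int(end)
       if start < 1 or end < start:
           raise ValueError("invalid range in read: %r" % read)
       return read_id, (start, end)

   def trim_sequence(sequence, rng):
       """Return the part of a sequence covered by a 1-based inclusive range."""
       if rng is None:
           return sequence
       start, end = rng
       return sequence[start - 1:end]

   def reads_for_metric(seq_reads, index_reads, mapped):
       """Mapped metrics use sequence data reads; non-mapped metrics use both."""
       if mapped:
           return list(seq_reads)
       return list(seq_reads) + list(index_reads)

For example, ``parse_read("r2:1-50")`` returns ``("r2", (1, 50))``,
and ``trim_sequence`` with that range keeps the first 50 bases of a
sequence.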

.. _qc_protocols_qc_modules:

----------
QC modules
----------

QC modules specify which software will be run and which
metrics will be produced.

The following QC modules are available:

============================== =============================== ==========
QC module                      Software                        Metrics
============================== =============================== ==========
``fastqc``                     FastQC                          General stats and quality information
``fastq_screen``               FastQ Screen                    Screens against panels of organisms
``picard_insert_size_metrics`` Picard CollectInsertSizeMetrics Insert sizes
``qualimap_rnaseq``            Qualimap 'rnaseq'               Coverage and genomic origin of reads
``rseqc_genebody_coverage``    RSeQC geneBody_coverage.py      Coverage
``rseqc_infer_experiment``     RSeQC infer_experiment.py       Strand-specificity
``sequence_lengths``           Built-in                        Sequence length and masking stats
``strandedness``               fastq_strand.py                 Strandedness information
``cellranger_count``           CellRanger 'count'              Single library analysis
``cellranger-atac_count``      CellRanger-ATAC 'count'         Single library analysis
``cellranger-arc_count``       CellRanger-ARC 'count'          Single library analysis
``cellranger_multi``           CellRanger 'multi'              Multiplexing analysis
============================== =============================== ==========

Some QC modules allow additional optional arguments to be
specified, to modify either the metric or pipeline behaviour,
or how the metric validation is performed.

Arguments are specified using the general syntax:

::

   QC_MODULE(ARG1=VALUE1;ARG2=VALUE2;...)

For example:

::

   cellranger_count(set_cell_count=false;set_metadata=false)
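
To make the argument syntax concrete, the following Python sketch
separates a QC module entry into its name and a dictionary of
arguments. The function name ``parse_qc_module`` is hypothetical, and
converting ``true`` and ``false`` to Python booleans is an assumption
made for this example, not a description of how ``auto_process``
handles the values internally.

.. code-block:: python

   import re

   # Module name, optionally followed by '(ARG1=VALUE1;ARG2=VALUE2;...)'
   _MODULE_RE = re.compile(r"^([^()\s]+)(?:\((.*)\))?$")

   def parse_qc_module(module_spec):
       """Return (name, args) for an entry such as 'fastqc' or 'mod(a=b;c=d)'."""
       match = _MODULE_RE.match(module_spec.strip())
       if match is None:
           raise ValueError("invalid QC module: %r" % module_spec)
       name, arg_string = match.group(1), match.group(2)
       args = {}
       if arg_string:
           for item in arg_string.split(";"):
               key, sep, value = item.partition("=")
               key, value = key.strip(), value.strip()
               if not sep or not key:
                   raise ValueError("invalid argument: %r" % item)
               # Assumption: treat 'true'/'false' as booleans
               if value.lower() in ("true", "false"):
                   args[key] = (value.lower() == "true")
               else:
                   args[key] = value
       return name, args

With the example above, the sketch returns the module name
``cellranger_count`` and both arguments set to ``False``; a module
given without parentheses, such as ``fastqc``, yields an empty
argument dictionary.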

The following options are recognised:

========================================== =================== ===========
Option                                     QC modules          Description
========================================== =================== ===========
``chemistry=CHEMISTRY``                    cellranger[*]_count Explicitly set chemistry when running CellRanger 'count' (e.g. ``ARC-v1``)
``library=LIBRARY_TYPE``                   cellranger[*]_count Explicitly set library type in pipeline tasks (e.g. ``snRNA-seq``)
``cellranger_version=VERSION``             cellranger[*]_count Set expected CellRanger version for validation (default is the version identified by the pipeline; use ``*`` to match any version)
``cellranger_refdata=REFDATA``             cellranger[*]_count Set expected CellRanger reference dataset for validation (default is the reference identified by the pipeline; use ``*`` to match any reference dataset)
``set_cell_count=true|false``              cellranger[*]_count Whether to use outputs to set the cell count (default is ``true``)
``set_metadata=true|false``                cellranger[*]_count Whether to set metadata from this module (default is ``true``)
``cellranger_use_multi_config=true|false`` cellranger_count    If ``true`` then use CellRanger 'multi' config file to identify samples with biological data (default is ``false``)
========================================== =================== ===========
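
One possible reading of the validation options ``cellranger_version``
and ``cellranger_refdata`` is sketched below in Python. The function
names are hypothetical, and comparing values by exact string equality
is an assumption; the actual validation in ``auto_process`` may
differ.

.. code-block:: python

   def expected_value(option, detected):
       """Use the explicit option if given, else the pipeline-detected value."""
       return detected if option is None else option

   def matches_expected(actual, expected):
       """'*' matches anything; otherwise require an exact match (assumed)."""
       return expected == "*" or actual == expected

Under these assumptions, outputs produced with a different version
than the one the pipeline identifies would fail validation by default,
but would pass if the protocol specified ``cellranger_version=*``.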