Running the QC standalone with ``run_qc.py``
============================================

The utility ``run_qc.py`` allows the QC pipeline to be run on an
arbitrary set of Fastqs outside of the ``auto_process`` pipeline.

The general invocation is:

::

   run_qc.py DIR | FASTQ [ FASTQ ... ]

If ``DIR`` is supplied then it should be a directory containing
the Fastq files to run the QC on:

::

   run_qc.py /mnt/data/project/fastqs.trimmed/

Alternatively a list of Fastq files can be supplied directly,
for example:

::

   run_qc.py /mnt/data/project/fastqs.trimmed/*.fastq

.. note::

   You can use more elaborate shell wildcard patterns to select
   the Fastqs to pass to the QC, for example:

   ::

      run_qc.py /mnt/data/project/fastqs.trimmed/{SC1_*,SC2_*}.trimmed.fastq

   would only select Fastq files matching ``SC1_*.trimmed.fastq``
   and ``SC2_*.trimmed.fastq``.

   See https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm
   for more detail on using shell-style wildcards for pattern
   matching.

Various options are available to control the QC; the following
sections outline the most useful - see the documentation in the
:ref:`utilities_run_qc` section for the full set of options.

Specifying the QC metadata
--------------------------

The following options specify metadata for the QC which will
determine which metrics are run:

* ``--organism``: specify the organism(s) (e.g. ``human``,
  ``mouse`` etc; use the ``--info`` option to list the
  organisms available in the current configuration)
* ``--library-type``: specify the library type (e.g. ``RNA-seq``,
  ``ChIP-seq`` etc)
* ``--single-cell-platform``: specify the name for the single
  cell platform (e.g. ``10xGenomics Chromium 3'v3``; see
  :doc:`../control_files/projects_info` for a complete list,
  or use the ``--info`` option)
* ``--biological-samples``: explicitly specify the subset of
  samples with biological data (as opposed to e.g. feature
  barcode information) (by default all samples are treated as
  if they contained biological data)

.. note::

   By default ``run_qc.py`` attempts to locate a project
   metadata file associated with the input files, and will
   use the metadata defined there unless overridden by the
   options above; use the ``--ignore-metadata`` option to
   ignore the project metadata even if it is found.

Specifying the outputs
----------------------

By default the QC reports are written to the current directory,
with the QC outputs written to a ``qc`` subdirectory.

The following options can be used to override the defaults:

* ``-o``/``--out_dir``: sets the top-level directory where
  the QC reports and outputs are written
* ``--qc_dir``: sets the location where the QC outputs are
  written; if this is a relative path then it will be a
  subdirectory of the top-level output directory
* ``--filename``: name for the HTML report from the QC
* ``--name``: sets the name for the project (used in the
  QC report title)

Updating existing QC outputs
----------------------------

If the output directory for the QC already exists then by
default ``run_qc.py`` will stop without running the pipeline;
to override this behaviour and update an existing QC output
directory, specify the ``-u`` or ``--update`` option.

Specifying reference data
-------------------------

By default the reference data used in the QC pipeline (for
example the ``STAR`` index or the genome annotation) are
determined from the configuration file according to the
organism associated with the dataset.

If no matching reference data are found then the following
options are available:

* ``--star-index``: sets the path to the ``STAR`` index
  to use
* ``--gtf``: sets the path to the genome annotation in
  GTF format

Note that these options will also override the reference data
which are set in the configuration.

.. note::

   The ``--info`` option can be used to see what default
   reference data has been assigned to each configured
   organism name.

Running 10xGenomics single library analyses
-------------------------------------------

When running the QC pipeline on 10x Genomics data using an
:doc:`appropriate protocol <run_qc>`, additional options are
available for controlling the single library analyses that
are run using the ``count`` command of the appropriate
10x Genomics software package.

The following options can be used to override the defaults
defined in the configuration:

* ``--cellranger``: explicitly sets the path to the ``cellranger``
  (or other appropriate 10xGenomics package)
* ``--cellranger-reference``: sets the path to the reference
  dataset to use for single library analysis

For single cell RNA-seq additional options are available:

* ``--10x_force_cells``: explicitly specify the number of cells,
  overriding automatic cell detection algorithms (i.e. set the
  ``--force-cells`` option for Cellranger)
* ``--10x_chemistry``: explicitly set the assay configuration,
  overriding the automatic assay detection (i.e. set the
  ``--chemistry`` option for Cellranger)

.. note::

   The full outputs from the single library analyses are
   copied to the ``cellranger_count`` subdirectory of the
   top-level output directory, in addition to the metrics
   and HTML summary files written to the QC directory.

Running 10x Genomics CellRanger ``multi`` analysis
--------------------------------------------------

When running the QC pipeline on 10x Genomics data using an
:doc:`appropriate protocol <run_qc>`, additional options are
available for controlling the CellRanger ``multi`` analyses
(i.e. the ``cellranger multi`` command). (These options
supplement those available for the single library analyses
in the previous section.)

In order to enable the ``multi`` analyses, information on the
each of the 10x Genomics libraries associated with each physical
sample must be specified via the ``--10x_library`` option,
using the syntax:

::

   --10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE

where ``SAMPLE`` is an optional physical sample name (required
if there is more than one physical sample), ``FASTQ_ID`` is
essentially the "sample name" part of the Fastq name, and
``FEATURE_TYPE`` is the associated feature type, e.g.
``Gene expression``).

For example:

::

   --10x_library "PB01:PB01_GEX:Gene Expression" --10x_library PB01:PB01_TCR:VDJ-T ...


There should be one ``--10x_library`` option supplied for each
Fastq ID.

``run_qc.py`` then uses the supplied information to build
``cellranger multi`` configuration files for each physical sample,
with the feature types assigned within the libraries section.

If the dataset has multiplexed samples (for example CellPlex or Flex
data) then their names can be assigned to the appropriate CMO or
probeset ID using the ``--10x_multiplexed_samples`` option:

::

   --10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID,...

where ``SAMPLE`` is an optional physical sample name (required
if there is more than one physical sample).

For example:

::

   --10x_multiplexed_samples PJB1:PBA=CMO301,PBB=CMO302

Two additional command options are provided specifically for V(D)J
and Flex data:

* ``--cellranger-vdj-reference``: if V(D)J data is present then
  this option should be used to specify the location of the V(D)J
  reference dataset;
* ``--cellranger-probest``: for Flex data this option can be used
  to specify the location of the probe set reference data
  (otherwise the probe set will be taken from the configuration,
  if possible).


Running on different platforms: ``--local``
-------------------------------------------

By default the QC pipeline will run using the settings from the
``auto_process`` configuration file; however it is recommended
to use the ``--local`` option if running the QC on a local
workstation, or within a job submitted to a compute cluster
(for example, if running inside another script).

For example: submitting a QC run as a single job on a Grid
Engine compute cluster might look like:

::

   qsub -b y -V -pe smp.pe 16 'run_qc.py --local /data/Fastqs'

In this mode the pipeline overrides the central configuration
and attempts to adjust parameters for running the QC to suit
the local setup.

It should make reasonable guesses for the number of available
CPUs and memory. However the following options can be used with
``--local`` to override the guesses:

* ``--maxcores``: sets the maximum number of CPUs available;
  the QC will not exceed this number when running jobs. If
  this isn't set explicitly then the pipeline will attempt to
  determine the number of CPUs automatically;
* ``--maxmem``: sets the maximum amount of memory available
  (in Gbs); currently this is only used if ``cellranger`` is
  being run. If this isn't set explicitly then the pipeline will
  attempt to determine the available memory automatically.

Explicitly specifying these parameters for a QC run submitted as
a single job on a Grid Engine compute cluster might look like:

::

   qsub -b y -V -pe smp.pe 16 'run_qc.py --local --maxcores=16 --maxmem=64 /data/Fastqs'

Listing organisms and other information: ``--info``
---------------------------------------------------

The ``--info`` option of ``run_qc.py`` displays various items
from the current configuration, including a list of the
available organisms and the indexes and other reference data
assigned to each.

Other information includes the available QC protocols, single
ell platforms and FastqScreen ``.conf`` files.

Once the information is displayed, ``run_qc.py`` will exit
without performing any further action.

Per-lane QC: ``--split-fastqs-by-lane``
---------------------------------------

The ``--split-fastqs-by-lane`` forces the pipeline to generate
copies of the input Fastqs for each lane that appears in the
read headers of each Fastq, and then run the QC on those
per-lane Fastqs (rather the originals, which are not changed).

This results in per-lane QC metrics, which can be useful for
diagnostic purposes when handling Fastqs which have been merged
across multiple lanes (for example, to determine whether
contanimation is confined to a single lane).

Specifying and customising the QC protocols
-------------------------------------------

If a QC protocol is not specified explicitly then by default
``run_qc.py`` will determine which protocol to use based on the
metadata; see :doc:`run_qc` for a complete list of the built-in
protocols (or use ``run_qc.py``'s ``--info`` option).

This behaviour can be overridden in a number of ways:

* The ``--protocol`` option can be used to explicitly specify
  one of the built-in QC protocols to use, or
* The ``--protocol-spec`` option can be used to fully specify
  a custom QC protocol using a specification string (see
  :doc:`qc_protocol_specification` for details of protocol
  specification strings), or
* A combination of the ``--sequence-reads``, ``--index-reads``
  and/or ``--qc-modules`` can be used to customise or specify
  the QC protocol.

The ``--sequence-reads`` and ``--index-reads`` options specify
which reads contain biological data and which contain only
"index" data; the ``--qc-modules`` option specify which QC
metrics will be generated and reported.

.. note::

   Details of the available QC modules can be found in
   :ref:`qc_protocols_qc_modules`.

.. note::

   Using ``--sequence-reads``, ``--index-reads`` and
   ``--qc-modules`` will override the relevant parts of the
   specified or automatically determined QC protocol.

Handling non-standard Fastq file names
--------------------------------------

By default, the pipeline expects Fastq names to be some variant
of the Illumina-style naming structure. Where a collection of
Fastq files have names that match one of these variants, the
standalone QC is able to pair R1 and R2 Fastqs and run the
appropriate QC protocol.

Where the Fastqs don't use an Illumina or psuedo-Illumina naming
format, the pipeline may fail to match R1/R2 read pairs correctly.
This may result in an inappropriate QC protocol being selected, or
the correct protocol failing to run correctly.

To handle these "non-standard" Fastq file names, it is possible to
supply a basic naming pattern to ``run_qc.py`` which will tell the
utility how to extract sample names and read numbers from the names,
via the ``--fastq-pattern`` option.

The pattern should be a string consisting of the following
elements:

* ``{SAMPLE}`` and ``{READ}``, indicating where the sample name
  and read numbers are expected;
* Fixed parts of the names which are present in all names;
* Wildcards (``*``) at positions where there may be variations
  between names which are not relevant to the sample or read
  numbers.

Some examples of patterns for different file name structures are
given in the table below.

  ===========================  =========================
  Pattern                      Example matching Fastqs
  ===========================  =========================
  ``{SAMPLE}_{READ}``          ``SMP1-1_1.fastq``, ``SMP1-1_2.fastq`` ...
  ``{SAMPLE}_R{READ}``         ``SMP1-1_R1.fastq``, ``SMP1-1_R2.fastq`` ...
  ``{SAMPLE}_S*_L*_R{READ}``   ``SMP1-1_S1_L002_R2_001.fq``, ...
  ``{SAMPLE}_ELK*_L*_{READ}``  ``SMP1_ELK250000280-1B_24EWNTLT3_L2_2.fq.gz``, ...
  ===========================  =========================

Using these patterns will enable the pipeline to correctly pair
Fastqs for the QC, and to use the appropriate sample names in the
reporting.