Fastq generation using ``auto_process make_fastqs``
===================================================

Overview
--------

The ``make_fastqs`` command is the backbone of the ``auto_process``
pipeline. It is run after creating the analysis directory using the
:doc:`setup <setup>` command and performs the key step of generating
Fastq files from the raw BCL data produced by the sequencer.

The general invocation of the command is:

::

   auto_process.py make_fastqs [--protocol=PROTOCOL] *options* [ANALYSIS_DIR]

By default ``make_fastqs`` performs the following steps:

* Fetches the BCL data and copies it to the ``primary_data`` subdirectory
  of the analysis directory
* Checks the sample sheet for errors and possible barcode collisions
* Runs BCL to Fastq conversion (either ``bcl2fastq`` or ``bcl-convert``,
  or the appropriate pipeline for  for 10x Genomics single cell or spatial
  data e.g. ``cellranger``) and verifies that the outputs match what was
  expected from the input sample sheet
* Generates statistics for the Fastq data and produces a report on the
  analysing the numbers of reads assigned to each sample, lane and
  project (``processing_qc.html``)
* Analyses the barcode index sequences to identify possible demultiplexing
  issues

Various options are available to skip or control each of these stages;
more detail on the different usage modes can be found in the
subsequent sections:

* :ref:`make_fastqs-truncating-read-lengths`
* :ref:`make_fastqs-adapter-trimming-and-masking`
* :ref:`make_fastqs-mixed-protocols`
* :ref:`make_fastqs-bcl-converter`

Information on the different Fastq generation protocols can be found in
:ref:`make_fastqs-protocols`, and some of the other useful options can be
for the found in :ref:`make_fastqs-commonly-used-options`.

The outputs produced on successful completion are described below
in the section :ref:`make_fastqs-outputs`; it is recommended to check
the :doc:`processing QC <../output/processing_qc>` and
:doc:`barcode analysis <../output/barcode_analysis>` reports which
will highlight issues with the demultiplexing.

Once the Fastqs have been generated, the next step is to set up the
project directories - see
:doc:`Setting up project directories <setup_analysis_dirs>`.

.. _make_fastqs-protocols:

Fastq generation protocols
--------------------------

The ``--protocol`` option should be set according to the type of data
that are being processed:

.. include:: ../auto/fq_protocols.rst

Typically the ``standard`` protocol is sufficient for most types of
data (RNA-seq, ATAC-seq, ChIP-seq, metagenomics etc) where
the sample sheet contains Illumina index sequences.

.. note::

   For data where the sequences are expected to be very short (such
   as miRNA-seq data), the ``mirna`` protocol should be used instead -
   this is the same as the ``standard`` protocol but adjusts the
   adapter trimming and masking options as follows:

   * Sets the minimum trimmed read length to 10 bases (default is
     35 bases in ``standard`` mode)
   * Turns off short read masking by setting the threshold length
     to zero (default is 22 in ``standard`` mode)

   More details about adapter trimming and short read masking can be
   found in the section :ref:`make_fastqs-adapter-trimming-and-masking`.

For other types of data (typically single cell and spatial), refer to
the appropriate section of the documentation for more details of which
Fastq generation protocols should be used:

* :doc:`10x Genomics single cell data <../single_cell/10x_single_cell>`
* :doc:`10x Genomics spatial data <../spatial/10x_visium>`
* :doc:`Parse Evercode single cell data <../single_cell/parse>`
* :doc:`Bio-Rad single cell data <../single_cell/bio_rad>`

.. _make_fastqs-commonly-used-options:

Commonly used options
---------------------

Some of the most commonly used options are:

* ``--protocol``: specifies the Fastq generation protocol
* ``--output-dir``: specifies the directory to write the output
  Fastqs to (defaults to ``bcl2fastq``)
* ``--sample-sheet``: specifies a non-default sample sheet file
  to use (defaults to ``custom_SampleSheet.csv``; the new sample
  sheet file will become the default for subsequent runs)
* ``--lanes``: allows a subset of lanes to be processed (useful
  for multi-lane sequencers when samples with a mixture
  of processing protocols have been run). Lanes can be specified
  as a range (e.g. ``1-4``), a list (e.g. ``6,8``) or a
  combination (e.g. ``1-4,6,8``). See
  :ref:`make_fastqs-mixed-protocols` for more details
* ``--bcl-converter``: allows the Illumina Fastq generation
  software to be specified, see :ref:`make_fastqs-bcl-converter`
  for more details
* ``--use-bases-mask``: allows a custom bases mask string (which
  controls how each cycle of raw data is used) to be specified
  (default is to determine the bases mask automatically; set to
  ``auto`` to restore this behaviour)
* ``--platform``: if the sequencer platform cannot be identified
  from the instrument name it can be explicitly specified using
  this option (see :ref:`config_sequencer_platforms` for how to
  associate sequencers and platforms in the configuration)
* ``--no-barcode-analysis`` skips the barcode analysis for
  standard runs
* ``--no-stats`` skips the generation of statistics and processing
  QC reporting

The full set of options can be found in the
:ref:`'make_fastqs' <commands_make_fastqs>` section of the command
reference.

.. _make_fastqs-truncating-read-lengths:

Truncating R1/R2/R3 read lengths and setting bases mask
-------------------------------------------------------

In some cases it may be desirable to truncate the lengths of the
non-index reads, most typically for the R1 and/or R2 sequences.
In these cases the ``--r1-length`` and/or ``--r2-length`` options
can be used to specify the maximum length for one or both of the
R1 and R2 reads.

For example:

::

  auto_process.py make_fastqs --r1-length=28

would result in R1 sequences with a maximum length of 28bp.

Maximum read lengths can also be applied to a subset of lanes
via the ``--lanes`` option, for example:

::

   auto_process.py make_fastqs --lanes=1-2:standard:r1_length=28

.. note::

   For datasets which define an R3 read (for example, 10x
   Genomics single cell ATAC data), the ``--r3-length`` option
   is available to explicitly truncate the length of the R3
   reads (with an ``r3_length=...`` option available for lane
   subsets).

.. note::

   The read truncation options operate by adjusting the bases
   mask used to match the required length, so if a bases mask
   is explicitly provided then these options will also not be
   applied.

   Alternatively (or in scenarios where more complicated
   read manipulations are required), the bases mask can be
   explicitly specified via the ``--use-bases-mask`` option;
   for example:

   ::

      y28n48,I8,I8,y76

   would also truncate R1 sequences to the first 28bp.

.. _make_fastqs-adapter-trimming-and-masking:

Configuring adapter trimming and masking
----------------------------------------

By default Fastq generation includes adapter trimming and masking of
short reads via ``bcl2fastq``.

Adapter sequences used for trimming are taken from those specified
in the input sample sheet, but these can be overriden by using the
``--adapter`` and ``--adapter-read2`` options to specify different
sequences.

Adapter trimming can be disabled by specifying the
``--no-adapter-trimming`` option (or by setting both adapter
sequences to empty strings).

When adapter trimming is performed two additional operations are
applied:

* **Minium read length** is enforced for reads which are shorter
  than this length after trimming, by padding them with N's
  up to the minimum length
* **Masking of short reads** is performed for reads below a
  masking threshold length, by masking *all* bases in the read
  with N's

Minimum read length defaults to 35 bases but can set explicitly by
using the ``--minimum-trimmed-read-length`` option; the masking
threshold defaults to 22 bases but can be set using the
``--mask-short-adapter-reads`` option. Set this to zero to turn
off masking.

.. warning::

   Setting the minimum read length to zero when using adapter
   trimming can result in read records with zero-length sequences,
   which may cause problems in downstream analyses.

.. _make_fastqs-mixed-protocols:

Fastq generation for runs with mixed protocols and options
----------------------------------------------------------

Multi-lane instruments such as the HiSeq platform provide the
option to run mixtures of samples requiring different processing
protocols in a single sequencing run, for example:

* Samples in some lanes have different barcode index
  characteristics (e.g. different lengths) to those in
  other lanes
* Some lanes contain standard samples whilst others contain
  single cell or spatial samples

``make_fastqs`` is able to process these in a single run provided
that:

* the sample sheet has the appropriate index sequences for
  each lane (for example, truncating index sequences, or
  inserting the appropriate 10xGenomics indexes); and
* where different protocols or processing options need to
  be specified for groups of lanes, that these are specified
  via multiple ``--lanes`` options.

``make_fastqs`` will process each set of lanes separately
before combining them into a single output directory at the
end.

For example: say we have a HiSeq run with non-standard samples
in lanes 5 and 6, and standard samples in all other lanes.

If the samples in lanes 5 and 6 have different barcode lengths
to those in the other lanes, but should otherwise be treated
the same, then the following command line would be sufficient
to handle this:

::

   auto_process.py make_fastqs \
	    --sample-sheet=SampleSheet.updated.csv

However if the samples in lanes 5 and 6 were 10xGenomics
Chromium single cell data, then it is necessary to explicitly
specify which lanes to group together and how each group should
be handled. This is done using the ``--lanes`` option to
indicate that the ``10x_chromium_sc`` protocol should be used
with lanes 5 and 6, and that the ``standard`` protocol should
be used with the other lanes:

::

   auto_process.py make_fastqs \
            --lanes=1-4,7-8:standard \
	    --lanes=5,6:10x_chromium_sc \
	    --sample-sheet=SampleSheet.updated.csv


.. note::

   If the ``--lanes`` option is used one or more times then
   only those lanes explicitly listed will be processed.
   Lanes that aren't specified will be excluded from the
   processing.

More generally it's possible to set multiple options on a
set of lanes using the lanes option, for example to explicitly
specify the adapter sequences for lane 8:

::

   auto_process.py make_fastqs \
            --lanes=1-7 \
	    --lanes=8:adapter=CTGTCTCTTATACACATCT \
	    --sample-sheet=SampleSheet.updated.csv

The general form of the ``--lanes`` option is:

::

   --lanes=LANES[:protocol][:OPTION=VALUE[:OPTION=VALUE...]]

The available options are:

===================================== ==================================
Option                                Description
===================================== ==================================
``bases_mask=BASES_MASK``             Set bases mask
``r1_length=LENGTH``                  Truncate R1 reads to ``LENGTH``
``r2_length=LENGTH``                  Truncate R2 reads to ``LENGTH``
``r3_length=LENGTH``                  Truncate R3 reads to ``LENGTH``
``trim_adapters=yes|no``              Turn adapter trimming on or off
``adapter=SEQUENCE``                  Set adapter sequence for trimming
``adapter_read2=SEQUENCE``            Set read2 adapter sequence
``minimum_trimmed_read_length=N``     Set minimum trimmed read length
``mask_short_adapter_reads=N``        Set minimum read length below
                                      which sequences are masked
``tenx_filter_single_index=yes|no``   Set ``--filter-single-index``
                                      option for ``cellranger``
                                      or ``cellranger-arc``
``tenx_filter_dual_index=yes|no``     Set ``--filter-dual-index``
                                      option for ``cellranger``
                                      or ``cellranger-arc``
``spaceranger_rc_i2_override=BOOL``   Set ``--rc-i2-override`` option
                                      for ``spaceranger`` (can be
                                      either ``true`` or ``false``)
``analyse_barcodes=yes|no``           Turn barcode analysis on or off
===================================== ==================================

These options will override the defaults and any global values
set by the top-level options.

It is also possible to process subsets of lanes manually, and
then use the ``merge_fastq_dirs``, ``update_fastq_stats`` and
``analyse_barcodes`` commands to combine and analyse the Fastqs.

For example, for the mixture of standard and 10xGenomics samples
previously described this might look like:

::

   # Process lanes 1-4,7-8 (standard samples)
   auto_process.py make_fastqs \
            --lanes=1-4,7-8 \
	    --sample-sheet=SampleSheet.updated.csv \
            --output-dir=bcl2fastq.L123478 \
            --use-bases-mask=auto \
            --no-barcode-analysis \
	    --no-stats

   # Process lanes 5-6 (10xGenomics samples)
   auto_process.py make_fastqs \
            --lanes=5-6 \
	    --sample-sheet=SampleSheet.updated.csv \
	    --protocol=10x_chromium_sc \
            --output-dir=bcl2fastq.L56 \
            --use-bases-mask=auto \
	    --no-stats

   # Combine outputs
   auto_process.py merge_fastq_dirs \
             --primary-unaligned-dir=bcl2fastq.L123478 \
	     --output-dir=bcl2fastq

   # Generate statistics
   auto_process.py update_fastq_stats

   # Analyse barcodes (standard samples only)
   auto_process.py analyse_barcodes --lanes=1-4,7-8

See the appropriate sections of the command reference for
the full set of available options:

* :ref:`commands_merge_fastq_dirs`
* :ref:`commands_update_fastq_stats`
* :ref:`commands_analyse_barcodes`

.. _make_fastqs-processing-same-run-multiple-times:

Processing a single run multiple times
--------------------------------------

Sometimes it is necessary to process a single run multiple times,
(for example, to try different parameter sets) while keeping the
outputs from each processing attempt in the same analysis
directory.

The ``--id`` option of the ``make_fastqs`` command can be used to
facilitate this, by allowing an identifier (e.g. ``no_trimming``)
to be supplied which will then be appended to the outputs from the
Fastq generation (including the output directories holding the
generated Fastqs, the barcode analysis directories, and the
statistics and processing report files).

For example:

::

   auto_process.py make_fastqs --id=no_trimming --no-adapter-trimming

would produce ``bcl2fastq_no_trimming``, ``barcodes_no_trimming``,
``statistics_no_trimming.info`` and so on.

.. note::

   The ``--id`` option of the ``setup_analysis_dirs`` command
   can be used to create projects which carry the same identifier,
   see :ref:`setup_analysis_dirs-add-identifier`.

.. note::

   A simpler alternative is to set up a completely new parallel
   analysis directory for reprocessing, and expliciting assigning
   a unique analysis number to distinguish it from other analysis
   attempts.

   This can be done via the ``setup`` command using the ``-n``
   option (see :ref:`setup_specifying_analysis_run_number`), or by
   setting the ``analysis_number`` metadata item within an existing
   analysis directory.

.. _make_fastqs-bcl-converter:

Specifying Illumina BCL conversion software
-------------------------------------------

For the ``standard`` and ``mirna`` Fastq generation protocols,
it is possible to use either the ``bcl2fastq`` or ``bcl-convert``
software packages to convert raw BCL data into Fastq files.

The ``--bcl-converter`` command line option can be used to
specify both the BCL converter software and optionally also
restrict to a range (or single version), for example:

::

   auto_process.py make_fastqs --bcl-converter 'bcl-convert>=3.7'

Default BCL conversion software can be specified in the config
file, both generally and on a per-platform basis (see
:ref:`specifying_bcl_conversion_software`).

.. _make_fastqs-outputs:

Outputs
-------

On completion the ``make_fastqs`` command will produce:

* An output directory called ``bcl2fastq`` with the demultiplexed
  Fastq files (see below for more detail)
* A set of tab-delimited files with statistics on each of the
  Fastq files
* An HTML report on the processing QC (see the section on
  :doc:`Processing QC reports <../output/processing_qc>`)
* A :doc:`projects.info <../control_files/projects_info>` metadata
  file which is used by the :doc:`setup_analysis_dirs <setup_analysis_dirs>`
  command when setting up analysis project directories (see
  :doc:`Setting up project directories <setup_analysis_dirs>`)

For standard runs there will additional outputs:

* A directory called ``barcode_analysis`` which will contain
  reports with analysis of the barcode index sequences (see the
  section on :doc:`Barcode analysis <../output/barcode_analysis>`)

If the run included 10xGenomics Chromium 3' data then there will
be some additional outputs:

* A report in the top-level analysis directory called
  ``cellranger_qc_summary[_LANES].html``, which is an HTML copy
  of the QC summary JSON file produced by ``cellranger mkfastq``
  (nb ``LANES`` will be the subset of lanes from the
  run which contained the Chromium data, if the run consisted
  of a mixture of Chromium and non-Chromium samples, for example:
  ``--lanes=5,6`` results in ``56``).

.. note::

   The processing QC reports can be copied to the QC server using
   the :doc:`publish_qc command <publish_qc>`.

Output Fastq files
******************

Each sample defined in the input sample sheet will produce one
or more output Fastq files, depending on:

* if the run was single- or paired-end,
* whether the sample appeared in more than one lane, and
* whether the ``--no-lane-splitting`` option was specified

By default if samples appear in more than one lane in a sequencing
run then ``make_fastqs`` will generate multiple Fastqs with
each Fastq only containing reads from a single lane, and with
the lane number appearing in the Fastq file name.

However if the ``--no-lane-splitting`` option is specified then
the reads from all lanes that the sample appeared in will be
combined into the same Fastq file.

The default lane splitting behaviour can be controlled via the
configuration options in the ``auto_process.ini`` file (see
:doc:`configuration <../configuration>`).

.. note::

   Lane splitting is always performed for 10xGenomics single cell
   data, regardless of the settings or options supplied to
   ``make_fastqs``.