Fastq generation using auto_process make_fastqs
Overview
The make_fastqs
command is the backbone of the auto_process
pipeline. It is run after creating the analysis directory using the
setup command and performs the key step of generating
Fastq files from the raw BCL data produced by the sequencer.
The general invocation of the command is:
auto_process.py make_fastqs [--protocol=PROTOCOL] *options* [ANALYSIS_DIR]
By default make_fastqs
performs the following steps:
Fetches the BCL data and copies it to the primary_data subdirectory of the analysis directory
Checks the sample sheet for errors and possible barcode collisions
Runs BCL to Fastq conversion (either bcl2fastq or bcl-convert, or the appropriate pipeline for 10x Genomics single cell or spatial data, e.g. cellranger) and verifies that the outputs match what was expected from the input sample sheet
Generates statistics for the Fastq data and produces a report analysing the numbers of reads assigned to each sample, lane and project (processing_qc.html)
Analyses the barcode index sequences to identify possible demultiplexing issues
Various options are available to skip or control each of these stages; more detail on the different usage modes can be found in the subsequent sections.
Information on the different Fastq generation protocols can be found in Fastq generation protocols, and some of the other useful options can be found in Commonly used options.
The outputs produced on successful completion are described below in the section Outputs; it is recommended to check the processing QC and barcode analysis reports which will highlight issues with the demultiplexing.
Once the Fastqs have been generated, the next step is to set up the project directories - see Setting up project directories.
Fastq generation protocols
The --protocol
option should be set according to the type of data
that are being processed:
| Protocol option | Used for |
|---|---|
| standard | Standard Illumina sequencing data (default) |
| mirna | miRNA-seq data |
| 10x_chromium_sc | 10xGenomics Chromium single cell RNA-seq, CellPlex and Flex data |
|  | 10xGenomics Chromium single cell ATAC-seq data |
|  | 10xGenomics Visium spatial RNA-seq data |
|  | 10xGenomics Multiome single cell ATAC or GEX data only in single run |
|  | 10xGenomics Multiome single cell ATAC data (run with pooled GEX and ATAC data) |
|  | 10xGenomics Multiome single cell GEX data only (run with pooled GEX and ATAC data) |
|  | Parse Evercode single cell data |
|  | ICELL8 single-cell RNA-seq data |
|  | ICELL8 single-cell ATAC-seq data |
Typically the standard
protocol is sufficient for most types of
data (RNA-seq, ATAC-seq, ChIP-seq, metagenomics etc) where
the sample sheet contains Illumina index sequences.
Note
For data where the sequences are expected to be very short (such
as miRNA-seq data), the mirna
protocol should be used instead -
this is the same as the standard
protocol but adjusts the
adapter trimming and masking options as follows:
Sets the minimum trimmed read length to 10 bases (default is 35 bases in standard mode)
Turns off short read masking by setting the threshold length to zero (default is 22 in standard mode)
More details about adapter trimming and short read masking can be found in the section Configuring adapter trimming and masking.
For other types of data (typically single cell and spatial), refer to the appropriate section of the documentation for more details of which Fastq generation protocols should be used.
Commonly used options
Some of the most commonly used options are:
--protocol: specifies the Fastq generation protocol
--output-dir: specifies the directory to write the output Fastqs to (defaults to bcl2fastq)
--sample-sheet: specifies a non-default sample sheet file to use (defaults to custom_SampleSheet.csv; the new sample sheet file will become the default for subsequent runs)
--lanes: allows a subset of lanes to be processed (useful for multi-lane sequencers when samples with a mixture of processing protocols have been run). Lanes can be specified as a range (e.g. 1-4), a list (e.g. 6,8) or a combination (e.g. 1-4,6,8). See Fastq generation for runs with mixed protocols and options for more details
--bcl-converter: allows the Illumina Fastq generation software to be specified, see Specifying Illumina BCL conversion software for more details
--use-bases-mask: allows a custom bases mask string (which controls how each cycle of raw data is used) to be specified (default is to determine the bases mask automatically; set to auto to restore this behaviour)
--platform: if the sequencer platform cannot be identified from the instrument name it can be explicitly specified using this option (see Sequencers and platforms for how to associate sequencers and platforms in the configuration)
--no-barcode-analysis: skips the barcode analysis for standard runs
--no-stats: skips the generation of statistics and processing QC reporting
The full set of options can be found in the ‘make_fastqs’ section of the command reference.
Truncating R1/R2 read lengths and setting bases mask
In some cases it may be desirable to truncate the lengths of the
non-index reads, most typically for the R1 and/or R2 sequences.
In these cases the --r1-length
and/or --r2-length
options
can be used to specify the maximum length for one or both of the
R1 and R2 reads.
For example:
auto_process.py make_fastqs --r1-length=28
would result in R1 sequences with a maximum length of 28bp.
Maximum read lengths can also be applied to a subset of lanes
via the --lanes
option, for example:
auto_process.py make_fastqs --lanes=1-2:standard:r1_length=28
Note
The --r1-length
and --r2-length
options are only
applied for the standard
and mirna
protocols; they
are ignored for other protocols.
The options operate by adjusting the bases mask used to match the required length, so if a bases mask is explicitly provided then these options will also not be applied.
Alternatively (or in scenarios where more complicated
read manipulations are required), the bases mask can be
explicitly specified via the --use-bases-mask
option;
for example:
y28n48,I8,I8,y76
would also truncate R1 sequences to the first 28bp.
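The relationship between a maximum read length and the resulting bases mask can be sketched in Python. This is an illustrative model only, not the code used by auto_process: the function name truncate_read_mask is hypothetical, and it assumes each data read appears in the mask as a simple yN component.

```python
import re

def truncate_read_mask(mask, read_index, max_length):
    """Truncate one data read ('y' component) of a bases mask.

    mask: comma-separated bases mask, e.g. "y76,I8,I8,y76"
    read_index: 1 for R1, 2 for R2 (only 'y' components are counted)
    max_length: maximum number of cycles to keep for that read
    """
    components = mask.split(",")
    seen = 0
    for i, component in enumerate(components):
        if not component.lower().startswith("y"):
            continue  # index read (I...) or fully skipped read (n...)
        seen += 1
        if seen != read_index:
            continue
        # Only a simple "yN" component is rewritten (assumption)
        match = re.fullmatch(r"[yY](\d+)", component)
        if match and int(match.group(1)) > max_length:
            cycles = int(match.group(1))
            # Keep the first max_length cycles, ignore the remainder
            components[i] = f"y{max_length}n{cycles - max_length}"
        break
    return ",".join(components)

print(truncate_read_mask("y76,I8,I8,y76", 1, 28))  # y28n48,I8,I8,y76
```

Under these assumptions, limiting R1 to 28 bases on a 76-cycle read gives the same mask as the explicit example above.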
Configuring adapter trimming and masking
By default Fastq generation includes adapter trimming and masking of
short reads via bcl2fastq
.
Adapter sequences used for trimming are taken from those specified
in the input sample sheet, but these can be overridden by using the
--adapter
and --adapter-read2
options to specify different
sequences.
Adapter trimming can be disabled by specifying the
--no-adapter-trimming
option (or by setting both adapter
sequences to empty strings).
When adapter trimming is performed two additional operations are applied:
Minimum read length is enforced for reads which are shorter than this length after trimming, by padding them with N’s up to the minimum length
Masking of short reads is performed for reads below a masking threshold length, by masking all bases in the read with N’s
Minimum read length defaults to 35 bases but can be set explicitly by
using the --minimum-trimmed-read-length
option; the masking
threshold defaults to 22 bases but can be set using the
--mask-short-adapter-reads
option. Set this to zero to turn
off masking.
Warning
Setting the minimum read length to zero when using adapter trimming can result in read records with zero-length sequences, which may cause problems in downstream analyses.
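The combined effect of the two thresholds can be modelled with a short Python sketch. It is a simplified reading of the description above rather than the converter's actual implementation; the function name apply_length_rules is hypothetical, and the defaults are the standard protocol values quoted in this section.

```python
def apply_length_rules(trimmed_seq, min_length=35, mask_threshold=22):
    """Model the two operations applied after adapter trimming.

    trimmed_seq: read sequence after the adapter has been removed
    min_length: minimum trimmed read length (35 in standard mode)
    mask_threshold: reads shorter than this are fully masked
        (22 in standard mode); zero turns masking off
    """
    seq = trimmed_seq
    if len(seq) < mask_threshold:
        # Short read masking: replace every base with N
        seq = "N" * len(seq)
    if len(seq) < min_length:
        # Enforce the minimum length by padding with N's
        seq = seq + "N" * (min_length - len(seq))
    return seq

# 24 bases survive trimming: kept, then padded to 35
print(apply_length_rules("ACGT" * 6))
# 20 bases survive trimming: below the masking threshold, so all N's
print(apply_length_rules("ACGT" * 5))
# mirna-style settings: minimum length 10, masking off
print(apply_length_rules("ACGTACG", min_length=10, mask_threshold=0))
```

In this model, calling apply_length_rules("", min_length=0) returns an empty string, which corresponds to the zero-length records mentioned in the warning.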
Fastq generation for runs with mixed protocols and options
Multi-lane instruments such as the HiSeq platform provide the option to run mixtures of samples requiring different processing protocols in a single sequencing run, for example:
Samples in some lanes have different barcode index characteristics (e.g. different lengths) to those in other lanes
Some lanes contain standard samples whilst others contain 10xGenomics or ICELL8 single-cell samples
make_fastqs
is able to process these in a single run provided
that:
the sample sheet has the appropriate index sequences for each lane (for example, truncating index sequences, or inserting the appropriate 10xGenomics indexes); and
where different protocols or processing options need to be specified for groups of lanes, that these are specified via multiple
--lanes
options.
make_fastqs
will process each set of lanes separately
before combining them into a single output directory at the
end.
For example: say we have a HiSeq run with non-standard samples in lanes 5 and 6, and standard samples in all other lanes.
If the samples in lanes 5 and 6 have different barcode lengths to those in the other lanes, but should otherwise be treated the same, then the following command line would be sufficient to handle this:
auto_process.py make_fastqs \
--sample-sheet=SampleSheet.updated.csv
However if the samples in lanes 5 and 6 were 10xGenomics
Chromium single cell data, then it is necessary to explicitly
specify which lanes to group together and how each group should
be handled. This is done using the --lanes
option to
indicate that the 10x_chromium_sc
protocol should be used
with lanes 5 and 6, and that the standard
protocol should
be used with the other lanes:
auto_process.py make_fastqs \
--lanes=1-4,7-8:standard \
--lanes=5,6:10x_chromium_sc \
--sample-sheet=SampleSheet.updated.csv
Note
If the --lanes
option is used one or more times then
only those lanes explicitly listed will be processed.
Lanes that aren’t specified will be excluded from the
processing.
More generally it’s possible to set multiple options on a set of lanes using the --lanes option, for example to explicitly specify the adapter sequences for lane 8:
auto_process.py make_fastqs \
--lanes=1-7 \
--lanes=8:adapter=CTGTCTCTTATACACATCT \
--sample-sheet=SampleSheet.updated.csv
The general form of the --lanes
option is:
--lanes=LANES[:protocol][:OPTION=VALUE[:OPTION=VALUE...]]
The available options are:
| Option | Description |
|---|---|
|  | Set bases mask |
| r1_length | Truncate R1 reads to |
|  | Truncate R2 reads to |
|  | Turn adapter trimming on or off |
| adapter | Set adapter sequence for trimming |
|  | Set read2 adapter sequence |
|  | Set minimum trimmed read length |
|  | Set minimum read length below which sequences are masked |
|  | Set |
|  | Set |
|  | Set |
|  | Well list file ( |
|  | Turn I1/I2 swapping on or off ( |
|  | Set reverse complementing option ( |
|  | Turn barcode analysis on or off |
These options will override the defaults and any global values set by the top-level options.
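To make the structure of a --lanes value concrete, the following Python sketch splits one into its lanes, optional protocol and optional settings. It is an illustration of the general form given above, not the parser used by auto_process; the function names expand_lanes and parse_lanes_option are hypothetical.

```python
def expand_lanes(spec):
    """Expand a lane list such as "1-4,6,8" into [1, 2, 3, 4, 6, 8]."""
    lanes = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            lanes.extend(range(int(start), int(end) + 1))
        else:
            lanes.append(int(part))
    return lanes

def parse_lanes_option(value):
    """Split LANES[:protocol][:OPTION=VALUE...] into its three parts."""
    fields = value.split(":")
    lanes = expand_lanes(fields[0])
    protocol = None
    options = {}
    for field in fields[1:]:
        if "=" in field:
            key, val = field.split("=", 1)
            options[key] = val
        elif protocol is None and not options:
            # A bare word directly after the lanes is the protocol
            protocol = field
        else:
            raise ValueError(f"unexpected field in --lanes value: {field!r}")
    return lanes, protocol, options

print(parse_lanes_option("1-4,7-8:standard"))
print(parse_lanes_option("8:adapter=CTGTCTCTTATACACATCT"))
print(parse_lanes_option("1-2:standard:r1_length=28"))
```

Option values are kept as strings here; any conversion to numbers or booleans is left out of the sketch.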
It is also possible to process subsets of lanes manually, and
then use the merge_fastq_dirs
, update_fastq_stats
and
analyse_barcodes
commands to combine and analyse the Fastqs.
For example, for the mixture of standard and 10xGenomics samples previously described this might look like:
# Process lanes 1-4,7-8 (standard samples)
auto_process.py make_fastqs \
--lanes=1-4,7-8 \
--sample-sheet=SampleSheet.updated.csv \
--output-dir=bcl2fastq.L123478 \
--use-bases-mask=auto \
--no-barcode-analysis \
--no-stats
# Process lanes 5-6 (10xGenomics samples)
auto_process.py make_fastqs \
--lanes=5-6 \
--sample-sheet=SampleSheet.updated.csv \
--protocol=10x_chromium_sc \
--output-dir=bcl2fastq.L56 \
--use-bases-mask=auto \
--no-stats
# Combine outputs
auto_process.py merge_fastq_dirs \
--primary-unaligned-dir=bcl2fastq.L123478 \
--output-dir=bcl2fastq
# Generate statistics
auto_process.py update_fastq_stats
# Analyse barcodes (standard samples only)
auto_process.py analyse_barcodes --lanes=1-4,7-8
See the appropriate sections of the command reference for the full set of options available for each of these commands.
Processing a single run multiple times
Sometimes it is necessary to process a single run multiple times (for example, to try different parameter sets) while keeping the outputs from each processing attempt in the same analysis directory.
The --id
option of the make_fastqs
command can be used to
facilitate this, by allowing an identifier (e.g. no_trimming
)
to be supplied which will then be appended to the outputs from the
Fastq generation (including the output directories holding the
generated Fastqs, the barcode analysis directories, and the
statistics and processing report files).
For example:
auto_process.py make_fastqs --id=no_trimming --no-adapter-trimming
would produce bcl2fastq_no_trimming
, barcodes_no_trimming
,
statistics_no_trimming.info
and so on.
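The naming pattern in this example can be sketched in Python as follows. The helper add_run_id is hypothetical, and the base names passed to it are inferred from the example outputs listed above rather than taken from the auto_process source.

```python
import os

def add_run_id(name, run_id):
    """Append "_<run_id>" to a file or directory name, before any extension."""
    base, ext = os.path.splitext(name)
    return f"{base}_{run_id}{ext}"

# Base names below are assumptions inferred from the example above
for name in ("bcl2fastq", "barcodes", "statistics.info"):
    print(add_run_id(name, "no_trimming"))
```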
Note
The --id
option of the setup_analysis_dirs
command
can be used to create projects which carry the same identifier,
see Adding an identifier to project directory names.
Note
A simpler alternative is to set up a completely new parallel analysis directory for reprocessing, explicitly assigning a unique analysis number to distinguish it from other analysis attempts.
This can be done via the setup
command using the -n
option (see Specifying the analysis number), or by
setting the analysis_number
metadata item within an existing
analysis directory.
Specifying Illumina BCL conversion software
For the standard
and mirna
Fastq generation protocols,
it is possible to use either the bcl2fastq
or bcl-convert
software packages to convert raw BCL data into Fastq files.
The --bcl-converter
command line option can be used to
specify the BCL converter software and optionally also
restrict it to a version range (or a single version), for example:
auto_process.py make_fastqs --bcl-converter 'bcl-convert>=3.7'
Default BCL conversion software can be specified in the config file, both generally and on a per-platform basis (see Specifying BCL to Fastq conversion software and options).
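As an illustration of how a value such as 'bcl-convert>=3.7' combines a package name with a version constraint, the Python sketch below splits it apart and checks an installed version against it. This is not the logic used by auto_process: the function names are hypothetical, and only the >= form appears in this documentation, so support for == here is an assumption.

```python
import re

def parse_converter_spec(spec):
    """Split e.g. 'bcl-convert>=3.7' into (name, operator, version)."""
    match = re.fullmatch(r"([\w-]+)(?:(>=|==)([\d.]+))?", spec)
    if match is None:
        raise ValueError(f"unrecognised converter specification: {spec!r}")
    return match.group(1), match.group(2), match.group(3)

def version_tuple(version):
    """Convert '3.10.5' to (3, 10, 5) so versions compare numerically."""
    return tuple(int(x) for x in version.split("."))

def satisfies(installed_version, spec):
    """Check an installed version against the constraint in spec."""
    _, operator, required = parse_converter_spec(spec)
    if operator is None:
        return True  # no version constraint given
    installed = version_tuple(installed_version)
    wanted = version_tuple(required)
    if operator == ">=":
        return installed >= wanted
    return installed == wanted

print(parse_converter_spec("bcl-convert>=3.7"))
print(satisfies("3.10.5", "bcl-convert>=3.7"))  # True
print(satisfies("3.6.3", "bcl-convert>=3.7"))   # False
```

Versions are compared as tuples of integers because a plain string comparison would rank 3.10 below 3.7.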
Outputs
On completion the make_fastqs
command will produce:
An output directory called bcl2fastq with the demultiplexed Fastq files (see below for more detail)
A set of tab-delimited files with statistics on each of the Fastq files
An HTML report on the processing QC (see the section on Processing QC reports)
A projects.info metadata file which is used by the setup_analysis_dirs command when setting up analysis project directories (see Setting up project directories)
For standard runs there will be additional outputs:
A directory called
barcode_analysis
which will contain reports with analysis of the barcode index sequences (see the section on Barcode analysis)
If the run included 10xGenomics Chromium 3’ data then there will be some additional outputs:
A report in the top-level analysis directory called cellranger_qc_summary[_LANES].html, which is an HTML copy of the QC summary JSON file produced by cellranger mkfastq (NB LANES will be the subset of lanes from the run which contained the Chromium data, if the run consisted of a mixture of Chromium and non-Chromium samples; for example, --lanes=5,6 results in 56).
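The report naming can be sketched in Python as below. The function cellranger_qc_summary_name is hypothetical, and sorting the lane numbers before joining them is an assumption; the text above only confirms that lanes 5,6 give the suffix 56.

```python
def cellranger_qc_summary_name(lanes=None):
    """Build a report name of the form cellranger_qc_summary[_LANES].html."""
    if not lanes:
        # No lane subset: the plain report name is used
        return "cellranger_qc_summary.html"
    # Join the lane numbers with no separator (sorted order is assumed)
    suffix = "".join(str(lane) for lane in sorted(lanes))
    return f"cellranger_qc_summary_{suffix}.html"

print(cellranger_qc_summary_name([5, 6]))  # cellranger_qc_summary_56.html
print(cellranger_qc_summary_name())        # cellranger_qc_summary.html
```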
Note
The processing QC reports can be copied to the QC server using the publish_qc command.
Output Fastq files
Each sample defined in the input sample sheet will produce one or more output Fastq files, depending on:
whether the run was single- or paired-end,
whether the sample appeared in more than one lane, and
whether the --no-lane-splitting option was specified
By default if samples appear in more than one lane in a sequencing
run then make_fastqs
will generate multiple Fastqs with
each Fastq only containing reads from a single lane, and with
the lane number appearing in the Fastq file name.
However if the --no-lane-splitting
option is specified then
the reads from all lanes that the sample appeared in will be
combined into the same Fastq file.
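The effect of lane splitting on the number and names of the Fastq files can be sketched in Python. The file name pattern below follows the usual Illumina bcl2fastq convention (SAMPLE_S1_L001_R1_001.fastq.gz), which is an assumption here since this page does not spell the names out; the function expected_fastqs and the sample name are hypothetical.

```python
def expected_fastqs(sample, sample_number, lanes,
                    paired_end=True, lane_splitting=True):
    """List the Fastq names expected for one sample.

    Assumes Illumina-style names: SAMPLE_S<n>[_L<lane>]_R<read>_001.fastq.gz
    """
    reads = ["R1", "R2"] if paired_end else ["R1"]
    if lane_splitting:
        # One set of files per lane, with the lane number in the name
        lane_parts = [f"_L{lane:03d}" for lane in sorted(lanes)]
    else:
        # Reads from all lanes are combined into one set of files
        lane_parts = [""]
    return [f"{sample}_S{sample_number}{lane_part}_{read}_001.fastq.gz"
            for lane_part in lane_parts
            for read in reads]

# Paired-end sample sequenced in lanes 1 and 2
for name in expected_fastqs("SMPL1", 1, [1, 2]):
    print(name)
# The same sample with lane splitting turned off
for name in expected_fastqs("SMPL1", 1, [1, 2], lane_splitting=False):
    print(name)
```

With lane splitting this sample yields four files, and without it two.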
The default lane splitting behaviour can be controlled via the
configuration options in the auto_process.ini
file (see
configuration).
Note
Lane splitting is always performed for 10xGenomics single cell
data, regardless of the settings or options supplied to
make_fastqs
.