Requirements
Supported Python versions
The package consists predominantly of code written in Python, and the following versions are supported:
Python 3.6
Python 3.7
Python 3.8
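As a quick sanity check before installing, the interpreter version can be compared against the list above. This is a minimal sketch; the helper name `is_supported_python` is illustrative and is not part of the package.

```python
import sys

# Versions listed as supported in this document, as (major, minor) pairs.
SUPPORTED_PYTHON_VERSIONS = ((3, 6), (3, 7), (3, 8))


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is in the supported list."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_PYTHON_VERSIONS


if not is_supported_python():
    # Only a warning: other versions are simply not listed as supported.
    print("Warning: Python %d.%d is not in the supported list"
          % tuple(sys.version_info[:2]))
```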
Software dependencies
The following auto_process subcommands depend on additional third-party software packages, which must be installed separately:
Pipeline stage | Software packages | Notes |
---|---|---|
make_fastqs | | 2.17+ recommended |
make_fastqs | | Alternative to |
make_fastqs | | 10xGenomics Chromium single-cell RNA-seq data only |
make_fastqs | | 10xGenomics Chromium single-cell ATAC-seq data only |
make_fastqs | | 10xGenomics Multiome ATAC + GEX data |
make_fastqs | | 10xGenomics Visium spatial RNA-seq data only |
run_qc (*) | | |
run_qc (*) | | |
run_qc (*) | | Required by fastq_screen |
run_qc (*) | | Required for strandedness and alignment |
run_qc (*) | | Required for insert size metrics |
run_qc (*) | | Required for gene body coverage |
run_qc (*) | | Required for per-Fastq genomic origin of reads etc |
run_qc | | 10xGenomics Chromium single-cell RNA-seq data only |
run_qc | | 10xGenomics single-cell ATAC-seq data only |
run_qc | | 10xGenomics Multiome ATAC + GEX data |
run_qc (*) | | |
run_qc (*) | | Required for protocols mapping subsequences of reads (e.g. |
process_icell8 | | |
process_icell8 | | |
process_icell8 | | Required by fastq_screen |
(*) indicates packages that only need to be installed manually if conda dependency resolution (see Using conda to resolve pipeline dependencies) hasn't been enabled in the configuration (or by an appropriate command line option). In that case the programs provided by these packages must be available on the PATH when the appropriate auto_process commands are issued.
Environment modules can help to manage this (see Using environment modules).
Alternatively, many of these packages can be obtained from the bioconda project.
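When conda dependency resolution is not in use, it can help to confirm that the required programs are visible on the PATH before starting a run. The sketch below uses the standard library's `shutil.which`; the helper name is illustrative, and the program names are examples taken from this document that should be adjusted to the stages actually being run.

```python
import shutil


def find_missing_programs(programs, which=shutil.which):
    """Return the names from 'programs' that cannot be located on the PATH."""
    return [name for name in programs if which(name) is None]


# Example program names only; extend this list to match your pipeline stages.
missing = find_missing_programs(["bcl2fastq", "fastq_screen", "STAR"])
for name in missing:
    print("Not found on PATH: %s" % name)
```

The `which` argument exists only so that the lookup can be replaced in tests; in normal use the default is sufficient.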
Note
Fastq generation requires Illumina’s bcl2fastq software.
The recommended version is 2.17+ (earlier versions should work, but note that they cannot handle NextSeq data). If multiple bcl2fastq packages are available on the PATH at run time, see required_bcl2fastq_versions for how to specify which version is used.
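The "2.17+" recommendation amounts to a dotted-version comparison. The following sketch shows one way to express it, assuming the version is already available as a purely numeric dotted string; how that string is obtained from bcl2fastq is outside the scope of this example, and the function names are illustrative.

```python
def version_tuple(version):
    """Convert a dotted version string such as '2.20.0.422' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version, minimum="2.17"):
    """Return True if 'version' is at least 'minimum' (numeric comparison)."""
    return version_tuple(version) >= version_tuple(minimum)
```

Comparing tuples of integers avoids the pitfall of string comparison, where "2.9" would sort after "2.17".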
Reference data
The following auto_process stages require additional reference data:
run_qc
Reference data required for the calculation of various QC metrics are described in the sections below, along with the configuration settings needed to make them available to the QC pipeline.
FastqScreen
FastqScreen requires one or more conf files (each of which defines a specific “screen”), along with the underlying bowtie indexes for each organism included in the screens.
Indexes can be created manually, or by using the build_index.py utility (see Building indexes for aligners); the conf files must be created manually (see the FastqScreen documentation). Each screen can then be added to the configuration using a screen section which references the corresponding conf file, e.g.:
[screen:model_organisms]
conf_file = /data/model_organisms.conf
The screens to use in the pipeline must be set using the fastq_screens parameter in the qc section, e.g.:
[qc]
fastq_screens = model_organisms,other_organisms,rRNA
...
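To illustrate how the two settings relate, the sketch below reads a configuration with Python's standard `configparser` and reports which of the screens named in `fastq_screens` have a matching `screen` section. It is a standalone check and makes no claim about how the pipeline itself loads its configuration; the function name and the example text are illustrative.

```python
import configparser

EXAMPLE_CONFIG = """
[qc]
fastq_screens = model_organisms,other_organisms,rRNA

[screen:model_organisms]
conf_file = /data/model_organisms.conf
"""


def resolve_screens(config_text):
    """Map each screen named in [qc] fastq_screens to its conf_file (or None)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    value = parser.get("qc", "fastq_screens", fallback="")
    names = [name.strip() for name in value.split(",") if name.strip()]
    screens = {}
    for name in names:
        # A missing section or missing conf_file setting is reported as None.
        screens[name] = parser.get("screen:%s" % name, "conf_file",
                                   fallback=None)
    return screens


for name, conf_file in resolve_screens(EXAMPLE_CONFIG).items():
    print("%s: %s" % (name, conf_file or "no [screen:%s] section" % name))
```

In the example text only `model_organisms` is defined, so the other two screens are flagged as lacking a section.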
Note
This replaces the old qc.setup script that was used to define the location of a set of standard screen conf files in earlier versions of the pipeline. Note that qc.setup is no longer needed (and will be ignored if present).
Strandedness
Strandedness determination requires STAR indexes for each organism of interest. These can be defined using appropriate settings in [organism:...] sections of the auto_process.ini file, for example:
[organism: human]
star_index = /data/genomeIndexes/hg38/STAR/
[organism: mouse]
star_index = /data/genomeIndexes/mm10/STAR/
Indexes can be created manually, or by using the build_index.py utility (see Building indexes for aligners).
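The `[organism:...]` section naming can be unpacked mechanically. The sketch below, which again uses the standard `configparser` module and an illustrative function name, collects the settings for each organism into a dictionary; it is intended only to show the structure of these sections, not to reproduce the pipeline's own configuration handling.

```python
import configparser

EXAMPLE_CONFIG = """
[organism: human]
star_index = /data/genomeIndexes/hg38/STAR/

[organism: mouse]
star_index = /data/genomeIndexes/mm10/STAR/
"""


def organism_settings(config_text):
    """Return {organism name: {setting: value}} for all [organism:...] sections."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    organisms = {}
    for section in parser.sections():
        if section.startswith("organism:"):
            # The organism name is whatever follows the first colon.
            name = section.split(":", 1)[1].strip()
            organisms[name] = dict(parser.items(section))
    return organisms


for name, settings in sorted(organism_settings(EXAMPLE_CONFIG).items()):
    print(name, settings.get("star_index"))
```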
Note
The [organism:...] sections supersede the old fastq_strand_indexes section of the auto_process.ini file; the older section is still recognised for now but is deprecated and likely to be dropped in future.
Insert size metrics (Picard)
Picard’s CollectInsertSizeMetrics needs a STAR index for each organism of interest (in order to generate a BAM file from the sequences). This should be specified in the [organism:...] sections of the auto_process.ini configuration file, for example:
[organism: human]
star_index = /data/genomeIndexes/hg38/STAR/
STAR indexes can be created manually, or by using the build_index.py utility (see Building indexes for aligners).
RSeQC gene body coverage
RSeQC geneBody_coverage.py needs both a STAR index (in order to generate a BAM file from the sequences) and gene annotation in BED format, for each organism of interest. These should be specified in the [organism:...] sections of the auto_process.ini configuration file, for example:
[organism: human]
star_index = /data/genomeIndexes/hg38/STAR/
annotation_bed = /data/genomeIndexes/hg38/hg38.HouseKeepingGenes.bed
Note
STAR indexes can be created manually, or by using the build_index.py utility (see Building indexes for aligners). Suitable gene model files for human and mouse can be downloaded from the RSeQC webpages at
http://rseqc.sourceforge.net/#download-gene-models-update-on-12-14-2021
Qualimap RNA-seq metrics
Qualimap’s rnaseq command needs a STAR index (in order to generate a BAM file from the sequences) and gene annotation in GTF format, for each organism of interest. The pipeline uses RSeQC’s infer_experiment.py command to determine strand specificity for input to Qualimap.
All these should be specified in the [organism:...] sections of the auto_process.ini configuration file, for example:
[organism: human]
star_index = /data/genomeIndexes/hg38/STAR/
annotation_gtf = /data/genomeIndexes/hg38/gencode.v40.annotation.gtf
STAR indexes can be created manually, or by using the build_index.py utility (see Building indexes for aligners).
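The per-organism requirements described in the preceding sections can be summarised as a lookup from QC metric to the settings it needs. The sketch below encodes that summary; the metric labels and the function name are illustrative choices for this example rather than identifiers used by the pipeline.

```python
# Settings each QC metric needs per organism, as described in the text.
# The metric labels on the left are illustrative, not pipeline identifiers.
REQUIRED_SETTINGS = {
    "strandedness": ("star_index",),
    "picard_insert_size": ("star_index",),
    "rseqc_genebody_coverage": ("star_index", "annotation_bed"),
    "qualimap_rnaseq": ("star_index", "annotation_gtf"),
}


def available_metrics(settings):
    """Return the metrics whose required settings are all present and non-empty."""
    return sorted(
        metric
        for metric, keys in REQUIRED_SETTINGS.items()
        if all(settings.get(key) for key in keys)
    )


human = {
    "star_index": "/data/genomeIndexes/hg38/STAR/",
    "annotation_gtf": "/data/genomeIndexes/hg38/gencode.v40.annotation.gtf",
}
print(available_metrics(human))
```

With only a STAR index and GTF annotation defined, the gene body coverage metric is absent from the result because no BED annotation is set.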
Single cell analyses
Single library analyses of 10xGenomics single cell data require the appropriate compatible reference datasets for cellranger[-atac|-arc] count:
scRNA-seq data: transcriptome reference data set
snRNA-seq data: “pre-mRNA” reference data set (which includes both intronic and exonic information)
sc/snATAC-seq: Cell Ranger ATAC compatible genome reference
single cell multiome GEX+ATAC data: cellranger-arc compatible reference package
These can all be defined using appropriate settings in [organism:...] sections of the auto_process.ini file, for example:
[organism: human]
cellranger_reference = /data/10x/refdata-cellranger-GRCh38-1.2.0
cellranger_premrna_reference = /data/10x/refdata-cellranger-GRCh38-1.2.0_premrna
cellranger_atac_reference = /data/10x/refdata-cellranger-atac-GRCh38-1.0.1
cellranger_arc_reference = /data/10x/refdata-cellranger-arc-GRCh38-2020-A
[organism: mouse]
cellranger_reference = /data/10x/refdata-cellranger-mm10-1.2.0
cellranger_atac_reference = /data/10x/refdata-cellranger-atac-mm10-1.0.1
cellranger_arc_reference = /data/10x/refdata-cellranger-arc-mm10-2020-A
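The correspondence between data type and configuration setting can be written as a simple lookup. In the sketch below the data type labels are taken from the list above and the setting names from the example configuration; the function name is illustrative, and the pipeline's own selection logic may differ.

```python
# Which [organism:...] setting supplies the reference for each data type.
REFERENCE_SETTING = {
    "scRNA-seq": "cellranger_reference",
    "snRNA-seq": "cellranger_premrna_reference",
    "sc/snATAC-seq": "cellranger_atac_reference",
    "multiome GEX+ATAC": "cellranger_arc_reference",
}


def reference_for(data_type, settings):
    """Return the reference path for 'data_type', or None if it is not set."""
    return settings.get(REFERENCE_SETTING[data_type])


mouse = {
    "cellranger_reference": "/data/10x/refdata-cellranger-mm10-1.2.0",
    "cellranger_atac_reference": "/data/10x/refdata-cellranger-atac-mm10-1.0.1",
    "cellranger_arc_reference": "/data/10x/refdata-cellranger-arc-mm10-2020-A",
}
for data_type in REFERENCE_SETTING:
    print("%s: %s" % (data_type, reference_for(data_type, mouse)))
```

For the mouse example no pre-mRNA reference is configured, so the snRNA-seq lookup returns None.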
Note
Alternatively, reference data sets can be specified at run time for single cell and single nuclei RNA-seq using the --10x_transcriptome and --10x_premrna_reference command line options respectively, with the run_qc command and the run_qc.py utility.
10xGenomics provide a number of reference data sets for scRNA-seq, ATAC-seq and single cell multiome data, which can be downloaded via:
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation
https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/installation
https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/installation
There are also instructions for constructing reference data for novel organisms that are not supported by 10xGenomics.
Pre-mRNA references are not currently available, but the documentation explains how to generate a custom reference package for these data.
Note
The [organism:...] sections supersede the old 10xgenomics... sections of the auto_process.ini file; the old sections are still recognised for now but are deprecated and likely to be dropped in future.
10x Genomics fixed RNA profiling (Flex) analyses
Analysis of 10xGenomics single cell fixed RNA profiling data (“Flex”) uses cellranger multi and requires:
Reference transcriptome dataset, and
Probe set data
These can be defined for specific organisms using the cellranger_reference and cellranger_probe_set settings in [organism:...] sections of the auto_process.ini file, for example:
[organism: human]
cellranger_reference = /data/10x/refdata-cellranger-GRCh38-1.2.0
cellranger_probe_set = /data/10x/Chromium_Human_Transcriptome_Probe_Set_v1.0_GRCh38-2020-A.csv
Annotation data
Annotation data in BED and GTF formats can be specified for organisms of interest via the annotation_bed and annotation_gtf settings respectively in [organism:...] sections of the auto_process.ini file.
For example:
[organism: human]
annotation_bed = /data/genomeIndexes/hg38/annotation/hg38_NCBI_RefSeq_All.bed
annotation_gtf = /data/genomeIndexes/hg38/annotation/hg38_NCBI_RefSeq_All.gtf
[organism: mouse]
annotation_bed = /data/genomeIndexes/mm10/annotation/gencode.vM25.annotation.bed
annotation_gtf = /data/genomeIndexes/mm10/annotation/gencode.vM25.annotation.gtf
process_icell8 (contaminant filtering)
The contaminant filtering stage of process_icell8 needs two fastq_screen conf files to be set up, one containing bowtie indexes for “mammalian” genomes (typically human and mouse) and another containing indexes for “contaminant” genomes (yeast, E. coli, UniVec7, PhiX, mycoplasma, and adapter sequences).
These can be defined in the icell8 section of the auto_process.ini file, for example:
[icell8]
mammalian_conf_file = /data/icell8/mammalian_genomes.conf
contaminants_conf_file = /data/icell8/contaminant_genomes.conf
Otherwise they must be specified using the relevant command line options.
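Before pointing the configuration at a conf file it can be useful to confirm which genomes it actually defines. The sketch below lists the DATABASE entries in conf file text, assuming the whitespace-separated `DATABASE <name> <index path>` layout described in the FastQ Screen documentation; the function name, database names and paths in the example are hypothetical.

```python
def list_databases(conf_text):
    """Return (name, index_path) pairs for each DATABASE line in conf file text."""
    databases = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#', so their first field is never 'DATABASE'.
        if len(fields) >= 3 and fields[0] == "DATABASE":
            databases.append((fields[1], fields[2]))
    return databases


# Hypothetical conf file content for a "mammalian" screen.
EXAMPLE_CONF = """
# Mammalian genomes
DATABASE human /data/indexes/hg38/bowtie/hg38
DATABASE mouse /data/indexes/mm10/bowtie/mm10
# DATABASE rat /data/indexes/rn6/bowtie/rn6
"""

for name, index_path in list_databases(EXAMPLE_CONF):
    print("%s -> %s" % (name, index_path))
```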
Building indexes for aligners
The build_index.py utility can be used to build indexes for bowtie, bowtie2 and STAR from the appropriate data files (which must be obtained separately).
For example, to build indexes for hg38 using STAR version 2.7.7a:
build_index.py star -V 2.7.7a \
-o hg38_STAR_2.7.7a_gencode40 \
/mnt/genome_data/hg38/hg38.fa \
/mnt/genome_data/hg38/hg38.gencode.v40.annotation.gtf
Note
If conda dependency resolution (see Using conda to resolve pipeline dependencies) isn’t enabled then the required aligner must be accessible on the PATH, and the requested aligner version will be ignored.
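When indexes are built for several genomes it can be convenient to assemble the command line programmatically. The sketch below builds the argument list for the STAR form of the command shown above, using only the options that appear in that example (`-V` and `-o`); the helper name is illustrative, and the list can be passed to `subprocess.run` if desired.

```python
import shlex


def star_build_index_args(fasta, gtf, out_dir, star_version=None):
    """Return the build_index.py argument list for building a STAR index."""
    args = ["build_index.py", "star"]
    if star_version:
        # The version is only honoured when conda dependency resolution is enabled.
        args += ["-V", star_version]
    args += ["-o", out_dir, fasta, gtf]
    return args


args = star_build_index_args(
    "/mnt/genome_data/hg38/hg38.fa",
    "/mnt/genome_data/hg38/hg38.gencode.v40.annotation.gtf",
    "hg38_STAR_2.7.7a_gencode40",
    star_version="2.7.7a",
)
# Print a shell-safe rendering of the command without running it.
print(" ".join(shlex.quote(arg) for arg in args))
```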