Configuration
Overview
The autoprocessor reads its global settings for the local system from an
auto_process.ini
file, which it looks for in order in the following
locations:
The file specified by the AUTO_PROCESS_CONF environment variable (if set)
The current directory
The config subdirectory of the installation directory
The installation directory (for legacy installations only)
Note
In previous versions of the package the configuration file was
called settings.ini
, and this will be used as a fallback if
no auto_process.ini
file is found in the locations above.
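The search order above can be sketched in Python as follows. This is only an illustration, not the package's own lookup code: the function name find_config is hypothetical, and it assumes that the legacy settings.ini is looked for in the same directories, and that a missing file named by AUTO_PROCESS_CONF is skipped rather than treated as an error.

```python
import os
from pathlib import Path


def find_config(install_dir, cwd=None, environ=None):
    """Return the first configuration file found, or None (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)
    install_dir = Path(install_dir)
    # Directories searched after the environment variable, in order
    search_dirs = [cwd, install_dir / "config", install_dir]
    # 1. File named by AUTO_PROCESS_CONF (assumption: skipped if it is missing)
    env_file = environ.get("AUTO_PROCESS_CONF")
    if env_file and Path(env_file).is_file():
        return Path(env_file)
    # 2. auto_process.ini in each location, then the legacy settings.ini
    #    (assumption: the fallback uses the same directories)
    for name in ("auto_process.ini", "settings.ini"):
        for directory in search_dirs:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    # Nothing found: the built-in defaults would apply
    return None
```

Passing cwd and environ explicitly makes the lookup easy to exercise against a scratch directory without touching the real environment.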
To create an auto_process.ini
file for a new installation, use the
command
auto_process.py config --init
To see the current settings, do
auto_process.py config
To change the settings, either use the --set
option, for example
auto_process.py config --set bcl2fastq.nprocessors=4
or simply edit the auto_process.ini
file by hand using a text editor.
Note
If no auto_process.ini
file exists then auto_process
will run
using the built-in default values.
Note
Many of the configuration options can be overridden at run time
using command line options for the specific auto_process
subcommands.
Basic configuration
Basic Fastq generation with auto_process
requires minimal configuration when running locally; provided that the required
BCL conversion software is available on the system (see
Software dependencies) it should run without further setup.
Running the QC pipeline requires additional software plus reference data, which are described in more detail in Reference data.
Sequencers and platforms
Information about sequencers used at the local site can be stored
in sequencer
sections of the configuration file.
A sequencer can be defined by adding a new section of the form
[sequencer:INSTRUMENT_NAME]
, where INSTRUMENT_NAME
is the
ID name for the instrument. Within each section the following
data items can then be associated with the sequencer:
Setting | Description |
---|---|
platform | Compulsory: sets the generic platform name |
model | Text describing the sequencer model |
For example: if the local facility has a HiSeq 4000 instrument
with ID SN7001250
then this would be defined in auto_process.ini
as follows:
[sequencer:SN7001250]
platform = hiseq4000
model = "HiSeq 4000"
The instrument name can be derived from the name of the directories produced by the sequencer (see Run and Fastq naming conventions).
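As a rough illustration of how [sequencer:INSTRUMENT_NAME] sections can be read, the sketch below uses Python's standard configparser module. The function name read_sequencers is hypothetical, and stripping the surrounding double quotes from values is an assumption about how quoted values should be interpreted; this is not the package's own parser.

```python
from configparser import ConfigParser

EXAMPLE = """
[sequencer:SN7001250]
platform = hiseq4000
model = "HiSeq 4000"
"""


def read_sequencers(text):
    """Map instrument name -> settings from [sequencer:NAME] sections."""
    parser = ConfigParser()
    parser.read_string(text)
    sequencers = {}
    for section in parser.sections():
        if not section.startswith("sequencer:"):
            continue
        name = section.split(":", 1)[1]
        # 'platform' is the compulsory item for each sequencer
        if not parser.has_option(section, "platform"):
            raise ValueError(f"[{section}]: 'platform' is compulsory")
        # Assumption: surrounding double quotes are not part of the value
        sequencers[name] = {key: value.strip('"')
                            for key, value in parser.items(section)}
    return sequencers


sequencers = read_sequencers(EXAMPLE)
print(sequencers["SN7001250"]["platform"])
```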
Note
These sections replace the old
sequencers
section used to define the sequencer platforms, e.g.
[sequencers]
SN7001250 = hiseq4000
This section is still supported but is now deprecated.
Each platform referenced in the [sequencer:...]
sections can
optionally be defined in its own [platform:...]
section, where
platform-specific options for Fastq generation can be set to
override those in the [bcl_conversion]
section.
The available options are:
|
Specify the BCL conversion software to be used when processing data from this platform (see Specifying BCL to Fastq conversion software and options) |
|
Optionally, specify the number of processors to use when performing the BCL to Fastq conversion (deprecated; it is recommended to set this implicitly via the job runners, see Setting number of available CPUs) |
|
Specify whether to merge Fastqs for the same
sample across lanes (set to |
|
Specify whether to create “empty” placeholder Fastqs for samples where demultiplexing failed to assign any reads |
For example:
[platform:hiseq4000]
bcl_converter = bcl2fastq>=2.20
Default metadata
The metadata
section of the configuration file allows defaults
to be specified for metadata items associated with each run.
Currently it is possible to set a default for the source
metadata item, which specifies where the data was received from,
for example:
[metadata]
default_data_source = "Local sequencing facility"
If no default is set then the values can be updated using the
metadata
command (see metadata).
Job Runners
Job runners are used within auto_process
to tell the pipelines
how to execute commands. There are currently two types of runner available:
SimpleJobRunner runs jobs as a subprocess of the current process, so they run locally (i.e. on the same hardware that the auto_process command was started on)
GEJobRunner submits jobs to Grid Engine (GE), which enables it to exploit additional resources available on a compute cluster (see Running on a compute cluster)
Job runners can also be configured to specify the number of CPUs available to commands that are executed using them (see Setting number of available CPUs).
By default auto_process
is configured to use SimpleJobRunner
for all jobs; the default runner is defined in the settings:
[general]
default_runner = SimpleJobRunner
This default can be overridden for specific commands and pipeline
stages by explicitly specifying alternative runners in the runners
section of the settings file:
Runner name |
Used for |
---|---|
|
Running barcode analysis tasks in Fastq generation |
|
Running |
|
Running |
|
Running |
|
Running |
|
Running |
|
Running commands to generate statistics
after Fastq generation (e.g.
|
|
Running commands for transferring data (e.g. copying primary data for Fastq generation, archiving etc) |
|
Running |
|
Running |
|
Merging Fastq files in Fastq generation |
|
Running pipeline tasks which use |
|
Running |
|
Running |
|
Running |
|
Running jobs for QC publication |
|
Default runner for commands in the ICELL8 processing pipeline |
|
Running the contaminant filtering in the ICELL8 pipeline |
|
Generating statistics for ICELL8 data |
|
Reporting on the ICELL8 pipeline |
The following runners are supported but deprecated:
|
Running |
|
Running generally computationally intensive
QC commands (used as a fallback for
|
Note
It’s recommended to only explicitly configure those runners for which the default runner is not suitable, to avoid a proliferation of unnecessary runner definitions in the configuration file.
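The fallback from a specific runner to the default runner can be sketched with Python's configparser. The helper name get_runner is hypothetical and the sketch only models the lookup order described in this section (an entry in the runners section first, then default_runner from the general section, then SimpleJobRunner); it is not the package's own implementation.

```python
from configparser import ConfigParser


def get_runner(config_text, name):
    """Return the runner definition to use for the runner called 'name'."""
    parser = ConfigParser()
    parser.read_string(config_text)
    # Default runner: SimpleJobRunner unless [general] says otherwise
    default = parser.get("general", "default_runner",
                         fallback="SimpleJobRunner")
    # A specific entry in [runners] takes precedence over the default
    return parser.get("runners", name, fallback=default)
```

With this lookup, only runners that genuinely differ from the default need an entry in the runners section.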
Setting number of available CPUs
Job runners allow the number of available CPUs (aka processors or
threads) to be specified, and this information is then used when
running jobs in the auto_process
pipelines.
For SimpleJobRunners
the number of CPUs is specified via the
nslots
argument. For example:
[runners]
qc = SimpleJobRunner(nslots=8)
(Without nslots
the number of CPUs implicitly defaults to 1.)
For GEJobRunners
the number of available CPUs is inferred from the
-pe smp.pe
argument (see Running on a compute cluster).
For some commands the number of available CPUs will be taken implicitly from this argument unless explicitly overridden by the following settings:
Section |
Setting |
Overrides runner |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(*) Used when cellranger
is run with --jobmode=local
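To make the CPU inference concrete, here is a minimal sketch that reads the number of CPUs from a runner definition string. The function name nslots_from_runner is hypothetical, and treating a GEJobRunner without a -pe argument as a single CPU is an assumption (the text above only states the default of 1 for SimpleJobRunner).

```python
import re


def nslots_from_runner(spec):
    """Return the number of CPUs implied by a runner definition string."""
    match = re.fullmatch(r"(\w+)(?:\((.*)\))?", spec.strip())
    if match is None:
        raise ValueError(f"unrecognised runner definition: {spec!r}")
    runner, args = match.group(1), match.group(2) or ""
    if runner == "SimpleJobRunner":
        # CPUs are given explicitly via nslots=N
        found = re.search(r"\bnslots\s*=\s*(\d+)", args)
    elif runner == "GEJobRunner":
        # '-pe ENVIRONMENT N' requests N slots in a parallel environment
        found = re.search(r"-pe\s+\S+\s+(\d+)", args)
    else:
        raise ValueError(f"unknown runner type: {runner}")
    # Assumption: one CPU when nothing is specified
    return int(found.group(1)) if found else 1
```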
Running on a compute cluster
The GEJobRunner
can be used to make auto_process
submit its
computationally intensive jobs to a compute cluster rather than running them on
the local host; to switch to using GEJobRunner
, set the default
runner in the settings:
[general]
default_runner = GEJobRunner
Additional options for Grid Engine submission can be specified by enclosing them in parentheses when defining the runner; for example, sending all jobs to a particular queue might use:
default_runner = GEJobRunner(-q ngs.queue)
This default runner can further be overridden for specific commands
and pipeline stages by the settings in the runners
section of the
configuration file (see the previous section Job Runners).
For example: to run bcl2fastq
jobs in a parallel environment
with 8 cores might look like:
[runners]
bcl2fastq = GEJobRunner(-pe smp.pe 8)
Note
If you specify multiple processors for the bcl2fastq
runner and are
using GEJobRunner
then you should ensure that the job runner requests
a suitable number of cores when submitting jobs.
Note
When running on a cluster the auto_process
driver process should
run on the cluster login node; it has a small CPU and memory footprint
which should impact minimally on other users of the system.
Managing concurrent jobs and process loads
There are a number of settings available in the [general]
section which allow limits to be set on the resources that
auto_process
will try to consume when running jobs and
pipelines:
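Setting | Description |
---|---|
max_concurrent_jobs | Maximum number of jobs that auto_process will run concurrently |
max_cores | Maximum number of cores that auto_process will use at one time |
max_batches | Dynamically sets batch sizes within pipelines so that number of job batches from each task doesn’t exceed this number |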
For example:
[general]
max_cores = 24
If any of these is set to zero or None
then that resource is not limited by auto_process.
max_concurrent_jobs
and max_batches
are useful on
shared cluster systems, to avoid submitting large numbers of
jobs at one time.
max_cores
is useful when running on a local workstation,
to avoid exceeding resource limits while ensuring the most
efficient use of the available CPUs.
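The effect of max_batches can be illustrated with a small sketch that groups jobs so that the number of batches never exceeds the limit. The function make_batches is hypothetical and is not taken from the package; in particular, using one job per batch when the limit is zero or None is an assumption about what "not limited" means here.

```python
import math


def make_batches(jobs, max_batches=None):
    """Split 'jobs' into at most 'max_batches' groups (no limit if 0/None)."""
    if not jobs:
        return []
    if not max_batches:
        # Assumption: with no limit each job forms its own batch
        size = 1
    else:
        # Smallest batch size that keeps the batch count within the limit
        size = math.ceil(len(jobs) / max_batches)
    return [jobs[i:i + size] for i in range(0, len(jobs), size)]
```

For ten jobs and max_batches set to 4 this gives batches of three jobs (plus one final smaller batch), so four submissions are made instead of ten.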
Using environment modules
Environment modules provide a way to dynamically modify the user’s environment. They can be especially useful to provide access to multiple versions of the same software package, and to manage conflicts between packages.
The [modulefiles]
section in auto_process.ini
allows specific module
files to be loaded before a specific step, for example:
[modulefiles]
make_fastqs = apps/bcl2fastq/2.20
These can be defined for the following stages:
make_fastqs
run_qc
publish_qc
process_icell8
(see Software dependencies for details of what software is required for each of these stages.)
Note
These can be overridden for the make_fastqs
and run_qc
stages
using the --modulefiles
option.
Environment modules for Fastq generation
For the make_fastqs
stage, additional module files can be specified
for individual tasks within the Fastq generation pipeline:
bcl2fastq
bcl_convert
cellranger_mkfastq
cellranger_atac_mkfastq
cellranger_arc_mkfastq
If any of these are defined then they will be loaded for the relevant tasks in the Fastq generation pipeline.
Environment modules for the QC pipeline
For the run_qc
stage, additional module files can be specified for
individual tasks within the QC pipeline:
fastqc
fastq_screen
fastq_strand
cellranger
report_qc
If any of these are defined then they will be loaded for the relevant tasks in the QC pipeline.
Note
In older pipeline versions the illumina_qc
module file setting
was used for the illumina_qc.sh
script, which ran both
FastQC and FastqScreen. illumina_qc.sh
has now been dropped;
however, if the illumina_qc
module file is still set in the
configuration then this will be used as a fallback if fastqc
and fastq_screen
module files are not set explicitly.
Using conda to resolve pipeline dependencies
For certain pipelines and tasks it is possible to enable the conda
package management utility to handle setting up appropriate run-time
environments, rather than having to manually install the required
dependencies and specify their locations (e.g. using environment
modules).
To do this by default, set the enable_conda
parameter in the
[conda]
section, i.e.:
[conda]
enable_conda = true
Note that this requires conda
to be installed and available on the
user’s PATH
at run-time.
By default a temporary directory will be used when creating and reusing
conda
environments, but this can be overridden by setting the
env_dir
parameter, e.g.:
[conda]
enable_conda = true
env_dir = $HOME/conda_envs
Specifying BCL to Fastq conversion software and options
The [bcl_conversion]
section sets the default settings for BCL
to Fastq generation:
|
Specify the BCL conversion software to be used when processing data from this platform; see below for more information |
|
Optionally, specify the number of processors to use when performing the BCL to Fastq conversion (deprecated; it is recommended to set this implicitly via the job runners, see Setting number of available CPUs) |
|
Specify whether to merge Fastqs for the same
sample across lanes (set to |
|
Specify whether to create “empty” placeholder Fastqs for samples where demultiplexing failed to assign any reads |
Note
This replaces the settings in the old [bcl2fastq]
section,
which is now deprecated.
The bcl_converter
setting can be used to specify both the software
package and optionally also a required version; it takes the general
form:
bcl_converter = PACKAGE[REQUIREMENT]
Valid package names are:
bcl2fastq
bcl-convert
Version requirements are specified by prefacing the version number by
one of the operators >
, >=
, <=
and <
(==
can also
be specified explicitly), for example:
bcl_converter = bcl-convert>=3.7
Alternatively a comma-separated list can be provided:
bcl_converter = bcl2fastq>=1.8.3,<2.0
If no version is explicitly specified then the highest available version will be used.
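The PACKAGE[REQUIREMENT] form can be sketched as a small parser plus a version selector. The function names parse_converter and best_version are hypothetical, versions are assumed to be purely numeric and dot-separated, and only the operators listed above are handled; the package's own matching rules may differ.

```python
import operator
import re

OPERATORS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
             ">": operator.gt, "<": operator.lt}


def parse_version(version):
    """Convert e.g. '1.8.3' to (1, 8, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def parse_converter(spec):
    """Split 'PACKAGE[REQUIREMENT]' into (package, [(operator, version)])."""
    match = re.fullmatch(r"([A-Za-z0-9-]+)(.*)", spec.strip())
    if match is None:
        raise ValueError(f"bad converter specification: {spec!r}")
    package, rest = match.groups()
    requirements = []
    # Requirements are comma-separated, e.g. '>=1.8.3,<2.0'
    for item in filter(None, (s.strip() for s in rest.split(","))):
        req = re.fullmatch(r"(>=|<=|==|>|<)(\d+(?:\.\d+)*)", item)
        if req is None:
            raise ValueError(f"bad version requirement: {item!r}")
        requirements.append((req.group(1), parse_version(req.group(2))))
    return package, requirements


def best_version(spec, available):
    """Return the highest version in 'available' that satisfies 'spec'."""
    _, requirements = parse_converter(spec)
    matching = [v for v in available
                if all(OPERATORS[op](parse_version(v), ref)
                       for op, ref in requirements)]
    return max(matching, key=parse_version) if matching else None
```

Comparing versions as integer tuples rather than strings matters here: as strings, "2.20" would sort below "2.3".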
QC pipeline configuration
Several steps in the QC pipeline require reference data to be defined as described in the section run_qc.
Additionally the [qc]
section allows other aspects of the
QC pipeline operation to be explicitly specified.
Setting size of Fastq read subsets
The default size of the subset of reads used by FastqScreen
when generating the screens, generating BAM files and so on
can be set using the fastq_subset_size
parameter, e.g.:
[qc]
fastq_subset_size = 10000
...
Setting this to zero will force all reads to be used for the appropriate QC stages (note that this can result in extended run time for the QC pipeline, and larger intermediate and final output files).
Note
fastq_subset_size
replaces the deprecated legacy
fastq_screen_subset
parameter (which will however be
used as a fallback if fastq_subset_size
is not present).
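The fallback between the two parameter names might look like the following sketch. The helper get_subset_size is hypothetical, and it assumes that the legacy fastq_screen_subset parameter is read from the same qc section; treat it as an illustration of the lookup order only.

```python
from configparser import ConfigParser


def get_subset_size(config_text, default=None):
    """Return the read subset size, preferring the current parameter name."""
    parser = ConfigParser()
    parser.read_string(config_text)
    # Current name first, then the deprecated one
    # (assumption: both live in the [qc] section)
    for option in ("fastq_subset_size", "fastq_screen_subset"):
        if parser.has_option("qc", option):
            return parser.getint("qc", option)
    return default
```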
Per-lane QC for undetermined Fastqs
By default the QC pipeline will split Fastqs from the
undetermined
project into separate lanes, in order to
generate per-lane metrics.
This is controlled by the split_undetermined_fastqs
parameter, which by default is implicitly set as:
[qc]
split_undetermined_fastqs = True
...
To disable the lane splitting, set this parameter to False
.
FastqScreen output file naming conventions
By default the QC pipeline creates FastqScreen outputs using the following naming convention:
{FASTQ}_screen_{SCREEN_NAME}.png
{FASTQ}_screen_{SCREEN_NAME}.txt
for example PJB_S1_L001_R1_001_screen_model_organisms.png
.
It is possible to revert to the older “legacy” naming
convention ({FASTQ}_{SCREEN_NAME}_screen.png
etc) by
setting the use_legacy_screen_names
parameter in the qc
section:
[qc]
use_legacy_screen_names = True
...
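The two naming conventions can be compared with a short sketch. The function screen_outputs is hypothetical, and removing the Fastq extension from the input file name is an assumption made so that the result matches the example name given above.

```python
import os


def screen_outputs(fastq, screen_name, legacy=False):
    """Return the FastqScreen .png and .txt output names for a Fastq file."""
    base = os.path.basename(fastq)
    # Assumption: {FASTQ} is the file name without its Fastq extension
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if base.endswith(ext):
            base = base[:-len(ext)]
            break
    if legacy:
        stem = f"{base}_{screen_name}_screen"   # older "legacy" convention
    else:
        stem = f"{base}_screen_{screen_name}"   # default convention
    return [f"{stem}.png", f"{stem}.txt"]
```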
Data transfer destinations
The transfer_data.py
utility can be used to copy Fastqs and other
data produced by the auto_process.py
pipeline to arbitrary
destinations, typically for sharing with end users of the pipeline.
The utility provides a number of command line options to specify a destination and the data that are transferred at runtime. However it is also possible to define one or more destinations in the configuration file, with appropriate presets for each destination.
A destination can be defined by adding a new section to the config
file of the form [destination:NAME]
, where NAME
is the name
that will be used to refer to the destination when it is specified in
a run of transfer_data.py
.
Within each section the following parameters can be set for the destination:
Parameter |
Function |
---|---|
|
Compulsory sets the destination directory
to copy files to; can be an arbitrary location
of the form |
|
Subdirectory naming scheme |
|
Whether to bundle Fastqs into ZIP archives |
|
Maximum size for each ZIP archive (if Fastqs are bundled) |
|
Template file to generate |
|
Base URL to access copied data at |
|
Whether to include |
|
Whether to include zipped QC reports |
|
Whether to hard link to Fastqs rather than making copies (for local directories on the same file system as the original Fastqs) |
For example:
[destination:webserver]
directory = /mnt/hosted/web
subdir = random_bin
readme_template = README.webserver
url = http://ourdata.com/shared
hard_links = true
See transfer_data.py: copying data for transfer to end users for more information on what these settings do.
Bash tab completion
The auto_process-completion.bash
file (installed into the
etc/bash_completion.d
subdirectory of the installation location) can be
used to enable tab completion of auto_process.py commands within bash
shells.
For a global installation, copy the file to the system’s /etc/bash_completion.d/ directory, to make it available to all users
For a local installation, source the file when setting up the environment for the installation (or source it in your ~/.bashrc or similar).