Utilities
Note
This documentation has been auto-generated from the command help
In addition to the main auto_process.py command, a number of utilities
are available:
analyse_barcodes.py
usage:
analyse_barcodes.py FASTQ [FASTQ...]
analyse_barcodes.py DIR
analyse_barcodes.py -c COUNTS_FILE [COUNTS_FILE...]
Collate and report counts and statistics for Fastq index sequences (aka
barcodes). If multiple Fastq files are supplied then sequences will be pooled
before being analysed. If a single directory is supplied then this will be
assumed to be an output directory from bcl2fastq and files will be processed
on a per-lane basis. If the -c option is supplied then the input must be one
or more files of barcode counts generated previously using the -o option.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Input and output options:
-c, --counts input is one or more counts files generated by
previous runs using the '-o/--output' option
-o COUNTS_FILE_OUT, --output COUNTS_FILE_OUT
output all counts to tab-delimited file
COUNTS_FILE_OUT. This can be used again in another run
by specifying the '-c' option
Reporting options:
-l LANES, --lanes LANES
restrict analysis to the specified lane numbers
(default is to process all lanes). Multiple lanes can
be specified using ranges (e.g. '2-3'), comma-
separated list ('5,7') or a mixture ('2-3,5,7')
-m MISMATCHES, --mismatches MISMATCHES
maximum number of mismatches to use when grouping
similar barcodes (will be determined automatically if
samplesheet is supplied, otherwise defaults to 0)
--cutoff CUTOFF exclude from reporting any barcodes/barcode groups with
a smaller fraction of associated reads than CUTOFF, e.g.
'0.01' excludes barcodes with < 1.0% of reads
(default: 0.001)
-s SAMPLE_SHEET, --sample-sheet SAMPLE_SHEET
report best matches against barcodes in SAMPLE_SHEET
-r REPORT_FILE, --report REPORT_FILE
write report to REPORT_FILE (otherwise write to
stdout)
-x XLS_FILE, --xls XLS_FILE
write XLS version of report to XLS_FILE
-f HTML_FILE, --html HTML_FILE
write HTML version of report to HTML_FILE
-t TITLE, --title TITLE
title for HTML report (default: 'Barcodes Report')
-n, --no-report suppress reporting (overrides --report)
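The lane specification accepted by -l/--lanes (ranges, comma-separated lists, or a mixture) can be illustrated with a small parser. This is a minimal sketch and not the code used by analyse_barcodes.py; the function name parse_lanes is hypothetical.

```python
def parse_lanes(spec):
    """Expand a lane specification such as '2-3,5,7' into a sorted list of ints."""
    lanes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as '2-3' includes both end points
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"bad lane range: {part!r}")
            lanes.update(range(start, end + 1))
        else:
            lanes.add(int(part))
    return sorted(lanes)


print(parse_lanes("2-3,5,7"))  # [2, 3, 5, 7]
```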
Advanced options:
--sequences count sequences instead of barcodes
--seq-start SEQ_START
specify first base of sequence to analyse (for
--sequences option; default is start at the first base
position)
--seq-end SEQ_END specify last base of sequence to analyse (for
--sequences option; default is to end at the last base
position)
--minimum_read_fraction FRACTION
weed out individual barcodes from initial analysis
which have a smaller fraction of reads than FRACTION,
e.g. '0.001' removes barcodes with < 0.1% of reads;
speeds up analysis at the expense of accuracy as
reported counts will be approximate (default: 1.0e-6)
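Both --cutoff and --minimum_read_fraction are expressed as a fraction of the total reads. The sketch below shows one way such a threshold might be applied to a table of barcode counts; the function name apply_cutoff and the example counts are invented for illustration, and the real program may group similar barcodes before filtering.

```python
def apply_cutoff(counts, cutoff=0.001):
    """Keep barcodes whose share of all reads is at least `cutoff`."""
    total = sum(counts.values())
    if total == 0:
        return {}
    # Barcodes with a smaller fraction of reads than the cutoff are dropped
    return {bc: n for bc, n in counts.items() if n / total >= cutoff}


# Made-up counts: 10000 reads in total
counts = {"ACGTAC": 9000, "TTGCAA": 995, "GGGGGG": 5}
print(apply_cutoff(counts, cutoff=0.01))  # {'ACGTAC': 9000, 'TTGCAA': 995}
```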
assign_barcodes.py
usage: assign_barcodes.py [-h] [-n N] INPUT.fq OUTPUT.fq
Extract arbitrary sequence fragments from reads in INPUT.fq FASTQ file and
assign these as the index (barcode) sequences in the read headers in
OUTPUT.fq.
positional arguments:
INPUT.fq Input FASTQ file
OUTPUT.fq Output FASTQ file
optional arguments:
-h, --help show this help message and exit
-n N remove first N bases from each read and assign these as barcode
index sequence (default: 5)
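The operation described for -n can be sketched for a single read as follows. This assumes an Illumina-style header in which the index sequence is the last colon-separated field after the space (e.g. '1:N:0:INDEX'); the function name assign_barcode and the example read are hypothetical, and assign_barcodes.py itself may handle other header formats.

```python
def assign_barcode(header, seq, qual, n=5):
    """Move the first n bases of a read into the index field of its header."""
    barcode = seq[:n]
    # Trim sequence and quality together so they stay the same length
    seq, qual = seq[n:], qual[n:]
    name, _, description = header.partition(" ")
    fields = description.split(":")
    fields[-1] = barcode  # assumed position of the index sequence
    return f"{name} {':'.join(fields)}", seq, qual


# Made-up read for illustration
header = "@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:NNNNNN"
new_header, new_seq, new_qual = assign_barcode(header, "ACGTAGGCTTA", "IIIIIHHHHHH")
print(new_header)
print(new_seq, new_qual)
```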
audit_projects.py
usage: audit_projects.py [-h] [--pi PI_NAME] [--unassigned] [DIR [DIR ...]]
Summarise the disk usage for runs that have been processed using auto_process.
The supplied DIRs are directories holding one or more top-level analysis
directories corresponding to different runs. The program reports total disk
usage for projects assigned to each PI across all DIRs.
positional arguments:
DIR directory to search for analysis directories for auditing
optional arguments:
-h, --help show this help message and exit
--pi PI_NAME list data for PI(s) matching PI_NAME (can use glob-style
patterns)
--unassigned list data for projects where PI is not assigned
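Glob-style matching of PI names, as accepted by --pi, behaves like shell wildcard matching. A minimal sketch using the Python standard library is shown below; the PI names, sizes and the function name filter_by_pi are invented, and whether audit_projects.py matches case-sensitively is not stated in the help text.

```python
from fnmatch import fnmatchcase


def filter_by_pi(usage, pattern):
    """Return the entries of `usage` whose PI name matches a glob pattern."""
    return {pi: size for pi, size in usage.items() if fnmatchcase(pi, pattern)}


# Made-up disk usage per PI
usage = {"Smith": 1200, "Smythe": 300, "Jones": 850}
print(filter_by_pi(usage, "Sm*"))  # {'Smith': 1200, 'Smythe': 300}
```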
build_index.py
usage: build_index.py [-h] [-v] [-o OUT_DIR] [--ebwt_base NAME]
[--bt2_base NAME] [--overhang N] [--sa_index_nbases N]
[-V VERSION] [-r RUNNER]
ALIGNER FASTA [ANNOTATION]
Generate indexes for aligners
positional arguments:
ALIGNER aligner to build index for (one of 'bowtie',
'bowtie2', 'star')
FASTA FASTA file with sequence
ANNOTATION annotation file (for use with STAR)
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-o OUT_DIR output directory for indexes
Bowtie-specific options:
--ebwt_base NAME specify basename for output .ebwt files (defaults to
FASTA file basename)
Bowtie2-specific options:
--bt2_base NAME specify basename for output .bt2 files (defaults to
FASTA file basename)
STAR-specific options:
--overhang N set value for STAR --sjdbOverhang option (default:
100)
--sa_index_nbases N set value for STAR --genomeSAindexNbases option
(default: use default in STAR)
Advanced options:
-V VERSION, --aligner-version VERSION
specify the version of the aligner to target (only
works if conda dependency resolution is configured)
-r RUNNER, --runner RUNNER
explicitly specify runner definition for building the
index. RUNNER must be a valid job runner specification
e.g. 'GEJobRunner(-pe smp.pe 8)' (default: use
appropriate runner from configuration)
concat_fastqs.py
usage: concat_fastqs.py [-h] [--version] [-v] FASTQ [FASTQ ...] FASTQ_OUT
Concatenate reads from one or more input Fastq files into a single new file
FASTQ_OUT
positional arguments:
FASTQ Input FASTQ to concatenate
FASTQ_OUT Output FASTQ with concatenated reads
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v, --verbose verbose output
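Concatenating Fastq files amounts to appending the records of each input to the output in order. The sketch below does this at the byte level, which works for plain-text Fastqs and also for gzipped ones, since concatenated gzip files form a valid multi-member gzip stream. It is an illustration only: the function name concat_files and the demo reads are hypothetical, and concat_fastqs.py may perform additional checks.

```python
import gzip
import os
import shutil
import tempfile


def concat_files(inputs, output):
    """Append the bytes of each input file to `output`, in the order given."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with two tiny gzipped Fastqs containing made-up reads
tmp = tempfile.mkdtemp()
paths = []
for i, record in enumerate(["@r1\nACGT\n+\nIIII\n", "@r2\nTTGA\n+\nHHHH\n"]):
    path = os.path.join(tmp, f"part{i}.fastq.gz")
    with gzip.open(path, "wt") as fh:
        fh.write(record)
    paths.append(path)

merged = os.path.join(tmp, "merged.fastq.gz")
concat_files(paths, merged)
with gzip.open(merged, "rt") as fh:
    result = fh.read()
print(result)
```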
barcode_splitter.py
usage:
barcode_splitter.py [OPTIONS] FASTQ [FASTQ ...]
barcode_splitter.py [OPTIONS] FASTQ_R1,FASTQ_R2 [FASTQ_R1,FASTQ_R2 ...]
barcode_splitter.py [OPTIONS] DIR
Split reads from one or more input Fastq files into new Fastqs based on
matching supplied barcodes.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-b INDEX_SEQ, --barcode INDEX_SEQ
specify index sequence to filter using
-m N_MISMATCHES, --mismatches N_MISMATCHES
maximum number of differing bases to allow for two
index sequences to count as a match. Default is zero
(i.e. exact matches only)
-n BASE_NAME, --name BASE_NAME
basename to use for output files
-o OUT_DIR, --output-dir OUT_DIR
specify directory for output split Fastqs
-u UNALIGNED_DIR, --unaligned UNALIGNED_DIR
specify subdirectory with outputs from bcl2fastq
-l LANE, --lane LANE specify lane to collect and split Fastqs for
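The -m/--mismatches option counts the number of differing bases between two index sequences. A minimal sketch of that comparison is given below, assuming sequences of equal length are compared position by position; the function names and example barcodes are hypothetical and this is not the matching code used by barcode_splitter.py.

```python
def count_mismatches(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def match_barcode(index_seq, barcodes, max_mismatches=0):
    """Return the first barcode within max_mismatches of index_seq, else None."""
    for barcode in barcodes:
        if len(barcode) != len(index_seq):
            continue
        if count_mismatches(index_seq, barcode) <= max_mismatches:
            return barcode
    return None


barcodes = ["ACGTAC", "TTTTTT"]
print(match_barcode("ACGTAA", barcodes))                    # None
print(match_barcode("ACGTAA", barcodes, max_mismatches=1))  # ACGTAC
```

With the default of zero mismatches only exact matches are accepted, which is why the first call finds nothing.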
download_fastqs.py
usage: download_fastqs.py [-h] URL [DIR]
Download checksum file and fastqs from URL into current directory (or
directory DIR, if specified), and verify the downloaded files against the
checksum file.
positional arguments:
URL URL with checksum file and fastqs
DIR directory to put downloaded fastqs into (defaults to current
directory)
optional arguments:
-h, --help show this help message and exit
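Verifying downloaded files against a checksum file can be sketched as below. The help text does not state the checksum algorithm or file layout, so this assumes MD5 sums in the format written by the 'md5sum' command ('<digest>  <filename>' per line); the function names and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_checksums(checksum_file, directory):
    """Return {filename: True/False} for each entry in an md5sum-style file."""
    results = {}
    with open(checksum_file) as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, name = line.split(None, 1)
            name = name.strip().lstrip("*")  # '*' marks binary mode in md5sum
            results[name] = md5sum(os.path.join(directory, name)) == expected
    return results


# Demo with a made-up Fastq and a matching checksum file
tmp = tempfile.mkdtemp()
fastq = os.path.join(tmp, "sample_R1.fastq")
with open(fastq, "w") as fh:
    fh.write("@r1\nACGT\n+\nIIII\n")
checksums = os.path.join(tmp, "checksums.md5")
with open(checksums, "w") as fh:
    fh.write(f"{md5sum(fastq)}  sample_R1.fastq\n")
report = verify_checksums(checksums, tmp)
print(report)  # {'sample_R1.fastq': True}
```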
fastq_statistics.py
usage: fastq_statistics.py [-h] [-v] [--unaligned UNALIGNED_DIR]
[--sample-sheet SAMPLE_SHEET] [-o STATS_FILE]
[-p PER_LANE_STATS_FILE]
[-s PER_LANE_SAMPLE_STATS_FILE]
[-f FULL_STATS_FILE] [-u] [-n N] [--debug]
[--force]
ILLUMINA_RUN_DIR
Generate statistics for FASTQ files in ILLUMINA_RUN_DIR (top-level directory
of a processed Illumina run)
positional arguments:
ILLUMINA_RUN_DIR input Illumina run directory
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--unaligned UNALIGNED_DIR
specify an alternative name for the 'Unaligned'
directory containing the fastq.gz files
--sample-sheet SAMPLE_SHEET
specify a sample sheet file to get additional
information from
-o STATS_FILE, --output STATS_FILE
name of output file for per-file statistics (default
is 'statistics.info')
-p PER_LANE_STATS_FILE, --per-lane-stats PER_LANE_STATS_FILE
name of output file for per-lane statistics (default
is 'per_lane_statistics.info')
-s PER_LANE_SAMPLE_STATS_FILE, --per-lane-sample-stats PER_LANE_SAMPLE_STATS_FILE
name of output file for per-lane sample statistics (default
is 'per_lane_sample_stats.info')
-f FULL_STATS_FILE, --full-stats FULL_STATS_FILE
name of output file for full statistics (default is
'statistics_full.info')
-u, --update update existing full statistics file with stats for
additional files
-n N, --nprocessors N
spread work across N processors/cores (default is 1)
--debug turn on debugging output
Deprecated/defunct options:
--force does nothing: retained for backwards compatibility
fetch_data.py
usage: fetch_data.py [-h] [--version] [--flatten] [--overwrite]
[--runner RUNNER]
SOURCE DEST
Copy files and directories from arbitrary locations to the local system
positional arguments:
SOURCE source data (file or directory) to copy; can be on a local
or remote file system. NB if source is a directory then the
contents are copied (not the top-level directory)
DEST destination directory on local file system
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--flatten copy files without replicating the source directory
structure
--overwrite overwrite existing files (default is to skip existing
files)
--runner RUNNER specify the job runner to use for executing 'rsync'
operations (defaults to job runner defined for copying in
config file [SimpleJobRunner(join_logs=True)])
manage_fastqs.py
usage:
manage_fastqs.py DIR
manage_fastqs.py DIR PROJECT
manage_fastqs.py DIR PROJECT copy [[user@]host:]DEST
manage_fastqs.py DIR PROJECT md5
manage_fastqs.py DIR PROJECT zip
Fastq management utility. If only DIR is supplied then list the projects; if
PROJECT is supplied then list the fastqs; 'copy' command copies fastqs for the
specified PROJECT to DEST on a local or remote server; 'md5' command generates
checksums for the fastqs; 'zip' command creates a zip file with the fastq
files.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--samples SAMPLE_LIST
list of names of samples to transfer
--filter PATTERN filter file names for reporting and copying based on
PATTERN
--fastq_dir FASTQ_DIR
explicitly specify subdirectory of DIR with Fastq
files to operate on
--max_zip_size MAX_ZIP_SIZE
for 'zip' command, defines the maximum size for the
output zip file; multiple zip files will be created if
the data exceeds this limit (default is create a
single zip file with no size limit)
--link hard link files instead of copying
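The effect of --max_zip_size (several zip files when the data exceeds the limit) can be pictured as grouping files into batches by size. The sketch below uses a simple greedy grouping; the real splitting strategy, the units of MAX_ZIP_SIZE and whether sizes are measured before or after compression are not given in the help text, so all of these are assumptions, as are the function name and file names.

```python
def batch_by_size(files, max_size=None):
    """Group (name, size) pairs into batches whose total stays within max_size.

    With max_size=None everything goes into a single batch. A single file
    larger than max_size still gets a batch of its own.
    """
    if max_size is None:
        return [list(files)]
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_size:
            # Adding this file would exceed the limit, so start a new batch
            batches.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


# Made-up file names and sizes (arbitrary units)
files = [("S1_R1.fastq.gz", 4), ("S1_R2.fastq.gz", 4), ("S2_R1.fastq.gz", 4)]
for batch in batch_by_size(files, max_size=8):
    print([name for name, _ in batch])
```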
run_qc.py
usage: run_qc.py [-h] [--version] [--info] [-n NAME] [-o OUT_DIR]
[--qc_dir QC_DIR] [-f FILENAME] [-u] [--organism ORGANISM]
[--library-type LIBRARY] [--single-cell-platform PLATFORM]
[--biological-samples SAMPLES] [-p PROTOCOL]
[--fastq-pattern PATTERN] [--fastq_subset SUBSET]
[-t NTHREADS] [--star-index INDEX] [--gtf GTF]
[--cellranger CELLRANGER_EXE]
[--cellranger-reference REFERENCE]
[--cellranger-probeset PROBE_SET]
[--cellranger-vdj-reference VDJ_REFERENCE]
[--10x_chemistry {ARC-v1,SC3Pv1,SC3Pv2,SC3Pv3,SC3Pv3HT,SC3Pv3LT,SC3Pv4,SC5P-PE,SC5P-PE-v3,SC5P-R2,SC5P-R2-v3,auto,fiveprime,threeprime}]
[--10x_force_cells N_CELLS]
[--10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE]
[--10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID[,...]]
[--enable-conda {yes,no}] [--conda-env-dir CONDA_ENV_DIR]
[--local] [-c N] [-m M] [-j N] [-b NBATCHES]
[--protocol-spec SPECIFICATION] [--index-reads READS]
[--sequence-reads READS] [--qc-modules QC_MODULES]
[-r RUNNER] [-s N] [--ignore-metadata]
[--split-fastqs-by-lane] [--shorten-zip-paths {yes,no}]
[--use-legacy-screen-names {yes,no}] [--no-verify-fastqs]
[--no-multiqc] [--verbose] [--work-dir WORKING_DIR]
[--no-cleanup] [--fastq_screen_subset SUBSET] [--force]
[--multiqc]
DIR | FASTQ [FASTQ ...] [DIR | FASTQ [FASTQ ...] ...]
Run the QC pipeline standalone on an arbitrary set of Fastq files.
positional arguments:
DIR | FASTQ [ FASTQ ... ]
directory or list of Fastq files to run the QC on
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--info display information on protocols, organisms and other
settings (then exit)
Output and reporting:
-n NAME, --name NAME name for the project (used in report title)
-o OUT_DIR, --out_dir OUT_DIR
top-level directory for reports and QC output
subdirectory (default: current working directory)
--qc_dir QC_DIR explicitly specify QC output directory. NB if a
relative path is supplied then it's assumed to be a
subdirectory of OUT_DIR (default: <OUT_DIR>/qc)
-f FILENAME, --filename FILENAME
file name for output QC report (default:
<OUT_DIR>/<QC_DIR_NAME>_report.html)
-u, --update force QC pipeline to run even if output QC directory
already exists in <OUT_DIR> (default: stop if output
QC directory already exists)
Metadata:
--organism ORGANISM explicitly specify organism (e.g. 'human', 'mouse').
Multiple organisms should be separated by commas (e.g.
'human,mouse'). HINT use the --info option to list the
defined organisms
--library-type LIBRARY
explicitly specify library type (e.g. 'RNA-seq',
'ChIP-seq')
--single-cell-platform PLATFORM
explicitly specify the single cell platform (e.g.
'10xGenomics Chromium 3'v3')
--biological-samples SAMPLES
explicitly specify subset of sample names with
biological data as comma-separated list (e.g.
'AB1,AB2,..') (default: assume that all samples
contain biological data)
QC options:
-p PROTOCOL, --protocol PROTOCOL
explicitly specify the QC protocol to use; can be one
of 'minimal', 'standard', 'standardSE', 'standardPE',
'10x_scRNAseq', '10x_snRNAseq', '10x_scATAC',
'10x_Multiome_GEX', '10x_Multiome_ATAC', '10x_OCM',
'10x_CellPlex', '10x_Flex', '10x_ImmuneProfiling',
'10x_Visium_GEX', '10x_Visium_GEX_90bp_insert',
'10x_Visium_PEX', '10x_Visium_legacy',
'ParseEvercode', 'BioRad_ddSEQ_ATAC', 'singlecell'. If
not set then protocol will be determined automatically
based on directory contents and metadata.
--fastq-pattern PATTERN
specify a custom pattern for identifying sample and
read number information from the Fastq names (e.g.
'{SAMPLE}_{READ}')
--fastq_subset SUBSET
specify size of subset of reads to use for
FastQScreen, strandedness, coverage etc.
(default 100000, set to 0 to use all reads)
-t NTHREADS, --threads NTHREADS
number of threads to use for multicore jobs (default:
taken from job runners; ignored when using --local)
Reference data:
--star-index INDEX specify the path to the STAR genome index to use when
mapping reads for metrics such as strandedness etc
(overrides the organism-specific indexes defined in
the config file)
--gtf GTF specify the path to the GTF annotation file to use for
metrics such as 'qualimap rnaseq' (overrides the
organism-specific GTF files defined in the config
file)
Cellranger/10xGenomics options:
--cellranger CELLRANGER_EXE
explicitly specify path to Cellranger executable to
use for single library analysis
--cellranger-reference REFERENCE
specify the path to the reference dataset to use when
running single library analysis (overrides the
organism-specific references defined in the config
file)
--cellranger-probeset PROBE_SET
specify the path to the probe set reference dataset
for 'cellranger multi'
--cellranger-vdj-reference VDJ_REFERENCE
specify the path to the VDJ reference dataset for
'cellranger multi'
--10x_chemistry {ARC-v1,SC3Pv1,SC3Pv2,SC3Pv3,SC3Pv3HT,SC3Pv3LT,SC3Pv4,SC5P-PE,SC5P-PE-v3,SC5P-R2,SC5P-R2-v3,auto,fiveprime,threeprime}
assay configuration for 10xGenomics scRNA-seq; if set
to 'auto' (the default) then cellranger will attempt
to determine this automatically
--10x_force_cells N_CELLS
force number of cells for 10xGenomics scRNA-seq and
scATAC-seq, overriding automatic cell detection
algorithms (default is to use built-in cell detection)
--10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE
assign feature type to a Fastq file ID for 'cellranger
multi' analysis (optionally also specify a physical
sample)
--10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID[,...]
assign multiplexed sample names to CMO or probe IDs
for 'cellranger multi' analysis (optionally also
specify a physical sample)
Conda dependency resolution:
--enable-conda {yes,no}
use conda to resolve task dependencies; can be 'yes'
or 'no' (default: no)
--conda-env-dir CONDA_ENV_DIR
specify directory for conda environments (default:
temporary directory)
Job control options:
--local run the QC on the local system (overrides any runners
defined in the configuration or on the command line)
-c N, --maxcores N maximum number of cores available for QC jobs when
using --local (default no limit, change in settings
file)
-m M, --maxmem M maximum total memory jobs can request at once when
using --local (in Gbs; default: unlimited)
-j N, --maxjobs N explicitly specify maximum number of concurrent QC
jobs to run (default 12, change in settings file;
ignored when using --local)
-b NBATCHES, --maxbatches NBATCHES
enable dynamic batching of pipeline jobs with maximum
number of batches set to NBATCHES (default: no
batching)
Custom QC protocol:
--protocol-spec SPECIFICATION
specify the QC protocol to use as a full specification
string
--index-reads READS explicitly specify the reads to treat as index
sequences (e.g. 'R2')
--sequence-reads READS
explicitly specify the reads to treat as biological
sequences data (e.g. 'R1,R3')
--qc-modules QC_MODULES
explicitly specify the QC modules to run (e.g.
'fastqc,fastq_screen')
Advanced options:
-r RUNNER, --runner RUNNER
explicitly specify runner definition for running QC
components. RUNNER must be a valid job runner
specification e.g. 'GEJobRunner(-j y)' (default: use
runners set in configuration)
-s N, --batch_size N batch QC commands with N commands per job (default: no
batching)
--ignore-metadata ignore information from project metadata file even if
one is located (default is to use project metadata)
--split-fastqs-by-lane
run QC on copies of input Fastqs where reads have been
split according to lane (default is to run QC on
original Fastqs)
--shorten-zip-paths {yes,no}
shorten paths in the QC report ZIP file; can be 'yes'
or 'no' (default: no)
--use-legacy-screen-names {yes,no}
use 'legacy' naming convention for FastqScreen output
files; can be 'yes' or 'no' (default: no)
--no-verify-fastqs skip Fastq verification (default: verify Fastqs before
running QC)
--no-multiqc turn off generation of MultiQC report
Debugging options:
--verbose run pipeline in 'verbose' mode
--work-dir WORKING_DIR
specify the working directory for the pipeline
operations
--no-cleanup don't remove the temporary project directory on
completion (by default the temporary directory is
deleted)
Deprecated/redundant options:
--fastq_screen_subset SUBSET
redundant: use the --fastq_subset option instead
--force redundant: HTML report generation will always be
attempted (even when pipeline fails)
--multiqc redundant: MultiQC report is generated by default (use
--no-multiqc to disable)
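The '[SAMPLE:]FASTQ_ID:FEATURE_TYPE' form taken by --10x_library has an optional leading SAMPLE field. One way to read such a value is sketched below, assuming that none of the fields contains a colon; the function name parse_10x_library and the example IDs and feature type strings are invented for illustration and are not taken from run_qc.py.

```python
def parse_10x_library(spec):
    """Split '[SAMPLE:]FASTQ_ID:FEATURE_TYPE' into (sample, fastq_id, feature_type).

    The sample is returned as None when it is not given.
    """
    parts = spec.split(":")
    if len(parts) == 2:
        fastq_id, feature_type = parts
        return (None, fastq_id, feature_type)
    if len(parts) == 3:
        sample, fastq_id, feature_type = parts
        return (sample, fastq_id, feature_type)
    raise ValueError(f"cannot parse library specification: {spec!r}")


# Hypothetical values
print(parse_10x_library("PB1_GEX:gene_expression"))
print(parse_10x_library("PB1:PB1_CML:multiplexing_capture"))
```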
transfer_data.py
usage: transfer_data.py [-h] [--version] [--subdir {random_bin,run_id}]
[--samples SAMPLE_LIST] [--filter FILTER_PATTERN]
[--no_fastqs] [--zip_fastqs]
[--max_zip_size MAX_ZIP_SIZE]
[--readme README_TEMPLATE] [--weburl WEBURL]
[--include_qc_report] [--include_10x_outputs]
[--include_cloupe_files] [--include_visium_images]
[--include_downloader] [--link] [--runner RUNNER]
[--dry-run]
DEST PROJECT
Transfer copies of Fastq data from an analysis project to an arbitrary
destination for sharing with other people
positional arguments:
DEST destination to copy Fastqs to; can be the name of a
destination defined in the configuration file, or an
arbitrary location of the form '[[USER@]HOST:]DIR' (no
destinations currently defined)
PROJECT path to project directory (or to a Fastqs subdirectory
in a project) to copy Fastqs from
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--subdir {random_bin,run_id}
subdirectory naming scheme: 'random_bin' locates a
random pre-existing empty subdirectory under the
target directory; 'run_id' creates a new subdirectory
'PLATFORM_DATESTAMP.RUN_ID-PROJECT'. If this option is
not set then no subdirectory will be used
Fastq selection:
--samples SAMPLE_LIST
list of names of samples to transfer
--filter FILTER_PATTERN
filter Fastq file names based on FILTER_PATTERN
--no_fastqs don't copy Fastqs (other artefacts will be copied, if
specified)
ZIP file archives:
--zip_fastqs put Fastqs into a ZIP file
--max_zip_size MAX_ZIP_SIZE
when using '--zip_fastqs' option, defines the maximum
size for the output zip file; multiple zip files will
be created if the data exceeds this limit (default is
create a single zip file with no size limit)
README generation:
--readme README_TEMPLATE
template file to generate README file from; can be
full path to a template file, or the name of a file in
the 'templates' directory
--weburl WEBURL base URL for webserver (sets the value of the WEBURL
variable in the template README)
Additional artefacts:
--include_qc_report copy the zipped QC reports to the final location
--include_10x_outputs
copy outputs from 10xGenomics pipelines (e.g.
'cellranger count') to the final location
--include_cloupe_files
copy .cloupe files output from 10xGenomics pipelines
to the final location
--include_visium_images
copy images for 10x Genomics Visium projects to the
final location
--include_downloader copy the 'download_fastqs.py' utility to the final
location
Advanced options:
--link hard link files instead of copying
--runner RUNNER specify the job runner to use for executing the
checksumming, Fastq copy and tar gzipping operations
(defaults to the 'transfer_data' job runner defined in
config file [SimpleJobRunner(join_logs=True)])
--dry-run report what would be done but don't actually transfer
any data
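An arbitrary DEST of the form '[[USER@]HOST:]DIR' can be broken into its parts with a regular expression. This is a minimal sketch with hypothetical names (DEST_PATTERN, parse_dest) and made-up example locations; it does not cover the case where DEST is the name of a destination defined in the configuration file, and it is not the parser used by transfer_data.py.

```python
import re

# USER@ is only allowed together with HOST:, and both are optional
DEST_PATTERN = re.compile(
    r"^(?:(?:(?P<user>[^@:/]+)@)?(?P<host>[^@:/]+):)?(?P<dir>.+)$"
)


def parse_dest(dest):
    """Return (user, host, directory); user and host are None if absent."""
    match = DEST_PATTERN.match(dest)
    if match is None:
        raise ValueError(f"cannot parse destination: {dest!r}")
    return match.group("user"), match.group("host"), match.group("dir")


print(parse_dest("alice@example.org:/data/share"))  # ('alice', 'example.org', '/data/share')
print(parse_dest("example.org:/data/share"))        # (None, 'example.org', '/data/share')
print(parse_dest("/data/share"))                    # (None, None, '/data/share')
```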