Utilities

Note

This documentation has been auto-generated from the command help.

In addition to the main auto_process.py command, a number of utilities are available:

analyse_barcodes.py

usage:
    analyse_barcodes.py FASTQ [FASTQ...]
    analyse_barcodes.py DIR
    analyse_barcodes.py -c COUNTS_FILE [COUNTS_FILE...]

Collate and report counts and statistics for Fastq index sequences (aka
barcodes). If multiple Fastq files are supplied then sequences will be pooled
before being analysed. If a single directory is supplied then this will be
assumed to be an output directory from bcl2fastq and files will be processed
on a per-lane basis. If the -c option is supplied then the input must be one
or more files of barcode counts generated previously using the -o option.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Input and output options:
  -c, --counts          input is one or more counts files generated by
                        previous runs using the '-o/--output' option
  -o COUNTS_FILE_OUT, --output COUNTS_FILE_OUT
                        output all counts to tab-delimited file
                        COUNTS_FILE_OUT. This can be used again in another run
                        by specifying the '-c' option

Reporting options:
  -l LANES, --lanes LANES
                        restrict analysis to the specified lane numbers
                        (default is to process all lanes). Multiple lanes can
                        be specified using ranges (e.g. '2-3'), comma-
                        separated list ('5,7') or a mixture ('2-3,5,7')
  -m MISMATCHES, --mismatches MISMATCHES
                        maximum number of mismatches to use when grouping
                        similar barcodes (will be determined automatically if
                        samplesheet is supplied, otherwise defaults to 0)
  --cutoff CUTOFF       exclude barcodes/barcode groups from reporting with a
                        smaller fraction of associated reads than CUTOFF, e.g.
                        '0.01' excludes barcodes with < 1.0% of reads
                        (default: 0.001)
  -s SAMPLE_SHEET, --sample-sheet SAMPLE_SHEET
                        report best matches against barcodes in SAMPLE_SHEET
  -r REPORT_FILE, --report REPORT_FILE
                        write report to REPORT_FILE (otherwise write to
                        stdout)
  -x XLS_FILE, --xls XLS_FILE
                        write XLS version of report to XLS_FILE
  -f HTML_FILE, --html HTML_FILE
                        write HTML version of report to HTML_FILE
  -t TITLE, --title TITLE
                        title for HTML report (default: 'Barcodes Report')
  -n, --no-report       suppress reporting (overrides --report)
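
The lane specification accepted by '-l/--lanes' can be read as a mixture of
single numbers and inclusive ranges. The following is a minimal sketch of how
such a string could be expanded; the function name parse_lanes is hypothetical
and this is not taken from the utility's own source:

```python
def parse_lanes(spec):
    """Expand a lane specification such as '2-3,5,7' into a sorted list.

    Illustrative only: parse_lanes is a hypothetical helper, not part of
    analyse_barcodes.py.
    """
    lanes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. '2-3' means lanes 2 and 3
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"bad lane range: {part!r}")
            lanes.update(range(start, end + 1))
        else:
            lanes.add(int(part))
    return sorted(lanes)


print(parse_lanes("2-3,5,7"))  # [2, 3, 5, 7]
```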

Advanced options:
  --sequences           count sequences instead of barcodes
  --seq-start SEQ_START
                        specify first base of sequence to analyse (for
                        --sequences option; default is start at the first base
                        position)
  --seq-end SEQ_END     specify last base of sequence to analyse (for
                        --sequences option; default is to end at the last
                        base position)
  --minimum_read_fraction FRACTION
                        weed out individual barcodes from initial analysis
                        which have a smaller fraction of reads than FRACTION,
                        e.g. '0.001' removes barcodes with < 0.1% of reads;
                        speeds up analysis at the expense of accuracy as
                        reported counts will be approximate (default: 1.0e-6)
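
Both '--cutoff' and '--minimum_read_fraction' are thresholds on the fraction
of reads associated with a barcode. As a rough illustration of that kind of
threshold (a sketch only; the function name and the example counts are
made up, and the utility's actual grouping logic is not reproduced here):

```python
def apply_cutoff(counts, cutoff=0.001):
    """Keep barcodes whose share of all reads is at least `cutoff`.

    `counts` maps barcode sequence to number of reads. Hypothetical helper
    for illustration; not part of analyse_barcodes.py.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bc: n for bc, n in counts.items() if n / total >= cutoff}


# Made-up example counts (10000 reads in total)
counts = {"ACGTAC": 9900, "ACGTAA": 95, "TTTTTT": 5}
print(apply_cutoff(counts, cutoff=0.01))  # {'ACGTAC': 9900}
```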

assign_barcodes.py

usage: assign_barcodes.py [-h] [-n N] INPUT.fq OUTPUT.fq

Extract arbitrary sequence fragments from reads in INPUT.fq FASTQ file and
assign these as the index (barcode) sequences in the read headers in
OUTPUT.fq.

positional arguments:
  INPUT.fq    Input FASTQ file
  OUTPUT.fq   Output FASTQ file

optional arguments:
  -h, --help  show this help message and exit
  -n N        remove first N bases from each read and assign these as barcode
              index sequence (default: 5)
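
The operation described for '-n' can be sketched on a single read as follows.
This assumes an Illumina-style header whose final ':'-separated field holds
the index sequence; the function name, the example read and the exact way the
header is rewritten are assumptions, not the utility's confirmed behaviour:

```python
def assign_barcode(record, n=5):
    """Move the first `n` bases of a read into its header as the index.

    `record` is a (header, sequence, plus_line, quality) tuple.
    Hypothetical helper for illustration only.
    """
    header, seq, plus, qual = record
    barcode = seq[:n]
    # Assumption: the last ':'-separated header field is the index sequence
    head, sep, _old_index = header.rpartition(":")
    if sep:
        new_header = f"{head}:{barcode}"
    else:
        # No ':' found, so just append the barcode
        new_header = f"{header}:{barcode}"
    # Trim the same number of positions from sequence and quality
    return (new_header, seq[n:], plus, qual[n:])


# Made-up example read
read = ("@M00001:1:FC:1:1101:1000:2000 1:N:0:NNNNNN",
        "ACGTAGGGTTT", "+", "IIIIIHHHHHH")
print(assign_barcode(read, n=5))
```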

audit_projects.py

usage: audit_projects.py [-h] [--pi PI_NAME] [--unassigned] [DIR [DIR ...]]

Summarise the disk usage for runs that have been processed using auto_process.
The supplied DIRs are directories holding one or more top-level analysis
directories corresponding to different runs. The program reports total disk
usage for projects assigned to each PI across all DIRs.

positional arguments:
  DIR           directory to search for analysis directories for auditing

optional arguments:
  -h, --help    show this help message and exit
  --pi PI_NAME  list data for PI(s) matching PI_NAME (can use glob-style
                patterns)
  --unassigned  list data for projects where PI is not assigned
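
Glob-style matching of the kind accepted by '--pi' can be illustrated with
Python's fnmatch module. This is a sketch under assumptions: the PI names are
invented, and whether the utility matches case-sensitively is not stated in
the help text:

```python
from fnmatch import fnmatchcase


def select_pis(pi_names, pattern):
    """Return the PI names matching a glob-style pattern (case-sensitive).

    Hypothetical helper for illustration; not part of audit_projects.py.
    """
    return [name for name in pi_names if fnmatchcase(name, pattern)]


# Invented names
names = ["Alice Smith", "Bob Jones", "Alan Smithers"]
print(select_pis(names, "*Smith*"))  # ['Alice Smith', 'Alan Smithers']
```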

build_index.py

usage: build_index.py [-h] [-v] [-o OUT_DIR] [--ebwt_base NAME]
                      [--bt2_base NAME] [--overhang N] [--sa_index_nbases N]
                      [-V VERSION] [-r RUNNER]
                      ALIGNER FASTA [ANNOTATION]

Generate indexes for aligners

positional arguments:
  ALIGNER               aligner to build index for (one of 'bowtie',
                        'bowtie2', 'star')
  FASTA                 FASTA file with sequence
  ANNOTATION            annotation file (for use with STAR)

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -o OUT_DIR            output directory for indexes

Bowtie-specific options:
  --ebwt_base NAME      specify basename for output .ebwt files (defaults to
                        FASTA file basename)

Bowtie2-specific options:
  --bt2_base NAME       specify basename for output .bt2 files (defaults to
                        FASTA file basename)

STAR-specific options:
  --overhang N          set value for STAR --sjdbOverhang option (default:
                        100)
  --sa_index_nbases N   set value for STAR --genomeSAindexNbases option
                        (default: use default in STAR)

Advanced options:
  -V VERSION, --aligner-version VERSION
                        specify the version of the aligner to target (only
                        works if conda dependency resolution is configured)
  -r RUNNER, --runner RUNNER
                        explicitly specify runner definition for building the
                        index. RUNNER must be a valid job runner specification
                        e.g. 'GEJobRunner(-pe smp.pe 8)' (default: use
                        appropriate runner from configuration)

concat_fastqs.py

usage: concat_fastqs.py [-h] [--version] [-v] FASTQ [FASTQ ...] FASTQ_OUT

Concatenate reads from one or more input Fastq files into a single new file
FASTQ_OUT

positional arguments:
  FASTQ          Input FASTQ to concatenate
  FASTQ_OUT      Output FASTQ with concatenated reads

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit
  -v, --verbose  verbose output

barcode_splitter.py

usage:
    barcode_splitter.py [OPTIONS] FASTQ [FASTQ ...]
    barcode_splitter.py [OPTIONS] FASTQ_R1,FASTQ_R2 [FASTQ_R1,FASTQ_R2 ...]
    barcode_splitter.py [OPTIONS] DIR

Split reads from one or more input Fastq files into new Fastqs based on
matching supplied barcodes.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -b INDEX_SEQ, --barcode INDEX_SEQ
                        specify index sequence to filter using
  -m N_MISMATCHES, --mismatches N_MISMATCHES
                        maximum number of differing bases to allow for two
                        index sequences to count as a match. Default is zero
                        (i.e. exact matches only)
  -n BASE_NAME, --name BASE_NAME
                        basename to use for output files
  -o OUT_DIR, --output-dir OUT_DIR
                        specify directory for output split Fastqs
  -u UNALIGNED_DIR, --unaligned UNALIGNED_DIR
                        specify subdirectory with outputs from bcl2fastq
  -l LANE, --lane LANE  specify lane to collect and split Fastqs for
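
The '-m/--mismatches' criterion counts differing bases between two index
sequences. A minimal sketch of such a comparison follows; it assumes
sequences of equal length and treats every differing character as a mismatch,
which may not be how the utility handles cases such as 'N' bases or
dual indexes:

```python
def count_mismatches(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


def is_match(index_seq, barcode, max_mismatches=0):
    """True if `index_seq` is within `max_mismatches` of `barcode`.

    Hypothetical helper for illustration; not part of barcode_splitter.py.
    """
    return count_mismatches(index_seq, barcode) <= max_mismatches


print(is_match("ACGTAC", "ACGTAA"))                    # False
print(is_match("ACGTAC", "ACGTAA", max_mismatches=1))  # True
```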

download_fastqs.py

usage: download_fastqs.py [-h] URL [DIR]

Download checksum file and fastqs from URL into current directory (or
directory DIR, if specified), and verify the downloaded files against the
checksum file.

positional arguments:
  URL         URL with checksum file and fastqs
  DIR         directory to put downloaded fastqs into (defaults to current
              directory)

optional arguments:
  -h, --help  show this help message and exit
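
The verification step can be pictured as recomputing a checksum for each
downloaded file and comparing it with the value in the checksum file. The
sketch below assumes MD5 checksums in the layout written by 'md5sum' (hash,
whitespace, file name); both the format and the function names are
assumptions rather than documented behaviour:

```python
import hashlib
import os


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_checksums(checksum_file, directory="."):
    """Return names of files that are missing or fail verification.

    Assumes one '<md5>  <filename>' entry per line. Hypothetical helper
    for illustration; not part of download_fastqs.py.
    """
    failed = []
    with open(checksum_file) as fp:
        for line in fp:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            expected, name = line.split(None, 1)
            name = name.lstrip("*")  # 'md5sum' marks binary mode with '*'
            path = os.path.join(directory, name)
            if not os.path.isfile(path) or md5sum(path) != expected.lower():
                failed.append(name)
    return failed
```

An empty list returned by verify_checksums would indicate that every listed
file was present and matched its checksum.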

fastq_statistics.py

usage: fastq_statistics.py [-h] [-v] [--unaligned UNALIGNED_DIR]
                           [--sample-sheet SAMPLE_SHEET] [-o STATS_FILE]
                           [-p PER_LANE_STATS_FILE]
                           [-s PER_LANE_SAMPLE_STATS_FILE]
                           [-f FULL_STATS_FILE] [-u] [-n N] [--debug]
                           [--force]
                           ILLUMINA_RUN_DIR

Generate statistics for FASTQ files in ILLUMINA_RUN_DIR (top-level directory
of a processed Illumina run)

positional arguments:
  ILLUMINA_RUN_DIR      input Illumina run directory

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --unaligned UNALIGNED_DIR
                        specify an alternative name for the 'Unaligned'
                        directory containing the fastq.gz files
  --sample-sheet SAMPLE_SHEET
                        specify a sample sheet file to get additional
                        information from
  -o STATS_FILE, --output STATS_FILE
                        name of output file for per-file statistics (default
                        is 'statistics.info')
  -p PER_LANE_STATS_FILE, --per-lane-stats PER_LANE_STATS_FILE
                        name of output file for per-lane statistics (default
                        is 'per_lane_statistics.info')
  -s PER_LANE_SAMPLE_STATS_FILE, --per-lane-sample-stats PER_LANE_SAMPLE_STATS_FILE
                        name of output file for per-lane sample statistics
                        (default is 'per_lane_sample_stats.info')
  -f FULL_STATS_FILE, --full-stats FULL_STATS_FILE
                        name of output file for full statistics (default is
                        'statistics_full.info')
  -u, --update          update existing full statistics file with stats for
                        additional files
  -n N, --nprocessors N
                        spread work across N processors/cores (default is 1)
  --debug               turn on debugging output

Deprecated/defunct options:
  --force               does nothing: retained for backwards compatibility

fetch_data.py

usage: fetch_data.py [-h] [--version] [--flatten] [--overwrite]
                     [--runner RUNNER]
                     SOURCE DEST

Copy files and directories from arbitrary locations to the local system

positional arguments:
  SOURCE           source data (file or directory) to copy; can be on a local
                   or remote file system. NB if source is a directory then the
                   contents are copied (not the top-level directory)
  DEST             destination directory on local file system

optional arguments:
  -h, --help       show this help message and exit
  --version        show program's version number and exit
  --flatten        copy files without replicating the source directory
                   structure
  --overwrite      overwrite existing files (default is to skip existing
                   files)
  --runner RUNNER  specify the job runner to use for executing 'rsync'
                   operations (defaults to job runner defined for copying in
                   config file [SimpleJobRunner(join_logs=True)])
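
The effect of '--flatten' on where a file ends up can be sketched as a path
mapping. This is an illustration under assumptions (POSIX-style paths, a
hypothetical function name); the utility itself delegates the copy to 'rsync'
and its exact behaviour is not reproduced here:

```python
import os


def dest_path(source_root, source_file, dest, flatten=False):
    """Work out where `source_file` would land under `dest`.

    Without flattening, the path relative to `source_root` is preserved
    (the contents of the source directory are copied, not the top-level
    directory itself); with flattening only the file name is kept.
    Hypothetical helper for illustration only.
    """
    if flatten:
        rel = os.path.basename(source_file)
    else:
        rel = os.path.relpath(source_file, source_root)
    return os.path.join(dest, rel)


# Made-up paths
print(dest_path("/src/run", "/src/run/a/b.fastq.gz", "/dst"))
print(dest_path("/src/run", "/src/run/a/b.fastq.gz", "/dst", flatten=True))
```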

manage_fastqs.py

usage:
    manage_fastqs.py DIR
    manage_fastqs.py DIR PROJECT
    manage_fastqs.py DIR PROJECT copy [[user@]host:]DEST
    manage_fastqs.py DIR PROJECT md5
    manage_fastqs.py DIR PROJECT zip

Fastq management utility. If only DIR is supplied then list the projects; if
PROJECT is supplied then list the fastqs; 'copy' command copies fastqs for the
specified PROJECT to DEST on a local or remote server; 'md5' command generates
checksums for the fastqs; 'zip' command creates a zip file with the fastq
files.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --samples SAMPLE_LIST
                        list of names of samples to transfer
  --filter PATTERN      filter file names for reporting and copying based on
                        PATTERN
  --fastq_dir FASTQ_DIR
                        explicitly specify subdirectory of DIR containing
                        the Fastq files
  --max_zip_size MAX_ZIP_SIZE
                        for 'zip' command, defines the maximum size for the
                        output zip file; multiple zip files will be created if
                        the data exceeds this limit (default is create a
                        single zip file with no size limit)
  --link                hard link files instead of copying

run_qc.py

usage: run_qc.py [-h] [--version] [--info] [-n NAME] [-o OUT_DIR]
                 [--qc_dir QC_DIR] [-f FILENAME] [-u] [--organism ORGANISM]
                 [--library-type LIBRARY] [--single-cell-platform PLATFORM]
                 [--biological-samples SAMPLES] [-p PROTOCOL]
                 [--fastq-pattern PATTERN] [--fastq_subset SUBSET]
                 [-t NTHREADS] [--star-index INDEX] [--gtf GTF]
                 [--cellranger CELLRANGER_EXE]
                 [--cellranger-reference REFERENCE]
                 [--cellranger-probeset PROBE_SET]
                 [--cellranger-vdj-reference VDJ_REFERENCE]
                 [--10x_chemistry {ARC-v1,SC3Pv1,SC3Pv2,SC3Pv3,SC3Pv3HT,SC3Pv3LT,SC3Pv4,SC5P-PE,SC5P-PE-v3,SC5P-R2,SC5P-R2-v3,auto,fiveprime,threeprime}]
                 [--10x_force_cells N_CELLS]
                 [--10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE]
                 [--10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID[,...]]
                 [--enable-conda {yes,no}] [--conda-env-dir CONDA_ENV_DIR]
                 [--local] [-c N] [-m M] [-j N] [-b NBATCHES]
                 [--protocol-spec SPECIFICATION] [--index-reads READS]
                 [--sequence-reads READS] [--qc-modules QC_MODULES]
                 [-r RUNNER] [-s N] [--ignore-metadata]
                 [--split-fastqs-by-lane] [--shorten-zip-paths {yes,no}]
                 [--use-legacy-screen-names {yes,no}] [--no-verify-fastqs]
                 [--no-multiqc] [--verbose] [--work-dir WORKING_DIR]
                 [--no-cleanup] [--fastq_screen_subset SUBSET] [--force]
                 [--multiqc]
                 DIR | FASTQ [FASTQ ...] [DIR | FASTQ [FASTQ ...] ...]

Run the QC pipeline standalone on an arbitrary set of Fastq files.

positional arguments:
  DIR | FASTQ [ FASTQ ... ]
                        directory or list of Fastq files to run the QC on

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --info                display information on protocols, organisms and other
                        settings (then exit)

Output and reporting:
  -n NAME, --name NAME  name for the project (used in report title)
  -o OUT_DIR, --out_dir OUT_DIR
                        top-level directory for reports and QC output
                        subdirectory (default: current working directory)
  --qc_dir QC_DIR       explicitly specify QC output directory. NB if a
                        relative path is supplied then it's assumed to be a
                        subdirectory of OUT_DIR (default: <OUT_DIR>/qc)
  -f FILENAME, --filename FILENAME
                        file name for output QC report (default:
                        <OUT_DIR>/<QC_DIR_NAME>_report.html)
  -u, --update          force QC pipeline to run even if output QC directory
                        already exists in <OUT_DIR> (default: stop if output
                        QC directory already exists)
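
The relationship between '-o/--out_dir', '--qc_dir' and the default report
name can be sketched as below. The function names are hypothetical, and
reading <QC_DIR_NAME> as the final component of the QC directory path is an
assumption:

```python
import os


def resolve_qc_dir(out_dir, qc_dir=None):
    """Resolve the QC directory: relative paths are placed under OUT_DIR."""
    if qc_dir is None:
        qc_dir = "qc"  # default: <OUT_DIR>/qc
    if not os.path.isabs(qc_dir):
        qc_dir = os.path.join(out_dir, qc_dir)
    return os.path.normpath(qc_dir)


def default_report_name(out_dir, qc_dir):
    """Default report path: <OUT_DIR>/<QC_DIR_NAME>_report.html.

    Assumes <QC_DIR_NAME> means the basename of the QC directory.
    """
    return os.path.join(out_dir, os.path.basename(qc_dir) + "_report.html")


# Made-up paths
qc_dir = resolve_qc_dir("/data/proj", "qc_trimmed")
print(qc_dir)
print(default_report_name("/data/proj", qc_dir))
```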

Metadata:
  --organism ORGANISM   explicitly specify organism (e.g. 'human', 'mouse').
                        Multiple organisms should be separated by commas (e.g.
                        'human,mouse'). HINT use the --info option to list the
                        defined organisms
  --library-type LIBRARY
                        explicitly specify library type (e.g. 'RNA-seq',
                        'ChIP-seq')
  --single-cell-platform PLATFORM
                        explicitly specify the single cell platform (e.g.
                        '10xGenomics Chromium 3'v3')
  --biological-samples SAMPLES
                        explicitly specify subset of sample names with
                        biological data as comma-separated list (e.g.
                        'AB1,AB2,..') (default: assume that all samples
                        contain biological data)

QC options:
  -p PROTOCOL, --protocol PROTOCOL
                        explicitly specify the QC protocol to use; can be one
                        of 'minimal', 'standard', 'standardSE', 'standardPE',
                        '10x_scRNAseq', '10x_snRNAseq', '10x_scATAC',
                        '10x_Multiome_GEX', '10x_Multiome_ATAC', '10x_OCM',
                        '10x_CellPlex', '10x_Flex', '10x_ImmuneProfiling',
                        '10x_Visium_GEX', '10x_Visium_GEX_90bp_insert',
                        '10x_Visium_PEX', '10x_Visium_legacy',
                        'ParseEvercode', 'BioRad_ddSEQ_ATAC', 'singlecell'. If
                        not set then protocol will be determined automatically
                        based on directory contents and metadata.
  --fastq-pattern PATTERN
                        specify a custom pattern for identifying sample and
                        read number information from the Fastq names (e.g.
                        '{SAMPLE}_{READ}')
  --fastq_subset SUBSET
                        specify size of subset of reads to use for
                        FastQScreen, strandedness, coverage etc (default:
                        100000; set to 0 to use all reads)
  -t NTHREADS, --threads NTHREADS
                        number of threads to use for multicore jobs (default:
                        taken from job runners; ignored when using --local)

Reference data:
  --star-index INDEX    specify the path to the STAR genome index to use when
                        mapping reads for metrics such as strandedness etc
                        (overrides the organism-specific indexes defined in
                        the config file)
  --gtf GTF             specify the path to the GTF annotation file to use for
                        metrics such as 'qualimap rnaseq' (overrides the
                        organism-specific GTF files defined in the config
                        file)

Cellranger/10xGenomics options:
  --cellranger CELLRANGER_EXE
                        explicitly specify path to Cellranger executable to
                        use for single library analysis
  --cellranger-reference REFERENCE
                        specify the path to the reference dataset to use when
                        running single library analysis (overrides the
                        organism-specific references defined in the config
                        file)
  --cellranger-probeset PROBE_SET
                        specify the path to the probe set reference dataset
                        for 'cellranger multi'
  --cellranger-vdj-reference VDJ_REFERENCE
                        specify the path to the VDJ reference dataset for
                        'cellranger multi'
  --10x_chemistry {ARC-v1,SC3Pv1,SC3Pv2,SC3Pv3,SC3Pv3HT,SC3Pv3LT,SC3Pv4,SC5P-PE,SC5P-PE-v3,SC5P-R2,SC5P-R2-v3,auto,fiveprime,threeprime}
                        assay configuration for 10xGenomics scRNA-seq; if set
                        to 'auto' (the default) then cellranger will attempt
                        to determine this automatically
  --10x_force_cells N_CELLS
                        force number of cells for 10xGenomics scRNA-seq and
                        scATAC-seq, overriding automatic cell detection
                        algorithms (default is to use built-in cell detection)
  --10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE
                        assign feature type to a Fastq file ID for 'cellranger
                        multi' analysis (optionally also specify a physical
                        sample)
  --10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID[,...]
                        assign multiplexed sample names to CMO or probe IDs
                        for 'cellranger multi' analysis (optionally also
                        specify a physical sample)
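
The '[SAMPLE:]FASTQ_ID:FEATURE_TYPE' form used by '--10x_library' has an
optional leading field. One way such a value could be split is sketched
below; the function name and the example values are hypothetical, and the
pipeline's own parsing may differ:

```python
def parse_10x_library(value):
    """Split '[SAMPLE:]FASTQ_ID:FEATURE_TYPE' into its three parts.

    Returns (sample, fastq_id, feature_type), with sample set to None
    when the optional physical sample is omitted. Hypothetical helper
    for illustration; not part of run_qc.py.
    """
    fields = value.split(":")
    if len(fields) == 2:
        sample = None
        fastq_id, feature_type = fields
    elif len(fields) == 3:
        sample, fastq_id, feature_type = fields
    else:
        raise ValueError(f"bad library specification: {value!r}")
    return sample, fastq_id, feature_type


# Made-up Fastq IDs and sample name
print(parse_10x_library("AB1_GEX:Gene Expression"))
print(parse_10x_library("AB1:AB1_GEX:Gene Expression"))
```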

Conda dependency resolution:
  --enable-conda {yes,no}
                        use conda to resolve task dependencies; can be 'yes'
                        or 'no' (default: no)
  --conda-env-dir CONDA_ENV_DIR
                        specify directory for conda environments (default:
                        temporary directory)

Job control options:
  --local               run the QC on the local system (overrides any runners
                        defined in the configuration or on the command line)
  -c N, --maxcores N    maximum number of cores available for QC jobs when
                        using --local (default no limit, change in settings
                        file)
  -m M, --maxmem M      maximum total memory jobs can request at once when
                        using --local (in Gbs; default: unlimited)
  -j N, --maxjobs N     explicitly specify maximum number of concurrent QC
                        jobs to run (default 12, change in settings file;
                        ignored when using --local)
  -b NBATCHES, --maxbatches NBATCHES
                        enable dynamic batching of pipeline jobs with maximum
                        number of batches set to NBATCHES (default: no
                        batching)

Custom QC protocol:
  --protocol-spec SPECIFICATION
                        specify the QC protocol to use as a full specification
                        string
  --index-reads READS   explicitly specify the reads to treat as index
                        sequences (e.g. 'R2')
  --sequence-reads READS
                        explicitly specify the reads to treat as biological
                        sequence data (e.g. 'R1,R3')
  --qc-modules QC_MODULES
                        explicitly specify the QC modules to run (e.g.
                        'fastqc,fastq_screen')

Advanced options:
  -r RUNNER, --runner RUNNER
                        explicitly specify runner definition for running QC
                        components. RUNNER must be a valid job runner
                        specification e.g. 'GEJobRunner(-j y)' (default: use
                        runners set in configuration)
  -s N, --batch_size N  batch QC commands with N commands per job (default: no
                        batching)
  --ignore-metadata     ignore information from project metadata file even if
                        one is located (default is to use project metadata)
  --split-fastqs-by-lane
                        run QC on copies of input Fastqs where reads have been
                        split according to lane (default is to run QC on
                        original Fastqs)
  --shorten-zip-paths {yes,no}
                        shorten paths in the QC report ZIP file; can be 'yes'
                        or 'no' (default: no)
  --use-legacy-screen-names {yes,no}
                        use 'legacy' naming convention for FastqScreen output
                        files; can be 'yes' or 'no' (default: no)
  --no-verify-fastqs    skip Fastq verification (default: verify Fastqs before
                        running QC)
  --no-multiqc          turn off generation of MultiQC report

Debugging options:
  --verbose             run pipeline in 'verbose' mode
  --work-dir WORKING_DIR
                        specify the working directory for the pipeline
                        operations
  --no-cleanup          don't remove the temporary project directory on
                        completion (by default the temporary directory is
                        deleted)

Deprecated/redundant options:
  --fastq_screen_subset SUBSET
                        redundant: use the --fastq_subset option instead
  --force               redundant: HTML report generation will always be
                        attempted (even when pipeline fails)
  --multiqc             redundant: MultiQC report is generated by default (use
                        --no-multiqc to disable)

transfer_data.py

usage: transfer_data.py [-h] [--version] [--subdir {random_bin,run_id}]
                        [--samples SAMPLE_LIST] [--filter FILTER_PATTERN]
                        [--no_fastqs] [--zip_fastqs]
                        [--max_zip_size MAX_ZIP_SIZE]
                        [--readme README_TEMPLATE] [--weburl WEBURL]
                        [--include_qc_report] [--include_10x_outputs]
                        [--include_cloupe_files] [--include_visium_images]
                        [--include_downloader] [--link] [--runner RUNNER]
                        [--dry-run]
                        DEST PROJECT

Transfer copies of Fastq data from an analysis project to an arbitrary
destination for sharing with other people

positional arguments:
  DEST                  destination to copy Fastqs to; can be the name of a
                        destination defined in the configuration file, or an
                        arbitrary location of the form '[[USER@]HOST:]DIR' (no
                        destinations currently defined)
  PROJECT               path to project directory (or to a Fastqs subdirectory
                        in a project) to copy Fastqs from
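
The '[[USER@]HOST:]DIR' form of DEST nests two optional parts. A sketch of
how it could be taken apart with a regular expression follows; the pattern
and function name are assumptions for illustration, and DEST values that
name a destination from the configuration file are not covered:

```python
import re

# Optional 'USER@' nested inside optional 'HOST:', followed by the directory
DEST_RE = re.compile(
    r"^(?:(?:(?P<user>[^@:/]+)@)?(?P<host>[^@:/]+):)?(?P<dir>.+)$"
)


def split_dest(dest):
    """Return (user, host, directory); user and host may be None.

    Hypothetical helper for illustration; not part of transfer_data.py.
    """
    match = DEST_RE.match(dest)
    if match is None:
        raise ValueError(f"bad destination: {dest!r}")
    return match.group("user"), match.group("host"), match.group("dir")


# Made-up locations
print(split_dest("/data/share"))
print(split_dest("host.example.org:/data/share"))
print(split_dest("user@host.example.org:/data/share"))
```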

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --subdir {random_bin,run_id}
                        subdirectory naming scheme: 'random_bin' locates a
                        random pre-existing empty subdirectory under the
                        target directory; 'run_id' creates a new subdirectory
                        'PLATFORM_DATESTAMP.RUN_ID-PROJECT'. If this option is
                        not set then no subdirectory will be used

Fastq selection:
  --samples SAMPLE_LIST
                        list of names of samples to transfer
  --filter FILTER_PATTERN
                        filter Fastq file names based on FILTER_PATTERN
  --no_fastqs           don't copy Fastqs (other artefacts will be copied, if
                        specified)

ZIP file archives:
  --zip_fastqs          put Fastqs into a ZIP file
  --max_zip_size MAX_ZIP_SIZE
                        when using '--zip_fastqs' option, defines the maximum
                        size for the output zip file; multiple zip files will
                        be created if the data exceeds this limit (default is
                        create a single zip file with no size limit)
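
Splitting data across several ZIP files when a size limit is set can be
pictured as grouping files so that each group stays within the limit. The
sketch below uses a simple in-order grouping on uncompressed sizes; the
function name, the strategy and the units are assumptions, not a description
of how the utility actually packs its archives:

```python
def batch_by_size(files, max_size=None):
    """Group (name, size) pairs so each group totals at most `max_size`.

    With no limit, everything goes into a single group. A file larger
    than the limit ends up in a group of its own. Hypothetical helper
    for illustration only.
    """
    if max_size is None:
        return [[name for name, _ in files]] if files else []
    batches = []
    current, current_size = [], 0
    for name, size in files:
        if current and current_size + size > max_size:
            # Adding this file would exceed the limit: start a new group
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Made-up file names and sizes
files = [("a.fastq.gz", 4), ("b.fastq.gz", 4), ("c.fastq.gz", 4)]
print(batch_by_size(files, max_size=8))
# [['a.fastq.gz', 'b.fastq.gz'], ['c.fastq.gz']]
```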

README generation:
  --readme README_TEMPLATE
                        template file to generate README file from; can be
                        full path to a template file, or the name of a file in
                        the 'templates' directory
  --weburl WEBURL       base URL for webserver (sets the value of the WEBURL
                        variable in the template README)

Additional artefacts:
  --include_qc_report   copy the zipped QC reports to the final location
  --include_10x_outputs
                        copy outputs from 10xGenomics pipelines (e.g.
                        'cellranger count') to the final location
  --include_cloupe_files
                        copy .cloupe files output from 10xGenomics pipelines
                        to the final location
  --include_visium_images
                        copy images for 10x Genomics Visium projects to the
                        final location
  --include_downloader  copy the 'download_fastqs.py' utility to the final
                        location

Advanced options:
  --link                hard link files instead of copying
  --runner RUNNER       specify the job runner to use for executing the
                        checksumming, Fastq copy and tar gzipping operations
                        (defaults to the 'transfer_data' job runner defined in
                        config file [SimpleJobRunner(join_logs=True)])
  --dry-run             report what would be done but don't actually transfer
                        any data