Utilities
=========

.. note::

   This documentation has been auto-generated from the
   command help

In addition to the main ``auto_process.py`` command, a number of utilities
are available:

.. contents:: :local:

.. _utilities_analyse_barcodes:

analyse_barcodes.py
*******************

::

    usage: analyse_barcodes.py FASTQ [FASTQ...]
           analyse_barcodes.py DIR
           analyse_barcodes.py -c COUNTS_FILE [COUNTS_FILE...]

    Collate and report counts and statistics for Fastq index sequences (aka
    barcodes). If multiple Fastq files are supplied then sequences will be pooled
    before being analysed. If a single directory is supplied then this will be
    assumed to be an output directory from bcl2fastq and files will be processed
    on a per-lane basis. If the -c option is supplied then the input must be one
    or more files of barcode counts generated previously using the -o option.

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit

    Input and output options:
      -c, --counts          input is one or more counts files generated by
                            previous runs using the '-o/--output' option
      -o COUNTS_FILE_OUT, --output COUNTS_FILE_OUT
                            output all counts to tab-delimited file
                            COUNTS_FILE_OUT. This can be used again in another
                            run by specifying the '-c' option

    Reporting options:
      -l LANES, --lanes LANES
                            restrict analysis to the specified lane numbers
                            (default is to process all lanes). Multiple lanes
                            can be specified using ranges (e.g. '2-3'),
                            comma-separated list ('5,7') or a mixture
                            ('2-3,5,7')
      -m MISMATCHES, --mismatches MISMATCHES
                            maximum number of mismatches to use when grouping
                            similar barcodes (will be determined automatically
                            if samplesheet is supplied, otherwise defaults to
                            0)
      --cutoff CUTOFF       exclude barcodes/barcode groups from reporting with
                            a smaller fraction of associated reads than CUTOFF,
                            e.g. '0.01' excludes barcodes with < 1.0% of reads
                            (default: 0.001)
      -s SAMPLE_SHEET, --sample-sheet SAMPLE_SHEET
                            report best matches against barcodes in
                            SAMPLE_SHEET
      -r REPORT_FILE, --report REPORT_FILE
                            write report to REPORT_FILE (otherwise write to
                            stdout)
      -x XLS_FILE, --xls XLS_FILE
                            write XLS version of report to XLS_FILE
      -f HTML_FILE, --html HTML_FILE
                            write HTML version of report to HTML_FILE
      -t TITLE, --title TITLE
                            title for HTML report (default: 'Barcodes Report')
      -n, --no-report       suppress reporting (overrides --report)

    Advanced options:
      --sequences           count sequences instead of barcodes
      --seq-start SEQ_START
                            specify first base of sequence to analyse (for
                            --sequences option; default is start at the first
                            base position)
      --seq-end SEQ_END     specify last base of sequence to analyse (for
                            --sequences option; default is start at the first
                            base position)
      --minimum_read_fraction FRACTION
                            weed out individual barcodes from initial analysis
                            which have a smaller fraction of reads than
                            FRACTION, e.g. '0.001' removes barcodes with < 0.1%
                            of reads; speeds up analysis at the expense of
                            accuracy as reported counts will be approximate
                            (default: 1.0e-6)

.. _utilities_assign_barcodes:

assign_barcodes.py
******************

::

    usage: assign_barcodes.py [-h] [-n N] INPUT.fq OUTPUT.fq

    Extract arbitrary sequence fragments from reads in INPUT.fq FASTQ file and
    assign these as the index (barcode) sequences in the read headers in
    OUTPUT.fq.

    positional arguments:
      INPUT.fq    Input FASTQ file
      OUTPUT.fq   Output FASTQ file

    optional arguments:
      -h, --help  show this help message and exit
      -n N        remove first N bases from each read and assign these as
                  barcode index sequence (default: 5)

.. _utilities_audit_projects:

audit_projects.py
*****************

::

    usage: audit_projects.py [-h] [--pi PI_NAME] [--unassigned] [DIR [DIR ...]]

    Summarise the disk usage for runs that have been processed using auto_process.
    The supplied DIRs are directories holding one or more top-level analysis
    directories corresponding to different runs. The program reports total disk
    usage for projects assigned to each PI across all DIRs.

    positional arguments:
      DIR           directory to search for analysis directories for auditing

    optional arguments:
      -h, --help    show this help message and exit
      --pi PI_NAME  list data for PI(s) matching PI_NAME (can use glob-style
                    patterns)
      --unassigned  list data for projects where PI is not assigned

.. _utilities_build_index:

build_index.py
**************

::

    usage: build_index.py [-h] [-v] [-o OUT_DIR] [--ebwt_base NAME]
                          [--bt2_base NAME] [--overhang N]
                          [--sa_index_nbases N] [-V VERSION] [-r RUNNER]
                          ALIGNER FASTA [ANNOTATION]

    Generate indexes for aligners

    positional arguments:
      ALIGNER               aligner to build index for (one of 'bowtie',
                            'bowtie2', 'star')
      FASTA                 FASTA file with sequence
      ANNOTATION            annotation file (for use with STAR)

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit
      -o OUT_DIR            output directory for indexes

    Bowtie-specific options:
      --ebwt_base NAME      specify basename for output .ebwt files (defaults
                            to FASTA file basename)

    Bowtie2-specific options:
      --bt2_base NAME       specify basename for output .bt2 files (defaults to
                            FASTA file basename)

    STAR-specific options:
      --overhang N          set value for STAR --sjdbOverhang option (default:
                            100)
      --sa_index_nbases N   set value for STAR --genomeSAindexNbases option
                            (default: use default in STAR)

    Advanced options:
      -V VERSION, --aligner-version VERSION
                            specify the version of the aligner to target (only
                            works if conda dependency resolution is configured)
      -r RUNNER, --runner RUNNER
                            explicitly specify runner definition for building
                            the index. RUNNER must be a valid job runner
                            specification e.g. 'GEJobRunner(-pe smp.pe 8)'
                            (default: use appropriate runner from
                            configuration)

.. _utilities_concat_fastqs:

concat_fastqs.py
****************

::

    usage: concat_fastqs.py [-h] [--version] [-v] FASTQ [FASTQ ...] FASTQ_OUT

    Concatenate reads from one or more input Fastq files into a single new file
    FASTQ_OUT

    positional arguments:
      FASTQ          Input FASTQ to concatenate
      FASTQ_OUT      Output FASTQ with concatenated reads

    optional arguments:
      -h, --help     show this help message and exit
      --version      show program's version number and exit
      -v, --verbose  verbose output

.. _utilities_barcode_splitter:

barcode_splitter.py
*******************

::

    usage: barcode_splitter.py [OPTIONS] FASTQ [FASTQ ...]
           barcode_splitter.py [OPTIONS] FASTQ_R1,FASTQ_R2 [FASTQ_R1,FASTQ_R2 ...]
           barcode_splitter.py [OPTIONS] DIR

    Split reads from one or more input Fastq files into new Fastqs based on
    matching supplied barcodes.

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -b INDEX_SEQ, --barcode INDEX_SEQ
                            specify index sequence to filter using
      -m N_MISMATCHES, --mismatches N_MISMATCHES
                            maximum number of differing bases to allow for two
                            index sequences to count as a match. Default is
                            zero (i.e. exact matches only)
      -n BASE_NAME, --name BASE_NAME
                            basename to use for output files
      -o OUT_DIR, --output-dir OUT_DIR
                            specify directory for output split Fastqs
      -u UNALIGNED_DIR, --unaligned UNALIGNED_DIR
                            specify subdirectory with outputs from bcl2fastq
      -l LANE, --lane LANE  specify lane to collect and split Fastqs for

.. _utilities_download_fastqs:

download_fastqs.py
******************

::

    usage: download_fastqs.py [-h] URL [DIR]

    Download checksum file and fastqs from URL into current directory (or
    directory DIR, if specified), and verify the downloaded files against the
    checksum file.

    positional arguments:
      URL         URL with checksum file and fastqs
      DIR         directory to put downloaded fastqs into (defaults to current
                  directory)

    optional arguments:
      -h, --help  show this help message and exit

.. _utilities_fastq_statistics:

fastq_statistics.py
*******************

::

    usage: fastq_statistics.py [-h] [-v] [--unaligned UNALIGNED_DIR]
                               [--sample-sheet SAMPLE_SHEET] [-o STATS_FILE]
                               [-p PER_LANE_STATS_FILE]
                               [-s PER_LANE_SAMPLE_STATS_FILE]
                               [-f FULL_STATS_FILE] [-u] [-n N] [--debug]
                               [--force]
                               ILLUMINA_RUN_DIR

    Generate statistics for FASTQ files in ILLUMINA_RUN_DIR (top-level directory
    of a processed Illumina run)

    positional arguments:
      ILLUMINA_RUN_DIR      input Illumina run directory

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit
      --unaligned UNALIGNED_DIR
                            specify an alternative name for the 'Unaligned'
                            directory containing the fastq.gz files
      --sample-sheet SAMPLE_SHEET
                            specify a sample sheet file to get additional
                            information from
      -o STATS_FILE, --output STATS_FILE
                            name of output file for per-file statistics
                            (default is 'statistics.info')
      -p PER_LANE_STATS_FILE, --per-lane-stats PER_LANE_STATS_FILE
                            name of output file for per-lane statistics
                            (default is 'per_lane_statistics.info')
      -s PER_LANE_SAMPLE_STATS_FILE, --per-lane-sample-stats PER_LANE_SAMPLE_STATS_FILE
                            name of output file for per-lane statistics
                            (default is 'per_lane_sample_stats.info')
      -f FULL_STATS_FILE, --full-stats FULL_STATS_FILE
                            name of output file for full statistics (default is
                            'statistics_full.info')
      -u, --update          update existing full statistics file with stats for
                            additional files
      -n N, --nprocessors N
                            spread work across N processors/cores (default is
                            1)
      --debug               turn on debugging output

    Deprecated/defunct options:
      --force               does nothing: retained for backwards compatibility

.. _utilities_fetch_data:

fetch_data.py
*************

::

    usage: fetch_data.py [-h] [--version] [--flatten] [--overwrite]
                         [--runner RUNNER]
                         SOURCE DEST

    Copy files and directories from arbitrary locations to the local system

    positional arguments:
      SOURCE           source data (file or directory) to copy; can be on a
                       local or remote file system.
                       NB if source is a directory then the contents are
                       copied (not the top-level directory)
      DEST             destination directory on local file system

    optional arguments:
      -h, --help       show this help message and exit
      --version        show program's version number and exit
      --flatten        copy files without replicating the source directory
                       structure
      --overwrite      overwrite existing files (default is to skip existing
                       files)
      --runner RUNNER  specify the job runner to use for executing 'rsync'
                       operations (defaults to job runner defined for copying
                       in config file [SimpleJobRunner(join_logs=True)])

.. _utilities_manage_fastqs:

manage_fastqs.py
****************

::

    usage: manage_fastqs.py DIR
           manage_fastqs.py DIR PROJECT
           manage_fastqs.py DIR PROJECT copy [[user@]host:]DEST
           manage_fastqs.py DIR PROJECT md5
           manage_fastqs.py DIR PROJECT zip

    Fastq management utility. If only DIR is supplied then list the projects; if
    PROJECT is supplied then list the fastqs; 'copy' command copies fastqs for
    the specified PROJECT to DEST on a local or remote server; 'md5' command
    generates checksums for the fastqs; 'zip' command creates a zip file with
    the fastq files.

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit
      --samples SAMPLE_LIST
                            list of names of samples to transfer
      --filter PATTERN      filter file names for reporting and copying based
                            on PATTERN
      --fastq_dir FASTQ_DIR
                            explicitly specify subdirectory of DIR with Fastq
                            files to run the QC on
      --max_zip_size MAX_ZIP_SIZE
                            for 'zip' command, defines the maximum size for the
                            output zip file; multiple zip files will be created
                            if the data exceeds this limit (default is create a
                            single zip file with no size limit)
      --link                hard link files instead of copying

.. _utilities_run_qc:

run_qc.py
*********

::

    usage: run_qc.py [-h] [--version] [--info] [-n NAME] [-o OUT_DIR]
                     [--qc_dir QC_DIR] [-f FILENAME] [-u] [--organism ORGANISM]
                     [--library-type LIBRARY] [--single-cell-platform PLATFORM]
                     [--biological-samples SAMPLES] [-p PROTOCOL]
                     [--fastq-pattern PATTERN] [--fastq_subset SUBSET]
                     [-t NTHREADS] [--star-index INDEX] [--gtf GTF]
                     [--cellranger CELLRANGER_EXE]
                     [--cellranger-reference REFERENCE]
                     [--cellranger-probeset PROBE_SET]
                     [--cellranger-vdj-reference VDJ_REFERENCE]
                     [--10x_chemistry {ARC-v1,SC3Pv1,SC3Pv2,SC3Pv3,SC3Pv3HT,SC3Pv3LT,SC3Pv4,SC5P-PE,SC5P-PE-v3,SC5P-R2,SC5P-R2-v3,auto,fiveprime,threeprime}]
                     [--10x_force_cells N_CELLS]
                     [--10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE]
                     [--10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID[,...]]
                     [--enable-conda {yes,no}] [--conda-env-dir CONDA_ENV_DIR]
                     [--local] [-c N] [-m M] [-j N] [-b NBATCHES]
                     [--protocol-spec SPECIFICATION] [--index-reads READS]
                     [--sequence-reads READS] [--qc-modules QC_MODULES]
                     [-r RUNNER] [-s N] [--ignore-metadata]
                     [--split-fastqs-by-lane] [--shorten-zip-paths {yes,no}]
                     [--use-legacy-screen-names {yes,no}] [--no-verify-fastqs]
                     [--no-multiqc] [--verbose] [--work-dir WORKING_DIR]
                     [--no-cleanup] [--fastq_screen_subset SUBSET] [--force]
                     [--multiqc]
                     DIR | FASTQ [FASTQ ...] [DIR | FASTQ [FASTQ ...] ...]

    Run the QC pipeline standalone on an arbitrary set of Fastq files.

    positional arguments:
      DIR | FASTQ [ FASTQ ... ]
                            directory or list of Fastq files to run the QC on

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --info                display information on protocols, organisms and
                            other settings (then exit)

    Output and reporting:
      -n NAME, --name NAME  name for the project (used in report title)
      -o OUT_DIR, --out_dir OUT_DIR
                            top-level directory for reports and QC output
                            subdirectory (default: current working directory)
      --qc_dir QC_DIR       explicitly specify QC output directory.
                            NB if a relative path is supplied then it's assumed
                            to be a subdirectory of OUT_DIR (default: /qc)
      -f FILENAME, --filename FILENAME
                            file name for output QC report (default:
                            /_report.html)
      -u, --update          force QC pipeline to run even if output QC
                            directory already exists in (default: stop if
                            output QC directory already exists)

    Metadata:
      --organism ORGANISM   explicitly specify organism (e.g. 'human',
                            'mouse'). Multiple organisms should be separated by
                            commas (e.g. 'human,mouse'). HINT use the --info
                            option to list the defined organisms
      --library-type LIBRARY
                            explicitly specify library type (e.g. 'RNA-seq',
                            'ChIP-seq')
      --single-cell-platform PLATFORM
                            explicitly specify the single cell platform (e.g.
                            '10xGenomics Chromium 3'v3')
      --biological-samples SAMPLES
                            explicitly specify subset of sample names with
                            biological data as comma-separated list (e.g.
                            'AB1,AB2,..') (default: assume that all samples
                            contain biological data)

    QC options:
      -p PROTOCOL, --protocol PROTOCOL
                            explicitly specify the QC protocol to use; can be
                            one of 'minimal', 'standard', 'standardSE',
                            'standardPE', '10x_scRNAseq', '10x_snRNAseq',
                            '10x_scATAC', '10x_Multiome_GEX',
                            '10x_Multiome_ATAC', '10x_OCM', '10x_CellPlex',
                            '10x_Flex', '10x_ImmuneProfiling',
                            '10x_Visium_GEX', '10x_Visium_GEX_90bp_insert',
                            '10x_Visium_PEX', '10x_Visium_legacy',
                            'ParseEvercode', 'BioRad_ddSEQ_ATAC', 'singlecell'.
                            If not set then protocol will be determined
                            automatically based on directory contents and
                            metadata.
      --fastq-pattern PATTERN
                            specify a custom pattern for identifying sample and
                            read number information from the Fastq names (e.g.
                            '{SAMPLE}_{READ}')
      --fastq_subset SUBSET
                            specify size of subset of reads to use for
                            FastQScreen, strandedness, coverage etc (default
                            100000, set to 0 to use all reads)
      -t NTHREADS, --threads NTHREADS
                            number of threads to use for multicore jobs
                            (default: taken from job runners; ignored when
                            using --local)

    Reference data:
      --star-index INDEX    specify the path to the STAR genome index to use
                            when mapping reads for metrics such as strandedness
                            etc (overrides the organism-specific indexes
                            defined in the config file)
      --gtf GTF             specify the path to the GTF annotation file to use
                            for metrics such as 'qualimap rnaseq' (overrides
                            the organism-specific GTF files defined in the
                            config file)

    Cellranger/10xGenomics options:
      --cellranger CELLRANGER_EXE
                            explicitly specify path to Cellranger executable to
                            use for single library analysis
      --cellranger-reference REFERENCE
                            specify the path to the reference dataset to use
                            when running single library analysis (overrides the
                            organism-specific references defined in the config
                            file)
      --cellranger-probeset PROBE_SET
                            specify the path to the probe set reference dataset
                            for 'cellranger multi'
      --cellranger-vdj-reference VDJ_REFERENCE
                            specify the path to the VDJ reference dataset for
                            'cellranger multi'
      --10x_chemistry {ARC-v1,SC3Pv1,SC3Pv2,SC3Pv3,SC3Pv3HT,SC3Pv3LT,SC3Pv4,SC5P-PE,SC5P-PE-v3,SC5P-R2,SC5P-R2-v3,auto,fiveprime,threeprime}
                            assay configuration for 10xGenomics scRNA-seq; if
                            set to 'auto' (the default) then cellranger will
                            attempt to determine this automatically
      --10x_force_cells N_CELLS
                            force number of cells for 10xGenomics scRNA-seq and
                            scATAC-seq, overriding automatic cell detection
                            algorithms (default is to use built-in cell
                            detection)
      --10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE
                            assign feature type to a Fastq file ID for
                            'cellranger multi' analysis (optionally also
                            specify a physical sample)
      --10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID[,...]
                            assign multiplexed sample names to CMO or probe IDs
                            for 'cellranger multi' analysis (optionally also
                            specify a physical sample)

    Conda dependency resolution:
      --enable-conda {yes,no}
                            use conda to resolve task dependencies; can be
                            'yes' or 'no' (default: no)
      --conda-env-dir CONDA_ENV_DIR
                            specify directory for conda environments (default:
                            temporary directory)

    Job control options:
      --local               run the QC on the local system (overrides any
                            runners defined in the configuration or on the
                            command line)
      -c N, --maxcores N    maximum number of cores available for QC jobs when
                            using --local (default no limit, change in settings
                            file)
      -m M, --maxmem M      maximum total memory jobs can request at once when
                            using --local (in Gbs; default: unlimited)
      -j N, --maxjobs N     explicitly specify maximum number of concurrent QC
                            jobs to run (default 12, change in settings file;
                            ignored when using --local)
      -b NBATCHES, --maxbatches NBATCHES
                            enable dynamic batching of pipeline jobs with
                            maximum number of batches set to NBATCHES (default:
                            no batching)

    Custom QC protocol:
      --protocol-spec SPECIFICATION
                            specify the QC protocol to use as a full
                            specification string
      --index-reads READS   explicitly specify the reads to treat as index
                            sequences (e.g. 'R2')
      --sequence-reads READS
                            explicitly specify the reads to treat as biological
                            sequences data (e.g. 'R1,R3')
      --qc-modules QC_MODULES
                            explicitly specify the QC modules to run (e.g.
                            'fastqc,fastq_screen')

    Advanced options:
      -r RUNNER, --runner RUNNER
                            explicitly specify runner definition for running QC
                            components. RUNNER must be a valid job runner
                            specification e.g. 'GEJobRunner(-j y)' (default:
                            use runners set in configuration)
      -s N, --batch_size N  batch QC commands with N commands per job (default:
                            no batching)
      --ignore-metadata     ignore information from project metadata file even
                            if one is located (default is to use project
                            metadata)
      --split-fastqs-by-lane
                            run QC on copies of input Fastqs where reads have
                            been split according to lane (default is to run QC
                            on original Fastqs)
      --shorten-zip-paths {yes,no}
                            shorten paths in the QC report ZIP file; can be
                            'yes' or 'no' (default: no)
      --use-legacy-screen-names {yes,no}
                            use 'legacy' naming convention for FastqScreen
                            output files; can be 'yes' or 'no' (default: no)
      --no-verify-fastqs    skip Fastq verification (default: verify Fastqs
                            before running QC)
      --no-multiqc          turn off generation of MultiQC report

    Debugging options:
      --verbose             run pipeline in 'verbose' mode
      --work-dir WORKING_DIR
                            specify the working directory for the pipeline
                            operations
      --no-cleanup          don't remove the temporary project directory on
                            completion (by default the temporary directory is
                            deleted)

    Deprecated/redundant options:
      --fastq_screen_subset SUBSET
                            redundant: use the --fastq_subset option instead
      --force               redundant: HTML report generation will always be
                            attempted (even when pipeline fails)
      --multiqc             redundant: MultiQC report is generated by default
                            (use --no-multiqc to disable)

.. _utilities_transfer_data:

transfer_data.py
****************

::

    usage: transfer_data.py [-h] [--version] [--subdir {random_bin,run_id}]
                            [--samples SAMPLE_LIST] [--filter FILTER_PATTERN]
                            [--no_fastqs] [--zip_fastqs]
                            [--max_zip_size MAX_ZIP_SIZE]
                            [--readme README_TEMPLATE] [--weburl WEBURL]
                            [--include_qc_report] [--include_10x_outputs]
                            [--include_cloupe_files] [--include_visium_images]
                            [--include_downloader] [--link] [--runner RUNNER]
                            [--dry-run]
                            DEST PROJECT

    Transfer copies of Fastq data from an analysis project to an arbitrary
    destination for sharing with other people

    positional arguments:
      DEST                  destination to copy Fastqs to; can be the name of a
                            destination defined in the configuration file, or
                            an arbitrary location of the form
                            '[[USER@]HOST:]DIR' (no destinations currently
                            defined)
      PROJECT               path to project directory (or to a Fastqs
                            subdirectory in a project) to copy Fastqs from

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --subdir {random_bin,run_id}
                            subdirectory naming scheme: 'random_bin' locates a
                            random pre-existing empty subdirectory under the
                            target directory; 'run_id' creates a new
                            subdirectory 'PLATFORM_DATESTAMP.RUN_ID-PROJECT'.
                            If this option is not set then no subdirectory will
                            be used

    Fastq selection:
      --samples SAMPLE_LIST
                            list of names of samples to transfer
      --filter FILTER_PATTERN
                            filter Fastq file names based on PATTERN
      --no_fastqs           don't copy Fastqs (other artefacts will be copied,
                            if specified)

    ZIP file archives:
      --zip_fastqs          put Fastqs into a ZIP file
      --max_zip_size MAX_ZIP_SIZE
                            when using '--zip_fastqs' option, defines the
                            maximum size for the output zip file; multiple zip
                            files will be created if the data exceeds this
                            limit (default is create a single zip file with no
                            size limit)

    README generation:
      --readme README_TEMPLATE
                            template file to generate README file from; can be
                            full path to a template file, or the name of a file
                            in the 'templates' directory
      --weburl WEBURL       base URL for webserver (sets the value of the
                            WEBURL variable in the template README)

    Additional artefacts:
      --include_qc_report   copy the zipped QC reports to the final location
      --include_10x_outputs
                            copy outputs from 10xGenomics pipelines (e.g.
                            'cellranger count') to the final location
      --include_cloupe_files
                            copy .cloupe files output from 10xGenomics
                            pipelines to the final location
      --include_visium_images
                            copy images for 10x Genomics Visium projects to the
                            final location
      --include_downloader  copy the 'download_fastqs.py' utility to the final
                            location

    Advanced options:
      --link                hard link files instead of copying
      --runner RUNNER       specify the job runner to use for executing the
                            checksumming, Fastq copy and tar gzipping
                            operations (defaults to the 'transfer_data' job
                            runner defined in config file
                            [SimpleJobRunner(join_logs=True)])
      --dry-run             report what would be done but don't actually
                            transfer any data
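
The ``DEST`` argument of ``transfer_data.py`` (and the ``copy`` command of
``manage_fastqs.py``) accepts locations of the form ``[[USER@]HOST:]DIR``.
The following is a minimal sketch of how such a string breaks down into its
parts. ``split_destination`` is a hypothetical helper written for this
illustration; it is not part of the utilities and may differ from how they
parse destinations internally. It also assumes that local directory names
do not contain a ``:`` character.

.. code-block:: python

    def split_destination(dest):
        """Split '[[USER@]HOST:]DIR' into a (user, host, directory) tuple.

        Hypothetical helper for illustration only. Assumes that a local
        directory name never contains ':'.
        """
        user = None
        host = None
        directory = dest
        # A ':' separates the optional host part from the directory
        if ":" in dest:
            host, directory = dest.split(":", 1)
            # An '@' in the host part separates the optional user name
            if "@" in host:
                user, host = host.split("@", 1)
        return (user, host, directory)


    # Remote destination with an explicit user
    print(split_destination("jbloggs@remote.example.org:/shared/data"))
    # Remote destination without a user
    print(split_destination("remote.example.org:/shared/data"))
    # Local destination
    print(split_destination("/shared/data"))

Missing components come back as ``None``, so a caller can tell a local
directory (no host) from a remote one.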