`auto_process_ngs.qc.pipeline`

Pipeline components for running the QC pipeline.

Pipeline classes:

QCPipeline

Pipeline task classes:

SetupQCDirs
SplitFastqsByLane
GetSequenceDataSamples
GetSequenceDataFastqs
UpdateQCMetadata
VerifyFastqs
GetSeqLengthStats
CheckFastqScreenOutputs
RunFastqScreen
CheckFastQCOutputs
RunFastQC
SetupFastqStrandConf
CheckFastqStrandOutputs
RunFastqStrand
DetermineRequired10xPackage
GetCellrangerReferenceData
MakeCellrangerArcCountLibraries
GetCellrangerMultiConfig
CheckCellrangerCountOutputs
RunCellrangerCount
RunCellrangerMulti
SetCellCountFromCellranger
GetReferenceDataset
GetBAMFiles
RunRSeQCGenebodyCoverage
RunPicardCollectInsertSizeMetrics
CollateInsertSizes
ConvertGTFToBed
RunRSeQCInferExperiment
RunQualimapRnaseq
ReportQC

Also imports the following pipeline tasks:

Get10xPackage

class auto_process_ngs.qc.pipeline.CheckCellrangerCountOutputs(_name, *args, **kws)

Check the outputs from cellranger(-atac) count

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, fastq_dir=None, samples=None, qc_dir=None, qc_module=None, extra_projects=None, cellranger_version=None, cellranger_ref_data=None, verbose=False)

Initialise the CheckCellrangerCountOutputs task.

Parameters:

project (AnalysisProject) – project to run QC for
fastq_dir (str) – directory holding Fastq files
samples (list) – list of samples to restrict checks to (all samples in project are checked by default)
qc_dir (str) – top-level QC directory to look for ‘count’ QC outputs (e.g. metrics CSV and summary HTML files)
qc_module (str) – QC protocol being used
extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline
cellranger_version (str) – version number of 10xGenomics package
cellranger_ref_data (str) – name or path to reference dataset for single library analysis
verbose (bool) – if True then print additional information from the task

Outputs:

fastq_dir (PipelineParam): pipeline parameter instance that resolves to a string with the path to directory with Fastq files
samples (list): list of sample names that have missing outputs from ‘cellranger count’

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CheckFastQCOutputs(_name, *args, **kws)

Check the outputs from FastQC

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, read_numbers, fastqs=None, verbose=False)

Initialise the CheckFastQCOutputs task.

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
read_numbers (list) – list of read numbers to include
fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)
verbose (bool) – if True then print additional information from the task

Outputs:

fastqs (list): list of Fastqs that have missing FastQC outputs under the specified QC protocol
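
The kind of check described above can be sketched as follows. This is a minimal illustration and not the task's actual implementation: the helper name `fastqs_missing_fastqc_outputs` and the example paths are hypothetical, and the only thing assumed about FastQC is its usual convention of writing `<name>_fastqc.html` and `<name>_fastqc.zip` for each input file.

```python
import os
import tempfile

def fastqs_missing_fastqc_outputs(fastqs, qc_dir):
    """Return the Fastqs lacking a FastQC HTML report or ZIP archive in qc_dir.

    Hypothetical helper for illustration only.
    """
    missing = []
    for fq in fastqs:
        # Strip the directory and any Fastq extensions to get the base name
        base = os.path.basename(fq)
        for ext in (".gz", ".fastq", ".fq"):
            if base.endswith(ext):
                base = base[:-len(ext)]
        # FastQC conventionally writes <base>_fastqc.html and <base>_fastqc.zip
        expected = [os.path.join(qc_dir, base + "_fastqc" + suffix)
                    for suffix in (".html", ".zip")]
        if not all(os.path.exists(f) for f in expected):
            missing.append(fq)
    return missing

# Demonstration with a temporary QC directory holding outputs for S1 only
with tempfile.TemporaryDirectory() as qc_dir:
    for suffix in (".html", ".zip"):
        open(os.path.join(qc_dir, "S1_R1_fastqc" + suffix), "w").close()
    result = fastqs_missing_fastqc_outputs(
        ["/data/S1_R1.fastq.gz", "/data/S2_R1.fastq.gz"], qc_dir)
print(result)
```

A list produced this way is the sort of value that a downstream "run" task would then consume, so that only Fastqs without existing outputs are processed.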

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CheckFastqScreenOutputs(_name, *args, **kws)

Check the outputs from FastqScreen

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, screens, fastqs=None, read_numbers=None, include_samples=None, fastq_attrs=None, legacy=False, verbose=False)

Initialise the CheckFastqScreenOutputs task.

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
screens (mapping) – mapping of screen names to FastqScreen conf files
fastqs (list) – explicit list of Fastq files to check against (default is to use Fastqs from supplied analysis project)
read_numbers (list) – read numbers to include
include_samples (list) – optional, list of sample names to include
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
legacy (bool) – if True then use ‘legacy’ naming convention for output files (default is to use new format)
verbose (bool) – if True then print additional information from the task

Outputs:

fastqs (list): list of Fastqs that have missing FastqScreen outputs under the specified QC protocol

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CheckFastqStrandOutputs(_name, *args, **kws)

Check the outputs from the fastq_strand.py utility

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, fastq_strand_conf, fastqs=None, read_numbers=None, include_samples=None, verbose=False)

Initialise the CheckFastqStrandOutputs task.

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
fastq_strand_conf (str) – path to the fastq_strand config file
fastqs (list) – explicit list of Fastq files to check against (default is to use Fastqs from supplied analysis project)
read_numbers (list) – list of read numbers to include when checking outputs
include_samples (list) – optional, list of sample names to include
verbose (bool) – if True then print additional information from the task

Outputs:

fastq_pairs (list): list of tuples with Fastq “pairs” that have missing outputs from fastq_strand.py under the specified QC protocol. A “pair” may be an (R1,R2) tuple, or a single Fastq (e.g. (fq,)).
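
To make the shape of the ‘fastq_pairs’ output concrete, here is a small sketch of a list holding both a paired-end and a single-end entry, and of code that handles either form. The file names and the `describe_pair` helper are hypothetical and are not part of the module.

```python
# Example 'fastq_pairs' value: each entry is a tuple, either (R1, R2)
# for paired-end data or a one-element tuple for single-end data
fastq_pairs = [
    ("S1_R1.fastq.gz", "S1_R2.fastq.gz"),  # paired-end: (R1, R2)
    ("S2_R1.fastq.gz",),                   # single-end: (fq,)
]

def describe_pair(pair):
    """Return a short description of a Fastq 'pair' (hypothetical helper)."""
    if len(pair) == 2:
        return "paired: %s + %s" % pair
    return "single: %s" % pair[0]

descriptions = [describe_pair(pair) for pair in fastq_pairs]
for line in descriptions:
    print(line)
```

Note the trailing comma in the single-end case: `("S2_R1.fastq.gz",)` is a tuple, whereas `("S2_R1.fastq.gz")` would be a plain string.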

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CollateInsertSizes(_name, *args, **kws)

Collate insert size metrics data from multiple BAMs

Gathers together the Picard insert size data from a set of BAM files and puts them into a single TSV file.
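
As a rough sketch of the collation step, the following writes one row per BAM file to a delimited file. It assumes the insert size values have already been extracted from the Picard outputs; the function name, the column headings and the metric values are all hypothetical and may not match what the task actually writes.

```python
import csv
import io

def collate_insert_sizes(metrics_by_bam, out, delimiter="\t"):
    """Write one line of insert size data per BAM file to 'out'.

    'metrics_by_bam' maps BAM file names to dicts with 'mean' and
    'median' values (hypothetical structure, for illustration only).
    """
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    # Header line (column names are assumptions)
    writer.writerow(["Bam file", "Mean insert size", "Median insert size"])
    # One data line per BAM file, in sorted order
    for bam in sorted(metrics_by_bam):
        metrics = metrics_by_bam[bam]
        writer.writerow([bam, metrics["mean"], metrics["median"]])

# Demonstration writing to an in-memory buffer instead of a file
buf = io.StringIO()
collate_insert_sizes(
    {"S2.bam": {"mean": 210.5, "median": 205},
     "S1.bam": {"mean": 198.2, "median": 190}},
    buf)
tsv = buf.getvalue()
print(tsv)
```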

init(bam_files, picard_out_dir, out_file, delimiter='\t')

Initialise the CollateInsertSizes task

Parameters:

bam_files (list) – list of paths to BAM files to get associated insert size data for
picard_out_dir (str) – path to the directory containing the Picard CollectInsertSizeMetrics output files
out_file (str) – path to the output TSV file
delimiter (str) – specify the delimiter to use in the output file

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.ConvertGTFToBed(_name, *args, **kws)

Convert a GTF file to a BED file using BEDOPS ‘gtf2bed’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(gtf_in, bed_out)

Initialise the ConvertGTFToBed task

Parameters:

gtf_in (str) – path to the input GTF file
bed_out (str) – path to the output BED file

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.DetermineRequired10xPackage(_name, *args, **kws)

Determine which 10xGenomics software package is required

By default determines the package name based on the supplied QC module, but this can be overridden by explicitly supplying a required package (which can also be a path to an executable).

The output ‘require_cellranger’ parameter should be supplied to the ‘Get10xPackage’ task, which will do the job of actually locating an executable.

init(qc_module, require_cellranger=None)

Initialise the DetermineRequired10xPackage task

Parameters:

qc_module (str) – QC module being used
require_cellranger (str) – optional package name or path to an executable; if supplied then overrides the automatic package determination

Outputs:

require_cellranger (PipelineParam): the 10xGenomics software package name or path to use

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetBAMFiles(_name, *args, **kws)

Create BAM files from Fastqs using STAR

Runs STAR to generate BAM files from Fastq files. The BAMs are then sorted and indexed using samtools.
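
The align, sort and index sequence can be sketched as a set of command lines. This is an illustration under stated assumptions rather than the commands the task actually runs: the function name and file naming are hypothetical, while the STAR options (`--runThreadN`, `--genomeDir`, `--readFilesIn`, `--readFilesCommand`, `--outSAMtype`, `--outFileNamePrefix`) and the `samtools sort` and `samtools index` subcommands are standard.

```python
def star_and_samtools_commands(fastqs, star_index, out_prefix, nthreads=1):
    """Build command lines to align Fastqs with STAR, then sort and index.

    Hypothetical helper: returns the commands as lists without running them.
    """
    star_cmd = ["STAR",
                "--runThreadN", str(nthreads),
                "--genomeDir", star_index,
                "--readFilesIn"] + list(fastqs) + [
                "--outSAMtype", "BAM", "Unsorted",
                "--outFileNamePrefix", out_prefix]
    # Gzipped Fastqs need to be decompressed on the fly
    if any(fq.endswith(".gz") for fq in fastqs):
        star_cmd += ["--readFilesCommand", "zcat"]
    # STAR writes unsorted BAM output to '<prefix>Aligned.out.bam'
    unsorted_bam = out_prefix + "Aligned.out.bam"
    sorted_bam = out_prefix + "sorted.bam"
    sort_cmd = ["samtools", "sort", "-o", sorted_bam, unsorted_bam]
    index_cmd = ["samtools", "index", sorted_bam]
    return star_cmd, sort_cmd, index_cmd

star_cmd, sort_cmd, index_cmd = star_and_samtools_commands(
    ["S1_R1.fastq.gz", "S1_R2.fastq.gz"], "/refs/star/hg38", "S1_", nthreads=4)
for cmd in (star_cmd, sort_cmd, index_cmd):
    print(" ".join(cmd))
```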

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(fastqs, star_index, out_dir, subset_size=None, nthreads=None, reads=None, include_samples=None, fastq_attrs=None, verbose=False)

Initialise the GetBAMFiles task

Parameters:

fastqs (list) – list of Fastq files to generate BAM files from
star_index (str) – path to STAR index to use
out_dir (str) – path to directory to write final BAM files to
subset_size (int) – specify size of a random subset of reads to use in BAM file generation
nthreads (int) – number of cores for STAR to use
reads (list) – optional, list of read numbers to include (e.g. [1,2], [2] etc)
include_samples (list) – optional, list of sample names to include
fastq_attrs (IlluminaFastq) – optional, class to use for extracting information from Fastq file names
verbose (bool) – if True then print additional information from the task

Outputs:

bam_files: list of sorted BAM files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetCellrangerMultiConfig(_name, *args, **kws)

Locate ‘config.csv’ files for cellranger multi

init(project, qc_dir)

Initialise the GetCellrangerMultiConfig task.

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – top-level QC directory to put ‘config.csv’ files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetCellrangerReferenceData(_name, *args, **kws)

init(project, organism=None, transcriptomes=None, premrna_references=None, atac_references=None, multiome_references=None, cellranger_exe=None, cellranger_version=None, force_reference_data=None)

Initialise the GetCellrangerReferenceData task

Parameters:

project (AnalysisProject) – project to run QC for
organism (str) – if supplied then must be a string with the names of one or more organisms, with multiple organisms separated by spaces (defaults to the organisms associated with the project)
transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger
premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger
atac_references (mapping) – mapping of organism names to reference genome data for cellranger-atac
multiome_references (mapping) – mapping of organism names to reference datasets for cellranger-arc
cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)
cellranger_version (str) – the version string for the Cellranger package
force_reference_data (str) – if supplied then will be used as the reference dataset, instead of trying to locate appropriate reference data automatically

Outputs:

reference_data_path (PipelineParam): pipeline parameter instance which resolves to a string with the path to the reference data set corresponding to the supplied organism.

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetReferenceDataset(_name, *args, **kws)

Acquire reference data for an organism from mapping

Generic lookup task which attempts to locate the matching reference dataset from a mapping/dictionary.

init(organism, references, force_reference=None)

Initialise the GetReferenceDataset task

Parameters:

organism (str) – name of the organism
references (mapping) – mapping with organism names as keys and reference datasets as corresponding values
force_reference (str) – if specified then return the supplied value instead of determining from the organism

Outputs:

reference_dataset: reference dataset (set to None if no dataset could be located)
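
The lookup behaviour described by the parameters and output above can be sketched as a plain function. This is a simplified stand-in, not the task's code: the function name and the example paths are hypothetical, and it assumes an exact match on the organism name with no normalisation.

```python
def get_reference_dataset(organism, references, force_reference=None):
    """Look up the reference dataset for an organism (illustrative sketch).

    Returns 'force_reference' if supplied, otherwise the entry for
    'organism' in the 'references' mapping, or None if there is no match.
    """
    if force_reference is not None:
        # An explicitly supplied value always takes precedence
        return force_reference
    if not organism or not references:
        return None
    # Assumes an exact match on the organism name
    return references.get(organism)

references = {"human": "/refs/hg38", "mouse": "/refs/mm10"}
found = get_reference_dataset("mouse", references)
forced = get_reference_dataset("mouse", references, force_reference="/refs/custom")
not_found = get_reference_dataset("zebrafish", references)
print(found, forced, not_found)
```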

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSeqLengthStats(_name, *args, **kws)

Get data on sequence lengths, masking and padding for Fastqs in a project, and write the data to JSON files.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, read_numbers=None, fastqs=None, fastq_attrs=None)

Initialise the GetSeqLengthStats task

Parameters:

project (AnalysisProject) – project with Fastqs to get the sequence length data from
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
read_numbers (sequence) – list of read numbers to include (or None to include all reads)
fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSequenceDataFastqs(_name, *args, **kws)

Set up Fastqs with sequence (i.e. biological) data

init(project, out_dir, read_range, samples, fastq_attrs, fastqs=None)

Initialise the GetSequenceDataFastqs task

Parameters:

project (AnalysisProject) – project to get Fastqs for
out_dir (str) – path to directory to write final Fastq files to
read_range (dict) – mapping of read names to tuples of subsequence ranges
samples (list) – list of samples with sequence data
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)

Outputs:

fastqs (list): list of Fastqs with biological data

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSequenceDataSamples(_name, *args, **kws)

Identify samples with sequence (i.e. biological) data

init(project, fastq_attrs)

Initialise the GetSequenceDataSamples task

Parameters:

project (AnalysisProject) – project to get samples for
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

Outputs:

seq_data_samples (list): list of samples with biological data

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.MakeCellrangerArcCountLibraries(_name, *args, **kws)

Make ‘libraries.csv’ files for cellranger-arc count

init(project, qc_dir)

Initialise the MakeCellrangerArcCountLibraries task.

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – top-level QC directory to put ‘libraries.csv’ files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.QCPipeline

Run the QC pipeline on one or more projects

Pipeline to run QC on multiple projects.

Example usage:

>>> qc = QCPipeline()
>>> qc.add_project(AnalysisProject("AB","./AB"))
>>> qc.add_project(AnalysisProject("CDE","./CDE"))
>>> qc.run()
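
Several of the keyword arguments accepted by ‘run’ are mappings. The sketch below shows one plausible way to assemble them before making the call; the argument names are taken from the ‘run’ signature, but the screen names, organism IDs and file paths are hypothetical placeholders.

```python
# Mapping of screen IDs to FastqScreen conf files (hypothetical values)
fastq_screens = {
    "model_organisms": "/data/screens/model_organisms.conf",
    "rRNA": "/data/screens/rRNA.conf",
}

# Mapping of organism IDs to directories with STAR indexes (hypothetical)
star_indexes = {
    "human": "/data/star/hg38",
    "mouse": "/data/star/mm10",
}

# Mapping of organism IDs to GTF annotation files (hypothetical)
annotation_gtf_files = {
    "human": "/data/annotation/hg38.gtf",
    "mouse": "/data/annotation/mm10.gtf",
}

# Collect the keyword arguments in one place
run_kwargs = dict(
    fastq_screens=fastq_screens,
    star_indexes=star_indexes,
    annotation_gtf_files=annotation_gtf_files,
    max_jobs=4,          # allow up to 4 concurrent jobs in the scheduler
    poll_interval=5,     # scheduler polling interval in seconds
)

# With a configured pipeline instance 'qc' this would then be passed as:
#   qc.run(**run_kwargs)
print(sorted(run_kwargs))
```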

add_cellranger_count(project_name, project, qc_dir, organism, fastq_dir, qc_module_name, library_type, chemistry, force_cells, samples=None, fastq_dirs=None, reference_dataset=None, extra_projects=None, required_tasks=None)

Add tasks to pipeline to run ‘cellranger* count’

Parameters:

project_name (str) – name to associate with project for reporting tasks
project (AnalysisProject) – project to run 10x cellranger pipeline within
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
organism (str) – organism for pipeline
fastq_dir (str) – directory holding Fastq files
qc_module_name (str) – QC module being used
library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)
chemistry (str) – chemistry to use in single library analysis
force_cells (int) – if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)
samples (list) – optional, list of samples to restrict single library analyses to (or None to use all samples in project)
fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the paths set by the ‘fastq_dir’ argument
reference_dataset (str) – optional, path to reference dataset (otherwise will be determined automatically based on organism)
extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline
required_tasks (list) – list of tasks that the cellranger pipeline should wait for

add_project(project, protocol, qc_dir=None, organism=None, fastq_dir=None, report_html=None, multiqc=False, sample_pattern=None, log_dir=None, convert_gtf=True, verify_fastqs=False, split_fastqs_by_lane=False)

Add a project to the QC pipeline

Parameters:

project (AnalysisProject) – project to run QC for
protocol (QCProtocol) – QC protocol to use
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
organism (str) – organism(s) for project (defaults to organism defined in project metadata)
fastq_dir (str) – directory holding Fastq files (defaults to primary fastq_dir in project)
multiqc (bool) – if True then also run MultiQC (default is not to run MultiQC)
sample_pattern (str) – glob-style pattern to match a subset of projects and samples (not implemented)
log_dir (str) – directory to write log files to (defaults to ‘logs’ subdirectory of the QC directory)
convert_gtf (bool) – if True then convert GTF files to BED for ‘infer_experiment.py’ (default; otherwise only use the explicitly defined BED files)
verify_fastqs (bool) – if True then verify Fastq integrity as part of the pipeline (default: False, skip verification)
split_fastqs_by_lane (bool) – if True then split input Fastqs into lanes and run QC per lane (default: False, don’t split QC by lanes)

add_task(task, requires=(), **kws)

Override base class method

Automatically set log dir when tasks are added

property default_log_dir: Return current value of default log dir

run(nthreads=None, fastq_screens=None, star_indexes=None, annotation_bed_files=None, annotation_gtf_files=None, fastq_subset=None, cellranger_chemistry='auto', cellranger_force_cells=None, cellranger_transcriptomes=None, cellranger_premrna_references=None, cellranger_atac_references=None, cellranger_arc_references=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, cellranger_exe=None, cellranger_extra_projects=None, cellranger_reference_dataset=None, cellranger_out_dir=None, force_star_index=None, force_gtf_annotation=None, working_dir=None, log_file=None, batch_size=None, batch_limit=None, max_jobs=1, max_slots=None, poll_interval=5, runners=None, default_runner=None, enable_conda=False, conda=None, conda_env_dir=None, envmodules=None, legacy_screens=False, verbose=False)

Run the tasks in the pipeline

Parameters:

nthreads (int) – number of threads/processors to use for QC jobs (defaults to number of slots set in job runners)
fastq_screens (dict) – mapping of screen IDs to FastqScreen conf files
star_indexes (dict) – mapping of organism IDs to directories with STAR indexes
annotation_bed_files (dict) – mapping of organism IDs to BED files with annotation data
annotation_gtf_files (dict) – mapping of organism IDs to GTF files with annotation data
fastq_subset (int) – explicitly specify the subset size to use when running on subsets of Fastqs
cellranger_chemistry (str) – explicitly specify the assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)
cellranger_force_cells (int) – explicitly specify number of cells for ‘cellranger’ and ‘cellranger-atac’ (set to ‘None’ to use the cell detection algorithm; ignored for ‘cellranger-arc’)
cellranger_transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger
cellranger_premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger
cellranger_atac_references (mapping) – mapping of organism names to ATAC-seq reference genome data for cellranger-atac
cellranger_arc_references (mapping) – mapping of organism names to multiome reference datasets for cellranger-arc
cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)
cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)
cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)
cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)
cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (default: None)
cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)
cellranger_exe (str) – optional, explicitly specify the cellranger executable to use for single library analysis (default: cellranger executable is determined automatically)
cellranger_extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline
cellranger_reference_dataset (str) – optional, explicitly specify the path to the reference dataset to use for single library analysis (default: reference dataset is determined automatically)
cellranger_out_dir (str) – specify directory to put full cellranger outputs into (default: project directory)
force_star_index (str) – explicitly specify STAR index to use (default: index is determined automatically)
force_gtf_annotation (str) – explicitly specify GTF annotation to use (default: annotation file is determined automatically)
working_dir (str) – optional path to a working directory (defaults to temporary directory in the current directory)
log_dir (str) – path of directory where log files will be written to
batch_size (int) – if set then run commands in each task in batches, with each batch running this many commands at a time (default is to run one command per job)
batch_limit (int) – if set then run commands in each task in batches, with the batch size set dynamically so as not to exceed this limit (default is to use fixed batch sizes)
max_jobs (int) – optional maximum number of concurrent jobs in scheduler (defaults to 1)
max_slots (int) – optional maximum number of ‘slots’ (i.e. concurrent threads or maximum number of CPUs) available to the scheduler (defaults to no limit)
poll_interval (float) – optional polling interval (seconds) to set in scheduler (defaults to 5s)
runners (dict) – mapping of names to JobRunner instances; valid names are ‘fastqc_runner’, ‘fastq_screen_runner’, ‘star_runner’, ‘rseqc_runner’, ‘qualimap_runner’, ‘cellranger_count_runner’, ‘cellranger_multi_runner’, ‘report_runner’, ‘verify_runner’, and ‘default’
enable_conda (bool) – if True then enable use of conda environments to satisfy task dependencies
conda (str) – path to conda
conda_env_dir (str) – path to non-default directory for conda environments
envmodules (mapping) – mapping of names to environment module file lists; valid names are ‘fastqc’, ‘fastq_screen’, ‘fastq_strand’, ‘cellranger’, ‘report_qc’
default_runner (JobRunner) – optional default job runner to use
legacy_screens (bool) – if True then use ‘legacy’ naming convention for FastqScreen outputs
verbose (bool) – if True then report additional information for diagnostics
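
The difference between ‘batch_size’ and ‘batch_limit’ can be illustrated with a small sketch. It assumes that ‘batch_limit’ is a cap on the number of batches, which the description above does not state explicitly; the `make_batches` helper is hypothetical and is not the pipeline's own batching code.

```python
import math

def make_batches(cmds, batch_size=None, batch_limit=None):
    """Group commands into batches (illustrative sketch only).

    batch_size:  fixed number of commands per batch
    batch_limit: assumed here to be the maximum number of batches; the
                 batch size is derived so this number is not exceeded
    """
    if batch_limit:
        # Choose the smallest batch size that keeps within the limit
        batch_size = max(1, math.ceil(len(cmds) / batch_limit))
    if not batch_size:
        # Default: one command per batch
        batch_size = 1
    return [cmds[i:i + batch_size] for i in range(0, len(cmds), batch_size)]

cmds = ["cmd%d" % i for i in range(10)]
default_sizes = [len(b) for b in make_batches(cmds)]
fixed_sizes = [len(b) for b in make_batches(cmds, batch_size=4)]
limited_sizes = [len(b) for b in make_batches(cmds, batch_limit=3)]
print(default_sizes, fixed_sizes, limited_sizes)
```

Under this assumption, ten commands with a limit of three are grouped into batches of four, four and two commands.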

set_default_log_dir(log_dir): Set the default log directory for tasks

class auto_process_ngs.qc.pipeline.ReportQC(_name, *args, **kws)

Generate the QC report

init(project, qc_dir, report_html=None, fastq_dir=None, multiqc=False, force=False, zip_outputs=True)

Initialise the ReportQC task.

Parameters:

project (AnalysisProject) – project to generate QC report for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
report_html (str) – set the name of the output HTML file for the report
fastq_dir (str) – directory holding Fastq files (defaults to current fastq_dir in project)
multiqc (bool) – if True then also generate MultiQC report (default: don’t run MultiQC)
force (bool) – if True then force HTML report to be generated even if QC outputs fail verification (default: don’t write report)
zip_outputs (bool) – if True then also generate a ZIP archive of the QC reports

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunCellrangerCount(_name, *args, **kws)

Run ‘cellranger count’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(samples, fastq_dir, reference_data_path, library_type, out_dir, qc_dir=None, cellranger_exe=None, cellranger_version=None, chemistry='auto', fastq_dirs=None, force_cells=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None)

Initialise the RunCellrangerCount task.

Parameters:

samples (list) – list of sample names to run cellranger count on (it is expected that this list will come from the CheckCellrangerCountOutputs task)
fastq_dir (str) – path to directory holding the Fastq files
reference_data_path (str) – path to the cellranger compatible reference dataset
library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)
out_dir (str) – top-level directory to copy all final ‘count’ outputs into. Outputs won’t be copied if no value is supplied
qc_dir (str) – top-level QC directory to put ‘count’ QC outputs (e.g. metrics CSV and summary HTML files) into. Outputs won’t be copied if no value is supplied
cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)
cellranger_version (str) – the version string for the Cellranger package
fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the paths set by the ‘fastq_dir’ argument
force_cells (int) – optional, if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)
chemistry (str) – assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)
cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)
cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)
cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)
cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)
cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (defaults to number of slots set in runner)
cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunCellrangerMulti(_name, *args, **kws)

Run ‘cellranger multi’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, config_csvs, samples, reference_data_path, out_dir, qc_dir=None, cellranger_exe=None, cellranger_version=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, working_dir=None)

Initialise the RunCellrangerMulti task.

Parameters:

project (AnalysisProject) – project to run QC for
config_csvs (list) – list of paths to ‘cellranger multi’ configuration files
samples (list) – list of sample names from the config.csv file
reference_data_path (str) – path to the cellranger compatible reference dataset from the config.csv file
out_dir (str) – top-level directory to copy all final ‘count’ outputs into. Outputs won’t be copied if no value is supplied
qc_dir (str) – top-level QC directory to put ‘count’ QC outputs (e.g. metrics CSV and summary HTML files) into. Outputs won’t be copied if no value is supplied
cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)
cellranger_version (str) – the version string for the Cellranger package
cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)
cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)
cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)
cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)
cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (defaults to number of slots set in runner)
cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunFastQC(_name, *args, **kws)

Run FastQC

init(fastqs, qc_dir, nthreads=None)

Initialise the RunFastQC task.

Parameters:

fastqs (list) – list of paths to Fastq files to run FastQC on (it is expected that this list will come from the CheckFastQCOutputs task)
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
nthreads (int) – number of threads/processors to use (defaults to number of slots set in runner)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunFastqScreen(_name, *args, **kws)

Run FastqScreen

init(fastqs, qc_dir, screens, subset=None, nthreads=None, read_numbers=None, fastq_attrs=None, legacy=False)

Initialise the RunFastqScreen task.

Parameters:

fastqs (list) – list of paths to Fastq files to run FastqScreen on (it is expected that this list will come from the CheckFastqScreenOutputs task)
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
screens (mapping) – mapping of screen names to FastqScreen conf files
subset (int) – explicitly specify the subset size for running FastqScreen
nthreads (int) – number of threads/processors to use (defaults to number of slots set in runner)
read_numbers (list) – list of read numbers to include when running Fastq Screen
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
legacy (bool) – if True then use ‘legacy’ naming convention for output files (default is to use new format)
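
As an illustration of how the screens mapping and the optional subset might be used, the sketch below builds one fastq_screen command per named screen. The function name fastq_screen_commands and the choice of a separate output subdirectory per screen are assumptions made for this example; only the fastq_screen options --conf, --threads, --outdir, --force and --subset are taken from the Fastq Screen command line.

```python
import os

def fastq_screen_commands(fastq, qc_dir, screens, subset=None, nthreads=1):
    """Return one fastq_screen command (as a list) per named screen.

    Hypothetical helper for illustration only.
    """
    commands = []
    for name in sorted(screens):
        cmd = ["fastq_screen",
               "--conf", screens[name],
               "--threads", str(nthreads),
               # Assumption: each screen writes to its own subdirectory
               "--outdir", os.path.join(qc_dir, name),
               "--force"]
        if subset is not None:
            # Only override the subset size when one was given
            cmd.extend(["--subset", str(subset)])
        cmd.append(fastq)
        commands.append(cmd)
    return commands
```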

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunFastqStrand(_name, *args, **kws)

Run the fastq_strand.py utility

init(fastq_pairs, qc_dir, fastq_strand_conf, fastq_strand_subset=None, nthreads=None)

Initialise the RunFastqStrand task.

Parameters:

fastq_pairs (list) – list of tuples with “pairs” of Fastq files to run fastq_strand.py on (it is expected that this list will come from the CheckFastqStrandOutputs task)
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
fastq_strand_conf (str) – path to the fastq_strand config file to use
fastq_strand_subset (int) – explicitly specify the subset size for running fastq_strand
nthreads (int) – number of threads/processors to use (defaults to number of slots set in job runner)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunPicardCollectInsertSizeMetrics(_name, *args, **kws)

Run Picard ‘CollectInsertSizeMetrics’ on BAM files

Given a list of BAM files, for each file first runs the Picard ‘CleanSam’ utility (to remove alignments that would otherwise cause problems for the insert size calculations) and then ‘CollectInsertSizeMetrics’ to generate the insert size metrics.

Note that this task should only be run on BAM files with paired-end data.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, out_dir)

Initialise the RunPicardCollectInsertSizeMetrics task

Parameters:

bam_files (list) – list of paths to BAM files to run CollectInsertSizeMetrics on
out_dir (str) – path to a directory where the output files will be written
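
The two-step procedure described for this task (CleanSam followed by CollectInsertSizeMetrics) can be sketched as a pair of commands per BAM file. This is an illustrative sketch only: it assumes a picard wrapper executable on the PATH and Picard's newer dash-style option syntax, and the function name and output file names are hypothetical rather than those used by the task.

```python
import os

def picard_insert_size_commands(bam_file, out_dir):
    """Return the CleanSam and CollectInsertSizeMetrics commands for a BAM.

    Hypothetical helper; output file names are illustrative.
    """
    basename = os.path.splitext(os.path.basename(bam_file))[0]
    cleaned_bam = os.path.join(out_dir, basename + ".cleaned.bam")
    metrics = os.path.join(out_dir, basename + ".insert_size_metrics.txt")
    histogram = os.path.join(out_dir, basename + ".insert_size_histogram.pdf")
    # Step 1: clean the alignments
    clean_sam = ["picard", "CleanSam", "-I", bam_file, "-O", cleaned_bam]
    # Step 2: collect the metrics from the cleaned BAM, not the original
    collect = ["picard", "CollectInsertSizeMetrics",
               "-I", cleaned_bam, "-O", metrics, "-H", histogram]
    return [clean_sam, collect]
```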

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunQualimapRnaseq(_name, *args, **kws)

Run Qualimap’s ‘rnaseq’ module on BAM files

Given a list of BAM files, for each file runs the Qualimap ‘rnaseq’ module (http://qualimap.conesalab.org/doc_html/command_line.html#rna-seq-qc)

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, feature_file, out_dir, bam_properties)

Initialise the RunQualimapRnaseq task

Parameters:

bam_files (list) – list of paths to BAM files to run Qualimap rnaseq on
feature_file (str) – path to GTF file with the reference annotation data
out_dir (str) – path to a directory where the output files will be written
bam_properties (mapping) – properties for each BAM file from RSeQC ‘infer_experiment.py’ (used to determine if BAM is paired and what the strand-specificity is)
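
To illustrate how the bam_properties for a single BAM file could drive the Qualimap options, the sketch below picks a sequencing protocol from the ‘forward’ and ‘reverse’ fractions and adds the paired-end flag when ‘paired_end’ is True. The function name and the 0.8 threshold are assumptions made for this example; the documentation does not state how the task itself makes this decision.

```python
def qualimap_rnaseq_command(bam_file, feature_file, out_dir, properties,
                            threshold=0.8):
    """Build a 'qualimap rnaseq' command for one BAM file.

    Hypothetical helper; the threshold is an illustrative assumption.
    """
    if properties["forward"] >= threshold:
        protocol = "strand-specific-forward"
    elif properties["reverse"] >= threshold:
        protocol = "strand-specific-reverse"
    else:
        protocol = "non-strand-specific"
    cmd = ["qualimap", "rnaseq",
           "-bam", bam_file,
           "-gtf", feature_file,
           "-outdir", out_dir,
           "-p", protocol]
    if properties["paired_end"]:
        # Tell Qualimap to treat the alignments as paired-end
        cmd.append("--paired")
    return cmd
```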

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunRSeQCGenebodyCoverage(_name, *args, **kws)

Run RSeQC’s ‘genebody_coverage.py’ on BAM files

Given a collection of BAM files, runs the RSeQC ‘genebody_coverage.py’ utility (http://rseqc.sourceforge.net/#genebody-coverage-py).

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, reference_gene_model, out_dir, name='rseqc')

Initialise the RunRSeQCGenebodyCoverage task

Parameters:

bam_files (list) – list of paths to BAM files to run genebody_coverage.py on
reference_gene_model (str) – path to BED file with the reference gene model data
out_dir (str) – path to a directory where the output files will be written
name (str) – optional basename for the output files (defaults to ‘rseqc’)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunRSeQCInferExperiment(_name, *args, **kws)

Run RSeQC’s ‘infer_experiment.py’ on BAM files

Given a list of BAM files, for each file runs the RSeQC ‘infer_experiment.py’ utility (http://rseqc.sourceforge.net/#infer-experiment-py).

The log for each run is written to a file called ‘<BASENAME>.infer_experiment.log’; the data are also extracted and put into an output parameter for direct consumption by downstream tasks.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, reference_gene_model, out_dir)

Initialise the RunRSeQCInferExperiment task

Parameters:

bam_files (list) – list of paths to BAM files to run infer_experiment.py on
reference_gene_model (str) – path to BED file with the reference gene model data
out_dir (str) – path to a directory where the output files will be written

Outputs:

experiments: a dictionary with BAM files as keys; each value is another dictionary with keys ‘paired_end’ (True for paired-end data, False for single-end), ‘reverse’, ‘forward’ and ‘unstranded’ (fractions of reads mapped in each configuration).
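
For illustration, the sketch below parses the text reported by infer_experiment.py into a dictionary with the four keys listed above. The function name is hypothetical, and the mapping of report lines to keys is an assumption: the ‘failed to determine’ fraction is stored as ‘unstranded’, the "1++,1--,2+-,2-+" (or "++,--") fraction as ‘forward’, and the "1+-,1-+,2++,2--" (or "+-,-+") fraction as ‘reverse’. The task's own extraction logic may differ.

```python
# Strand signatures as they appear (quoted) in infer_experiment.py output
FORWARD = ('"1++,1--,2+-,2-+"', '"++,--"')
REVERSE = ('"1+-,1-+,2++,2--"', '"+-,-+"')

def parse_infer_experiment(log_text):
    """Parse infer_experiment.py output into a dictionary.

    Hypothetical helper; the key assignments are assumptions.
    """
    result = {"paired_end": None, "forward": None,
              "reverse": None, "unstranded": None}
    for line in log_text.splitlines():
        line = line.strip()
        if line == "This is PairEnd Data":
            result["paired_end"] = True
        elif line == "This is SingleEnd Data":
            result["paired_end"] = False
        elif line.startswith("Fraction of reads"):
            # Split on the last colon: label on the left, value on the right
            label, value = line.rsplit(":", 1)
            if "failed to determine" in label:
                key = "unstranded"
            elif label.endswith(FORWARD):
                key = "forward"
            elif label.endswith(REVERSE):
                key = "reverse"
            else:
                continue
            result[key] = float(value)
    return result
```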

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetCellCountFromCellranger(_name, *args, **kws)

Update the number of cells in the project metadata from ‘cellranger count’ or ‘cellranger multi’ output

init(project, qc_dir=None, source='count')

Initialise the SetCellCountFromCellranger task.

Parameters:

project (AnalysisProject) – project to update the number of cells for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
source (str) – either ‘count’ (the default) or ‘multi’

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetupFastqStrandConf(_name, *args, **kws)

Set up a fastq_strand.conf file

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir=None, organism=None, star_indexes=None)

Initialise the SetupFastqStrandConf task.

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – if supplied then points to directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
organism (str) – if supplied then must be a string with the names of one or more organisms, with multiple organisms separated by spaces (defaults to the organisms associated with the project)
star_indexes (dict) – dictionary mapping normalised organism names to STAR indexes

Outputs:

fastq_strand_conf (PipelineParam): pipeline parameter instance that resolves to a string with the path to the generated config file.
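
The relationship between the organism string and the star_indexes dictionary can be sketched as follows. This example assumes that each space-separated name is normalised by lower-casing, that organisms without a STAR index are skipped, and that the config file holds one tab-separated "name, index path" pair per line; all three are assumptions made for illustration, and the function name is hypothetical.

```python
def fastq_strand_conf_text(organism, star_indexes):
    """Return config file text for the organisms that have STAR indexes.

    Hypothetical helper; normalisation and file layout are assumptions.
    """
    lines = []
    for name in organism.split():
        key = name.lower()
        if key in star_indexes:
            lines.append("%s\t%s" % (key, star_indexes[key]))
    if not lines:
        # No matching indexes: nothing to write
        return ""
    return "\n".join(lines) + "\n"
```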

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetupQCDirs(_name, *args, **kws)

Set up the directories for the QC run

init(project, qc_dir, log_dir=None, protocol=None)

Initialise the SetupQCDirs task

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
log_dir (str) – directory for log files (defaults to ‘logs’ subdirectory of the QC directory)
protocol (QCProtocol) – QC protocol being used

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SplitFastqsByLane(_name, *args, **kws)

Split reads into multiple Fastqs according to lane

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, out_dir)

Initialise the SplitFastqsByLane task

Parameters:

project (AnalysisProject) – project with source Fastqs to split by lane
out_dir (str) – path to directory where split Fastqs will be written

Outputs:

fastqs (list): list of paths to output Fastqs split by lanes
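
The idea of splitting by lane can be sketched on in-memory Fastq lines. The example assumes standard Illumina read headers, in which the lane is the fourth colon-separated field (@instrument:run:flowcell:lane:tile:x:y); the function name is hypothetical and the task itself writes Fastq files rather than returning records.

```python
from collections import defaultdict
from itertools import islice

def split_fastq_records_by_lane(lines):
    """Group 4-line Fastq records by the lane in each read header.

    Hypothetical helper for illustration only.
    """
    by_lane = defaultdict(list)
    iterator = iter(lines)
    while True:
        # Consume the input one 4-line record at a time
        record = list(islice(iterator, 4))
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated Fastq record")
        # Header layout: @instrument:run:flowcell:lane:tile:x:y ...
        lane = int(record[0].split(":")[3])
        by_lane[lane].append(record)
    return dict(by_lane)
```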

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.UpdateQCMetadata(_name, *args, **kws)

Update the metadata stored for this QC run

init(project, qc_dir, metadata, legacy_screens=False)

Initialise the UpdateQCMetadata task

Parameters:

project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
metadata (dict) – mapping of metadata items to values
legacy_screens (bool) – if True then ‘legacy’ naming convention was used for FastqScreen outputs

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.VerifyFastqs(_name, *args, **kws)

Check Fastqs are valid

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, fastq_attrs=None)

Initialise the VerifyFastqs task

Parameters:

project (AnalysisProject) – project with Fastqs to check
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
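
The documentation does not say which checks the task applies. As a rough illustration of what a basic validity check on Fastq content can look like, the sketch below confirms that the lines form complete four-line records, that each header starts with ‘@’ and each separator with ‘+’, and that sequence and quality strings have equal length. The function name and the choice of checks are assumptions made for this example.

```python
from itertools import islice

def fastq_is_valid(lines):
    """Return True if lines form complete, well-formed Fastq records.

    Hypothetical helper; the checks are illustrative assumptions.
    """
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            # Reached the end without finding a problem
            return True
        if len(record) != 4:
            return False
        header, seq, sep, qual = record
        if not header.startswith("@") or not sep.startswith("+"):
            return False
        if len(seq) != len(qual):
            return False
```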

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.VerifyQC(_name, *args, **kws)

Verify outputs from the QC pipeline

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, protocol, fastqs)

Initialise the VerifyQC task.

Parameters:

project (AnalysisProject) – project to verify the QC outputs for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
protocol (QCProtocol) – QC protocol to verify against
fastqs (list) – Fastqs to include in the verification

setup()

Set up commands to be performed by the task

Must be implemented by the subclass
