auto_process_ngs.qc.pipeline

Pipeline components for running the QC pipeline.

Pipeline classes:

  • QCPipeline

Pipeline task classes:

  • SetupQCDirs

  • SplitFastqsByLane

  • GetSequenceDataSamples

  • GetSequenceDataFastqs

  • UpdateQCMetadata

  • VerifyFastqs

  • GetSeqLengthStats

  • CheckFastqScreenOutputs

  • RunFastqScreen

  • CheckFastQCOutputs

  • RunFastQC

  • SetupFastqStrandConf

  • CheckFastqStrandOutputs

  • RunFastqStrand

  • DetermineRequired10xPackage

  • GetCellrangerReferenceData

  • MakeCellrangerArcCountLibraries

  • GetCellrangerMultiConfig

  • CheckCellrangerCountOutputs

  • RunCellrangerCount

  • RunCellrangerMulti

  • SetCellCountFromCellranger

  • GetReferenceDataset

  • GetBAMFiles

  • RunRSeQCGenebodyCoverage

  • RunPicardCollectInsertSizeMetrics

  • CollateInsertSizes

  • ConvertGTFToBed

  • RunRSeQCInferExperiment

  • RunQualimapRnaseq

  • ReportQC

Also imports the following pipeline tasks:

  • Get10xPackage

class auto_process_ngs.qc.pipeline.CheckCellrangerCountOutputs(_name, *args, **kws)

Check the outputs from cellranger(-atac) count

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, fastq_dir=None, samples=None, qc_dir=None, qc_module=None, extra_projects=None, cellranger_version=None, cellranger_ref_data=None, verbose=False)

Initialise the CheckCellrangerCountOutputs task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • fastq_dir (str) – directory holding Fastq files

  • samples (list) – list of samples to restrict checks to (all samples in project are checked by default)

  • qc_dir (str) – top-level QC directory to look for ‘count’ QC outputs (e.g. metrics CSV and summary HTML files)

  • qc_module (str) – QC protocol being used

  • extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline

  • cellranger_version (str) – version number of 10xGenomics package

  • cellranger_ref_data (str) – name or path to reference dataset for single library analysis

  • verbose (bool) – if True then print additional information from the task

Outputs:
  • fastq_dir (PipelineParam) – pipeline parameter instance that resolves to a string with the path to directory with Fastq files

  • samples (list) – list of sample names that have missing outputs from ‘cellranger count’

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CheckFastQCOutputs(_name, *args, **kws)

Check the outputs from FastQC

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, read_numbers, fastqs=None, verbose=False)

Initialise the CheckFastQCOutputs task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • read_numbers (list) – list of read numbers to include

  • fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)

  • verbose (bool) – if True then print additional information from the task

Outputs:
  • fastqs (list) – list of Fastqs that have missing FastQC outputs under the specified QC protocol
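
The check described above can be sketched as follows. This is a minimal illustration and not the task's actual implementation: the helper names `fastqc_output_names` and `fastqs_missing_fastqc` are hypothetical, and the sketch assumes only the standard FastQC naming convention, where each input Fastq produces a `<name>_fastqc.html` report and a `<name>_fastqc.zip` archive.

```python
import os

def fastqc_output_names(fastq):
    """Return the (html, zip) file names FastQC writes for a Fastq file."""
    name = os.path.basename(fastq)
    # Strip compression and Fastq extensions (e.g. ".fastq.gz", ".fq")
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    return (name + "_fastqc.html", name + "_fastqc.zip")

def fastqs_missing_fastqc(fastqs, qc_dir):
    """Return the Fastqs with at least one FastQC output missing from qc_dir."""
    missing = []
    for fq in fastqs:
        outputs = fastqc_output_names(fq)
        if not all(os.path.exists(os.path.join(qc_dir, f)) for f in outputs):
            missing.append(fq)
    return missing
```

A list produced this way is the kind of value that a downstream task such as RunFastQC would take as its `fastqs` argument, so that FastQC is only run where outputs are absent.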

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CheckFastqScreenOutputs(_name, *args, **kws)

Check the outputs from FastqScreen

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, screens, fastqs=None, read_numbers=None, include_samples=None, fastq_attrs=None, legacy=False, verbose=False)

Initialise the CheckFastqScreenOutputs task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • screens (mapping) – mapping of screen names to FastqScreen conf files

  • fastqs (list) – explicit list of Fastq files to check against (default is to use Fastqs from supplied analysis project)

  • read_numbers (list) – read numbers to include

  • include_samples (list) – optional, list of sample names to include

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

  • legacy (bool) – if True then use ‘legacy’ naming convention for output files (default is to use new format)

  • verbose (bool) – if True then print additional information from the task

Outputs:
  • fastqs (list) – list of Fastqs that have missing FastqScreen outputs under the specified QC protocol

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CheckFastqStrandOutputs(_name, *args, **kws)

Check the outputs from the fastq_strand.py utility

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, fastq_strand_conf, fastqs=None, read_numbers=None, include_samples=None, verbose=False)

Initialise the CheckFastqStrandOutputs task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • fastq_strand_conf (str) – path to the fastq_strand config file

  • fastqs (list) – explicit list of Fastq files to check against (default is to use Fastqs from supplied analysis project)

  • read_numbers (list) – list of read numbers to include when checking outputs

  • include_samples (list) – optional, list of sample names to include

  • verbose (bool) – if True then print additional information from the task

Outputs:
  • fastq_pairs (list) – list of tuples with Fastq “pairs” that have missing outputs from fastq_strand.py under the specified QC protocol. A “pair” may be an (R1,R2) tuple, or a single Fastq (e.g. (fq,)).
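
To make the shape of the `fastq_pairs` output concrete, here is a small sketch of grouping Fastq names into such tuples. The function `pair_fastqs` is hypothetical and assumes Illumina-style file names where the read number appears as `_R1`/`_R2` followed by `_` or `.`; the real task may identify reads differently.

```python
import re

def pair_fastqs(fastqs):
    """Group Fastq names into (R1, R2) tuples, or (fq,) for unpaired files."""
    r1s = sorted(fq for fq in fastqs if re.search(r"_R1(_|\.)", fq))
    others = set(fastqs) - set(r1s)
    pairs = []
    for r1 in r1s:
        # Derive the expected R2 name from the R1 name
        r2 = re.sub(r"_R1(_|\.)", r"_R2\1", r1, count=1)
        if r2 in others:
            pairs.append((r1, r2))
            others.discard(r2)
        else:
            pairs.append((r1,))
    # Anything left over has no R1 partner and is kept as a single
    pairs.extend((fq,) for fq in sorted(others))
    return pairs
```

Single-element tuples keep the data structure uniform, so a consumer can iterate over every “pair” the same way regardless of whether the data are single- or paired-end.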

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.CollateInsertSizes(_name, *args, **kws)

Collate insert size metrics data from multiple BAMs

Gathers together the Picard insert size data from a set of BAM files and puts them into a single TSV file.

init(bam_files, picard_out_dir, out_file, delimiter='\t')

Initialise the CollateInsertSizes task

Parameters:
  • bam_files (list) – list of paths to BAM files to get associated insert size data for

  • picard_out_dir (str) – path to the directory containing the Picard CollectInsertSizeMetrics output files

  • out_file (str) – path to the output TSV file

  • delimiter (str) – specify the delimiter to use in the output file
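
The collation step can be sketched roughly as below. This is an illustration under stated assumptions and not the task's code: the function names, the `Bam file` column header and the default choice of fields are all hypothetical, and the metrics text is passed in directly because the naming of the per-BAM Picard output files is not specified here. The parser relies on the standard Picard metrics layout, in which the `## METRICS CLASS` line is followed by a tab-separated header line and a row of values.

```python
import csv
import io

def parse_insert_size_metrics(text):
    """Return the first metrics row of a Picard metrics file as a dict."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")

def collate_insert_sizes(metrics_by_bam, fp, delimiter="\t",
                         fields=("MEAN_INSERT_SIZE", "STANDARD_DEVIATION",
                                 "MEDIAN_INSERT_SIZE",
                                 "MEDIAN_ABSOLUTE_DEVIATION")):
    """Write one line per BAM file with the selected metrics to fp."""
    writer = csv.writer(fp, delimiter=delimiter, lineterminator="\n")
    writer.writerow(("Bam file",) + tuple(fields))
    for bam in sorted(metrics_by_bam):
        metrics = parse_insert_size_metrics(metrics_by_bam[bam])
        writer.writerow([bam] + [metrics[f] for f in fields])
```

Passing an open file object rather than a path keeps the sketch easy to exercise with `io.StringIO`.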

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.ConvertGTFToBed(_name, *args, **kws)

Convert a GTF file to a BED file using BEDOPS ‘gtf2bed’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(gtf_in, bed_out)

Initialise the ConvertGTFToBed task

Parameters:
  • gtf_in (str) – path to the input GTF file

  • bed_out (str) – path to the output BED file
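
For orientation, BEDOPS ‘gtf2bed’ reads GTF data on standard input and writes BED data to standard output, so a conversion is typically expressed with shell redirection. The sketch below builds such a command string; the function name `gtf2bed_command` is hypothetical, and the actual task may add further steps around the conversion.

```python
import shlex

def gtf2bed_command(gtf_in, bed_out):
    """Build a shell command converting a GTF file to BED with gtf2bed."""
    # Quote the paths so that names containing spaces survive the shell
    return "gtf2bed < {} > {}".format(shlex.quote(gtf_in),
                                      shlex.quote(bed_out))
```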

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.DetermineRequired10xPackage(_name, *args, **kws)

Determine which 10xGenomics software package is required

By default determines the package name based on the supplied QC module, but this can be overridden by explicitly supplying a required package (which can also be a path to an executable).

The output ‘require_cellranger’ parameter should be supplied to the ‘Get10xPackage’ task, which will do the job of actually locating an executable.

init(qc_module, require_cellranger=None)

Initialise the DetermineRequired10xPackage task

Parameters:
  • qc_module (str) – QC module being used

  • require_cellranger (str) – optional package name or path to an executable; if supplied then overrides the automatic package determination

Outputs:
  • require_cellranger (PipelineParam) – the 10xGenomics software package name or path to use

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetBAMFiles(_name, *args, **kws)

Create BAM files from Fastqs using STAR

Runs STAR to generate BAM files from Fastq files. The BAMs are then sorted and indexed using samtools.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(fastqs, star_index, out_dir, subset_size=None, nthreads=None, reads=None, include_samples=None, fastq_attrs=None, verbose=False)

Initialise the GetBAMFiles task

Parameters:
  • fastqs (list) – list of Fastq files to generate BAM files from

  • star_index (str) – path to STAR index to use

  • out_dir (str) – path to directory to write final BAM files to

  • subset_size (int) – specify size of a random subset of reads to use in BAM file generation

  • nthreads (int) – number of cores for STAR to use

  • reads (list) – optional, list of read numbers to include (e.g. [1,2], [2] etc)

  • include_samples (list) – optional, list of sample names to include

  • fastq_attrs (IlluminaFastq) – optional, class to use for extracting information from Fastq file names

  • verbose (bool) – if True then print additional information from the task

Outputs:
  • bam_files – list of sorted BAM files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetCellrangerMultiConfig(_name, *args, **kws)

Locate ‘config.csv’ files for cellranger multi

init(project, qc_dir)

Initialise the GetCellrangerMultiConfig task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – top-level QC directory to put ‘config.csv’ files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetCellrangerReferenceData(_name, *args, **kws)

init(project, organism=None, transcriptomes=None, premrna_references=None, atac_references=None, multiome_references=None, cellranger_exe=None, cellranger_version=None, force_reference_data=None)

Initialise the GetCellrangerReferenceData task

Parameters:
  • project (AnalysisProject) – project to run QC for

  • organism (str) – if supplied then must be a string with the names of one or more organisms, with multiple organisms separated by spaces (defaults to the organisms associated with the project)

  • transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger

  • premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger

  • atac_references (mapping) – mapping of organism names to reference genome data for cellranger-atac

  • multiome_references (mapping) – mapping of organism names to reference datasets for cellranger-arc

  • cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)

  • cellranger_version (str) – the version string for the Cellranger package

  • force_reference_data (str) – if supplied then will be used as the reference dataset, instead of trying to locate appropriate reference data automatically

Outputs:
  • reference_data_path (PipelineParam) – pipeline parameter instance which resolves to a string with the path to the reference data set corresponding to the supplied organism.

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetReferenceDataset(_name, *args, **kws)

Acquire reference data for an organism from mapping

Generic lookup task which attempts to locate the matching reference dataset from a mapping/dictionary.

init(organism, references, force_reference=None)

Initialise the GetReferenceDataset task

Parameters:
  • organism (str) – name of the organism

  • references (mapping) – mapping with organism names as keys and reference datasets as corresponding values

  • force_reference (str) – if specified then return the supplied value instead of determining from the organism

Outputs:
  • reference_dataset – reference dataset (set to None if no dataset could be located)
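
The lookup behaviour documented above amounts to the following sketch. The standalone function `get_reference_dataset` is hypothetical and uses an exact key match; whether the real task normalises organism names before the lookup is not stated here.

```python
def get_reference_dataset(organism, references, force_reference=None):
    """Return the reference dataset for an organism, or None if not found."""
    # An explicitly supplied reference always takes precedence
    if force_reference is not None:
        return force_reference
    if not organism or not references:
        return None
    # dict.get returns None when the organism has no entry
    return references.get(organism)
```

Returning None instead of raising an error lets a downstream task decide for itself whether a missing reference means "skip this step" or "fail".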

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSeqLengthStats(_name, *args, **kws)

Get data on sequence lengths, masking and padding for Fastqs in a project, and write the data to JSON files.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, read_numbers=None, fastqs=None, fastq_attrs=None)

Initialise the GetSeqLengthStats task

Parameters:
  • project (AnalysisProject) – project with Fastqs to get the sequence length data from

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • read_numbers (sequence) – list of read numbers to include (or None to include all reads)

  • fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSequenceDataFastqs(_name, *args, **kws)

Set up Fastqs with sequence (i.e. biological) data

init(project, out_dir, read_range, samples, fastq_attrs, fastqs=None)

Initialise the GetSequenceDataFastqs task

Parameters:
  • project (AnalysisProject) – project to get Fastqs for

  • out_dir (str) – path to directory to write final Fastq files to

  • read_range (dict) – mapping of read names to tuples of subsequence ranges

  • samples (list) – list of samples with sequence data

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

  • fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)

Outputs:
  • fastqs (list) – list of Fastqs with biological data
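
As a rough illustration of what a per-read subsequence range implies, the sketch below trims a single read's sequence and quality strings. Everything about it is an assumption made for illustration: the function name `extract_subsequence` is hypothetical, and the range is taken to be a `(start, end)` tuple of 1-based inclusive positions with None meaning "unbounded", which may not match the convention the task uses.

```python
def extract_subsequence(seq, qual, read_range):
    """Trim sequence and quality strings to a (start, end) range.

    Assumes 1-based inclusive coordinates; None means no bound.
    """
    start, end = read_range
    lo = (start - 1) if start else 0
    hi = end if end else len(seq)
    return seq[lo:hi], qual[lo:hi]
```

Sequence and quality are trimmed together so that the resulting Fastq record stays valid, with one quality character per base.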

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSequenceDataSamples(_name, *args, **kws)

Identify samples with sequence (i.e. biological) data

init(project, fastq_attrs)

Initialise the GetSequenceDataSamples task

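Parameters:
  • project (AnalysisProject) – project to identify samples with sequence data in

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
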
Outputs:
  • seq_data_samples (list) – list of samples with biological data

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.MakeCellrangerArcCountLibraries(_name, *args, **kws)

Make ‘libraries.csv’ files for cellranger-arc count

init(project, qc_dir)

Initialise the MakeCellrangerArcCountLibraries task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – top-level QC directory to put ‘libraries.csv’ files

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.QCPipeline

Run the QC pipeline on one or more projects

Pipeline to run QC on multiple projects.

Example usage:

>>> qc = QCPipeline()
>>> qc.add_project(AnalysisProject("AB","./AB"))
>>> qc.add_project(AnalysisProject("CDE","./CDE"))
>>> qc.run()

add_cellranger_count(project_name, project, qc_dir, organism, fastq_dir, qc_module_name, library_type, chemistry, force_cells, samples=None, fastq_dirs=None, reference_dataset=None, extra_projects=None, required_tasks=None)

Add tasks to pipeline to run ‘cellranger* count’

Parameters:
  • project_name (str) – name to associate with project for reporting tasks

  • project (AnalysisProject) – project to run 10x cellranger pipeline within

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • organism (str) – organism for pipeline

  • fastq_dir (str) – directory holding Fastq files

  • qc_module_name (str) – QC module being used

  • library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)

  • chemistry (str) – chemistry to use in single library analysis

  • force_cells (int) – if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)

  • samples (list) – optional, list of samples to restrict single library analyses to (or None to use all samples in project)

  • fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the paths set by the ‘fastq_dir’ argument

  • reference_dataset (str) – optional, path to reference dataset (otherwise will be determined automatically based on organism)

  • extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline

  • required_tasks (list) – list of tasks that the cellranger pipeline should wait for

add_project(project, protocol, qc_dir=None, organism=None, fastq_dir=None, report_html=None, multiqc=False, sample_pattern=None, log_dir=None, convert_gtf=True, verify_fastqs=False, split_fastqs_by_lane=False)

Add a project to the QC pipeline

Parameters:
  • project (AnalysisProject) – project to run QC for

  • protocol (QCProtocol) – QC protocol to use

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • organism (str) – organism(s) for project (defaults to organism defined in project metadata)

  • fastq_dir (str) – directory holding Fastq files (defaults to primary fastq_dir in project)

  • report_html (str) – set the name of the output HTML file for the report

  • multiqc (bool) – if True then also run MultiQC (default is not to run MultiQC)

  • sample_pattern (str) – glob-style pattern to match a subset of projects and samples (not implemented)

  • log_dir (str) – directory to write log files to (defaults to ‘logs’ subdirectory of the QC directory)

  • convert_gtf (bool) – if True then convert GTF files to BED for ‘infer_experiment.py’ (default; otherwise only use the explicitly defined BED files)

  • verify_fastqs (bool) – if True then verify Fastq integrity as part of the pipeline (default: False, skip verification)

  • split_fastqs_by_lane (bool) – if True then split input Fastqs into lanes and run QC as per-lane (default: False, don’t split QC by lanes)

add_task(task, requires=(), **kws)

Override base class method

Automatically set log dir when tasks are added

property default_log_dir

Return current value of default log dir

run(nthreads=None, fastq_screens=None, star_indexes=None, annotation_bed_files=None, annotation_gtf_files=None, fastq_subset=None, cellranger_chemistry='auto', cellranger_force_cells=None, cellranger_transcriptomes=None, cellranger_premrna_references=None, cellranger_atac_references=None, cellranger_arc_references=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, cellranger_exe=None, cellranger_extra_projects=None, cellranger_reference_dataset=None, cellranger_out_dir=None, force_star_index=None, force_gtf_annotation=None, working_dir=None, log_file=None, batch_size=None, batch_limit=None, max_jobs=1, max_slots=None, poll_interval=5, runners=None, default_runner=None, enable_conda=False, conda=None, conda_env_dir=None, envmodules=None, legacy_screens=False, verbose=False)

Run the tasks in the pipeline

Parameters:
  • nthreads (int) – number of threads/processors to use for QC jobs (defaults to number of slots set in job runners)

  • fastq_screens (dict) – mapping of screen IDs to FastqScreen conf files

  • star_indexes (dict) – mapping of organism IDs to directories with STAR indexes

  • annotation_bed_files (dict) – mapping of organism IDs to BED files with annotation data

  • annotation_gtf_files (dict) – mapping of organism IDs to GTF files with annotation data

  • fastq_subset (int) – explicitly specify the subset size to use when subsetting Fastqs

  • cellranger_chemistry (str) – explicitly specify the assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)

  • cellranger_force_cells (int) – explicitly specify number of cells for ‘cellranger’ and ‘cellranger-atac’ (set to ‘None’ to use the cell detection algorithm; ignored for ‘cellranger-arc’)

  • cellranger_transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger

  • cellranger_premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger

  • cellranger_atac_references (mapping) – mapping of organism names to ATAC-seq reference genome data for cellranger-atac

  • cellranger_arc_references (mapping) – mapping of organism names to multiome reference datasets for cellranger-arc

  • cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)

  • cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)

  • cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)

  • cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)

  • cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (default: None)

  • cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

  • cellranger_exe (str) – optional, explicitly specify the cellranger executable to use for single library analysis (default: cellranger executable is determined automatically)

  • cellranger_extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline

  • cellranger_reference_dataset (str) – optional, explicitly specify the path to the reference dataset to use for single library analysis (default: reference dataset is determined automatically)

  • cellranger_out_dir (str) – specify directory to put full cellranger outputs into (default: project directory)

  • force_star_index (str) – explicitly specify STAR index to use (default: index is determined automatically)

  • force_gtf_annotation (str) – explicitly specify GTF annotation to use (default: annotation file is determined automatically)

  • working_dir (str) – optional path to a working directory (defaults to temporary directory in the current directory)

  • log_dir (str) – path of directory where log files will be written to

  • batch_size (int) – if set then run commands in each task in batches, with each batch running this many commands at a time (default is to run one command per job)

  • batch_limit (int) – if set then run commands in each task in batches, with the batch size set dynamically so as not to exceed this limit (default is to use fixed batch sizes)

  • max_jobs (int) – optional maximum number of concurrent jobs in scheduler (defaults to 1)

  • max_slots (int) – optional maximum number of ‘slots’ (i.e. concurrent threads or maximum number of CPUs) available to the scheduler (defaults to no limit)

  • poll_interval (float) – optional polling interval (seconds) to set in scheduler (defaults to 5s)

  • runners (dict) – mapping of names to JobRunner instances; valid names are ‘fastqc_runner’, ‘fastq_screen_runner’, ‘star_runner’, ‘rseqc_runner’, ‘qualimap_runner’, ‘cellranger_count_runner’, ‘cellranger_multi_runner’, ‘report_runner’, ‘verify_runner’, and ‘default’

  • enable_conda (bool) – if True then enable use of conda environments to satisfy task dependencies

  • conda (str) – path to conda

  • conda_env_dir (str) – path to non-default directory for conda environments

  • envmodules (mapping) – mapping of names to environment module file lists; valid names are ‘fastqc’, ‘fastq_screen’, ‘fastq_strand’, ‘cellranger’, ‘report_qc’

  • default_runner (JobRunner) – optional default job runner to use

  • legacy_screens (bool) – if True then use ‘legacy’ naming convention for FastqScreen outputs

  • verbose (bool) – if True then report additional information for diagnostics
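
The difference between the ‘batch_size’ and ‘batch_limit’ options described above can be illustrated with a small sketch. It is not the pipeline's implementation: the function `make_batches` is hypothetical, and it assumes that ‘batch_limit’ is a cap on the number of batches (so the batch size becomes the command count divided by the limit, rounded up) and that the limit takes precedence if both options are given.

```python
import math

def make_batches(commands, batch_size=None, batch_limit=None):
    """Split commands into batches of fixed or dynamically chosen size."""
    commands = list(commands)
    if batch_limit and commands:
        # Assumed semantics: pick the smallest batch size that keeps
        # the number of batches at or below the limit
        batch_size = math.ceil(len(commands) / batch_limit)
    if not batch_size:
        # Default: one command per job
        batch_size = 1
    return [commands[i:i + batch_size]
            for i in range(0, len(commands), batch_size)]
```

With five commands, a fixed `batch_size=2` gives three batches, whereas `batch_limit=2` gives two batches of sizes three and two.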

set_default_log_dir(log_dir)

Set the default log directory for tasks

class auto_process_ngs.qc.pipeline.ReportQC(_name, *args, **kws)

Generate the QC report

init(project, qc_dir, report_html=None, fastq_dir=None, multiqc=False, force=False, zip_outputs=True)

Initialise the ReportQC task.

Parameters:
  • project (AnalysisProject) – project to generate QC report for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • report_html (str) – set the name of the output HTML file for the report

  • fastq_dir (str) – directory holding Fastq files (defaults to current fastq_dir in project)

  • multiqc (bool) – if True then also generate MultiQC report (default: don’t run MultiQC)

  • force (bool) – if True then force HTML report to be generated even if QC outputs fail verification (default: don’t write report)

  • zip_outputs (bool) – if True then also generate a ZIP archive of the QC reports

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunCellrangerCount(_name, *args, **kws)

Run ‘cellranger count’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(samples, fastq_dir, reference_data_path, library_type, out_dir, qc_dir=None, cellranger_exe=None, cellranger_version=None, chemistry='auto', fastq_dirs=None, force_cells=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None)

Initialise the RunCellrangerCount task.

Parameters:
  • samples (list) – list of sample names to run cellranger count on (it is expected that this list will come from the CheckCellrangerCountOutputs task)

  • fastq_dir (str) – path to directory holding the Fastq files

  • reference_data_path (str) – path to the cellranger compatible reference dataset

  • library_type (str) – type of data being analysed (e.g. ‘scRNA-seq’)

  • out_dir (str) – top-level directory to copy all final ‘count’ outputs into. Outputs won’t be copied if no value is supplied

  • qc_dir (str) – top-level QC directory to put ‘count’ QC outputs (e.g. metrics CSV and summary HTML files) into. Outputs won’t be copied if no value is supplied

  • cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)

  • cellranger_version (str) – the version string for the Cellranger package

  • fastq_dirs (dict) – optional, a dictionary mapping sample names to Fastq directories which will be used to override the paths set by the ‘fastq_dir’ argument

  • force_cells (int) – optional, if set then bypasses the cell detection algorithm in ‘cellranger’ and ‘cellranger-atac’ using the ‘--force-cells’ option (does nothing for ‘cellranger-arc’)

  • chemistry (str) – assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)

  • cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)

  • cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)

  • cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)

  • cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)

  • cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (defaults to number of slots set in runner)

  • cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunCellrangerMulti(_name, *args, **kws)

Run ‘cellranger multi’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, config_csvs, samples, reference_data_path, out_dir, qc_dir=None, cellranger_exe=None, cellranger_version=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, working_dir=None)

Initialise the RunCellrangerMulti task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • config_csvs (list) – list of paths to ‘cellranger multi’ configuration files

  • samples (list) – list of sample names from the config.csv file

  • reference_data_path (str) – path to the cellranger compatible reference dataset from the config.csv file

  • out_dir (str) – top-level directory to copy all final ‘count’ outputs into. Outputs won’t be copied if no value is supplied

  • qc_dir (str) – top-level QC directory to put ‘count’ QC outputs (e.g. metrics CSV and summary HTML files) into. Outputs won’t be copied if no value is supplied

  • cellranger_exe (str) – the path to the Cellranger software package to use (e.g. ‘cellranger’, ‘cellranger-atac’, ‘spaceranger’)

  • cellranger_version (str) – the version string for the Cellranger package

  • cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)

  • cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)

  • cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)

  • cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)

  • cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (defaults to number of slots set in runner)

  • cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunFastQC(_name, *args, **kws)

Run FastQC

init(fastqs, qc_dir, nthreads=None)

Initialise the RunFastQC task.

Parameters:
  • fastqs (list) – list of paths to Fastq files to run FastQC on (it is expected that this list will come from the CheckFastQCOutputs task)

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • nthreads (int) – number of threads/processors to use (defaults to number of slots set in runner)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunFastqScreen(_name, *args, **kws)

Run FastqScreen

init(fastqs, qc_dir, screens, subset=None, nthreads=None, read_numbers=None, fastq_attrs=None, legacy=False)

Initialise the RunFastqScreen task.

Parameters:
  • fastqs (list) – list of paths to Fastq files to run Fastq Screen on (it is expected that this list will come from the CheckFastqScreenOutputs task)

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • screens (mapping) – mapping of screen names to FastqScreen conf files

  • subset (int) – explicitly specify the subset size for running Fastq Screen

  • nthreads (int) – number of threads/processors to use (defaults to number of slots set in runner)

  • read_numbers (list) – list of read numbers to include when running Fastq Screen

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

  • legacy (bool) – if True then use ‘legacy’ naming convention for output files (default is to use new format)
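As a rough illustration of how these parameters relate to the underlying program, the sketch below builds one fastq_screen command per Fastq file and screen, using the --conf, --threads, --outdir and --subset options of fastq_screen. The helper `fastq_screen_commands` is hypothetical; the commands the task actually generates (including output naming and the handling of read_numbers and legacy) are not shown here.

```python
def fastq_screen_commands(fastqs, qc_dir, screens, subset=None, nthreads=1):
    """Hypothetical helper: one fastq_screen command per Fastq and screen."""
    cmds = []
    for fastq in fastqs:
        # 'screens' maps screen names to fastq_screen conf files
        for name in sorted(screens):
            cmd = ["fastq_screen",
                   "--conf", screens[name],
                   "--threads", str(nthreads),
                   "--outdir", qc_dir]
            if subset is not None:
                # Only pass a subset size when one was explicitly given
                cmd.extend(["--subset", str(subset)])
            cmd.append(fastq)
            cmds.append(cmd)
    return cmds
```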

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunFastqStrand(_name, *args, **kws)

Run the fastq_strand.py utility

init(fastq_pairs, qc_dir, fastq_strand_conf, fastq_strand_subset=None, nthreads=None)

Initialise the RunFastqStrand task.

Parameters:
  • fastq_pairs (list) – list of tuples with “pairs” of Fastq files to run fastq_strand.py on (it is expected that this list will come from the CheckFastqStrandOutputs task)

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • fastq_strand_conf (str) – path to the fastq_strand config file to use

  • fastq_strand_subset (int) – explicitly specify the subset size for running fastq_strand

  • nthreads (int) – number of threads/processors to use (defaults to number of slots set in job runner)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunPicardCollectInsertSizeMetrics(_name, *args, **kws)

Run Picard ‘CollectInsertSizeMetrics’ on BAM files

Given a list of BAM files, for each file first runs the Picard ‘CleanSam’ utility (to remove alignments that would otherwise cause problems for the insert size calculations) and then ‘CollectInsertSizeMetrics’ to generate the insert size metrics.

Note that this task should only be run on BAM files with paired-end data.
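The two-step procedure described above can be sketched as follows. This is an illustration only: the helper `picard_insert_size_commands` and the output file names are hypothetical, and it assumes a `picard` wrapper script on the PATH that accepts the legacy `KEY=value` argument style (I, O and H for input, output and histogram file).

```python
import os

def picard_insert_size_commands(bam_file, out_dir):
    """Hypothetical helper: CleanSam then CollectInsertSizeMetrics."""
    base = os.path.splitext(os.path.basename(bam_file))[0]
    # Output names below are illustrative, not those used by the task
    cleaned = os.path.join(out_dir, base + ".cleaned.bam")
    metrics = os.path.join(out_dir, base + ".insert_size_metrics.txt")
    histogram = os.path.join(out_dir, base + ".insert_size_histogram.pdf")
    return [
        # Step 1: clean the alignments
        ["picard", "CleanSam", "I=" + bam_file, "O=" + cleaned],
        # Step 2: collect insert size metrics from the cleaned BAM
        ["picard", "CollectInsertSizeMetrics",
         "I=" + cleaned, "O=" + metrics, "H=" + histogram],
    ]
```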

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, out_dir)

Initialise the RunPicardCollectInsertSizeMetrics task

Parameters:
  • bam_files (list) – list of paths to BAM files to run CollectInsertSizeMetrics on

  • out_dir (str) – path to a directory where the output files will be written

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunQualimapRnaseq(_name, *args, **kws)

Run Qualimap’s ‘rnaseq’ module on BAM files

Given a list of BAM files, for each file runs the Qualimap ‘rnaseq’ module (http://qualimap.conesalab.org/doc_html/command_line.html#rna-seq-qc)

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, feature_file, out_dir, bam_properties)

Initialise the RunQualimapRnaseq task

Parameters:
  • bam_files (list) – list of paths to BAM files to run Qualimap rnaseq on

  • feature_file (str) – path to GTF file with the reference annotation data

  • out_dir (str) – path to a directory where the output files will be written

  • bam_properties (mapping) – properties for each BAM file from RSeQC ‘infer_experiment.py’ (used to determine if BAM is paired and what the strand-specificity is)
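A minimal sketch of how the properties for a single BAM file might be mapped onto Qualimap’s command-line options (-pe for paired-end data and -p for the sequencing protocol). The helper `qualimap_rnaseq_command` and the 0.8 threshold are assumptions made for illustration; the keys ‘paired_end’, ‘forward’ and ‘reverse’ follow the ‘experiments’ output of the RunRSeQCInferExperiment task.

```python
def qualimap_rnaseq_command(bam_file, feature_file, out_dir, properties,
                            threshold=0.8):
    """Hypothetical helper: build a 'qualimap rnaseq' command line."""
    cmd = ["qualimap", "rnaseq",
           "-bam", bam_file,
           "-gtf", feature_file,
           "-outdir", out_dir]
    if properties["paired_end"]:
        cmd.append("-pe")
    # Choose the protocol from the fraction of reads in each configuration
    # (the threshold is an illustrative assumption)
    if properties["forward"] >= threshold:
        protocol = "strand-specific-forward"
    elif properties["reverse"] >= threshold:
        protocol = "strand-specific-reverse"
    else:
        protocol = "non-strand-specific"
    cmd.extend(["-p", protocol])
    return cmd
```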

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunRSeQCGenebodyCoverage(_name, *args, **kws)

Run RSeQC’s ‘genebody_coverage.py’ on BAM files

Given a collection of BAM files, runs the RSeQC ‘genebody_coverage.py’ utility (http://rseqc.sourceforge.net/#genebody-coverage-py).

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, reference_gene_model, out_dir, name='rseqc')

Initialise the RunRSeQCGenebodyCoverage task

Parameters:
  • bam_files (list) – list of paths to BAM files to run genebody_coverage.py on

  • reference_gene_model (str) – path to BED file with the reference gene model data

  • out_dir (str) – path to a directory where the output files will be written

  • name (str) – optional basename for the output files (defaults to ‘rseqc’)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunRSeQCInferExperiment(_name, *args, **kws)

Run RSeQC’s ‘infer_experiment.py’ on BAM files

Given a list of BAM files, for each file runs the RSeQC ‘infer_experiment.py’ utility (http://rseqc.sourceforge.net/#infer-experiment-py).

The log for each run is written to a file called ‘<BASENAME>.infer_experiment.log’; the data are also extracted and put into an output parameter for direct consumption by downstream tasks.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(bam_files, reference_gene_model, out_dir)

Initialise the RunRSeQCInferExperiment task

Parameters:
  • bam_files (list) – list of paths to BAM files to run infer_experiment.py on

  • reference_gene_model (str) – path to BED file with the reference gene model data

  • out_dir (str) – path to a directory where the output files will be written

Outputs:
  • experiments – a dictionary with BAM files as keys; each value is another dictionary with keys ‘paired_end’ (True for paired-end data, False for single-end), ‘reverse’, ‘forward’ and ‘unstranded’ (fractions of reads mapped in each configuration).
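To illustrate the shape of this output, the sketch below parses the text printed by infer_experiment.py into a dictionary with the keys described above. The function `parse_infer_experiment` is hypothetical, and mapping the "failed to determine" fraction to ‘unstranded’ is an assumption rather than confirmed behaviour of the task.

```python
def parse_infer_experiment(text):
    """Hypothetical parser for infer_experiment.py output."""
    data = {"paired_end": None, "forward": None,
            "reverse": None, "unstranded": None}
    for line in text.splitlines():
        line = line.strip()
        if line == "This is PairEnd Data":
            data["paired_end"] = True
        elif line == "This is SingleEnd Data":
            data["paired_end"] = False
        elif line.startswith("Fraction of reads failed to determine:"):
            # Assumption: undetermined reads are reported as 'unstranded'
            data["unstranded"] = float(line.split(":")[-1])
        elif line.startswith("Fraction of reads explained by"):
            pattern = line.split('"')[1]
            value = float(line.split(":")[-1])
            if pattern in ("1++,1--,2+-,2-+", "++,--"):
                data["forward"] = value
            elif pattern in ("1+-,1-+,2++,2--", "+-,-+"):
                data["reverse"] = value
    return data

log_text = """This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925
"""
# One entry per BAM file (the path is illustrative)
experiments = {"/path/to/sample1.bam": parse_infer_experiment(log_text)}
```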

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetCellCountFromCellranger(_name, *args, **kws)

Update the number of cells in the project metadata from ‘cellranger count’ or ‘cellranger multi’ output

init(project, qc_dir=None, source='count')

Initialise the SetCellCountFromCellranger task.

Parameters:
  • project (AnalysisProject) – project to update the number of cells for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • source (str) – either ‘count’ (the default) or ‘multi’
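As an illustration of the kind of value this task extracts, the sketch below reads the cell count from the text of a ‘cellranger count’ metrics_summary.csv file. It assumes the file has a header row containing an "Estimated Number of Cells" column and a single data row with thousands separators; the function `estimated_cell_count` is hypothetical, and other packages or the ‘multi’ source use different layouts that are not handled here.

```python
import csv
import io

def estimated_cell_count(metrics_csv_text):
    """Hypothetical helper: cell count from metrics_summary.csv text."""
    reader = csv.DictReader(io.StringIO(metrics_csv_text))
    row = next(reader)
    # Values are written with thousands separators, e.g. "5,247"
    return int(row["Estimated Number of Cells"].replace(",", ""))
```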

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetupFastqStrandConf(_name, *args, **kws)

Set up a fastq_strand.conf file

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir=None, organism=None, star_indexes=None)

Initialise the SetupFastqStrandConf task.

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – if supplied then points to directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • organism (str) – if supplied then must be a string with the names of one or more organisms, with multiple organisms separated by spaces (defaults to the organisms associated with the project)

  • star_indexes (dict) – dictionary mapping normalised organism names to STAR indexes

Outputs:
  • fastq_strand_conf (PipelineParam) – pipeline parameter instance that resolves to a string with the path to the generated config file.
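A minimal sketch of how the organism string and the STAR index mapping might be combined into the lines of a config file. It assumes that each line holds an organism name and a STAR index path separated by a tab, that normalisation amounts to lowercasing the name, and that organisms without an index are skipped; the helper `fastq_strand_conf_lines` is hypothetical.

```python
def fastq_strand_conf_lines(organism, star_indexes):
    """Hypothetical helper: build lines for a fastq_strand conf file."""
    lines = []
    # Multiple organisms are separated by spaces
    for name in organism.split():
        key = name.lower()  # assumed normalisation of the organism name
        if key in star_indexes:
            lines.append("%s\t%s" % (key, star_indexes[key]))
    return lines
```

If no organism has a matching index then the list is empty, in which case there would be nothing to write to a config file.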

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetupQCDirs(_name, *args, **kws)

Set up the directories for the QC run

init(project, qc_dir, log_dir=None, protocol=None)

Initialise the SetupQCDirs task

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • log_dir (str) – directory for log files (defaults to ‘logs’ subdirectory of the QC directory)

  • protocol (QCProtocol) – QC protocol being used

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SplitFastqsByLane(_name, *args, **kws)

Split reads into multiple Fastqs according to lane

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, out_dir)

Initialise the SplitFastqsByLane task

Parameters:
  • project (AnalysisProject) – project with source Fastqs to split by lane

  • out_dir (str) – path to directory where split Fastqs will be written

Outputs:
  • fastqs (list) – list of paths to output Fastqs split by lanes
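The idea of splitting by lane can be sketched as follows, assuming standard Illumina read headers of the form `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...`, where the lane is the fourth colon-separated field. The function `split_reads_by_lane` is hypothetical and works on lines already held in memory, whereas the task writes new Fastq files to out_dir.

```python
from collections import defaultdict

def split_reads_by_lane(fastq_lines):
    """Hypothetical helper: group 4-line Fastq records by lane number."""
    lanes = defaultdict(list)
    for i in range(0, len(fastq_lines), 4):
        record = fastq_lines[i:i + 4]
        # The lane is the fourth colon-separated field of the header
        lane = int(record[0].split(":")[3])
        lanes[lane].extend(record)
    return dict(lanes)
```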

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.UpdateQCMetadata(_name, *args, **kws)

Update the metadata stored for this QC run

init(project, qc_dir, metadata, legacy_screens=False)

Initialise the UpdateQCMetadata task

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • metadata (dict) – mapping of metadata items to values

  • legacy_screens (bool) – if True then ‘legacy’ naming convention was used for FastqScreen outputs

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.VerifyFastqs(_name, *args, **kws)

Check Fastqs are valid

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, fastq_attrs=None)

Initialise the VerifyFastqs task

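Parameters:
  • project (AnalysisProject) – project with Fastqs to verify

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

setup()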

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.VerifyQC(_name, *args, **kws)

Verify outputs from the QC pipeline

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, protocol, fastqs)

Initialise the VerifyQC task.

Parameters:
  • project (AnalysisProject) – project to verify the QC outputs for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • protocol (QCProtocol) – QC protocol to verify against

  • fastqs (list) – Fastqs to include in the verification

setup()

Set up commands to be performed by the task

Must be implemented by the subclass