auto_process_ngs.qc.pipeline

Pipeline components for running the QC pipeline.

Pipeline classes:

  • QCPipeline

Pipeline task classes:

  • SetupQCDirs

  • SplitFastqsByLane

  • GetSequenceDataSamples

  • GetSequenceDataFastqs

  • UpdateQCMetadata

  • VerifyFastqs

  • Set10xCellCount

  • GetReferenceDataset

  • GetBAMFiles

  • ConvertGTFToBed

  • VerifyQC

  • ReportQC

class auto_process_ngs.qc.pipeline.ConvertGTFToBed(_name, *args, **kws)

Convert a GTF file to a BED file using BEDOPS ‘gtf2bed’

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(gtf_in, bed_out)

Initialise the ConvertGTFToBed task

Parameters:
  • gtf_in (str) – path to the input GTF file

  • bed_out (str) – path to the output BED file
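
As a rough illustration of the conversion this task wraps: BEDOPS ‘gtf2bed’ reads GTF on standard input and writes BED to standard output. The helper below only builds the shell command string; `gtf2bed_command` is a hypothetical name that is not part of auto_process_ngs, and the task itself may assemble and run the command differently.

```python
import shlex

def gtf2bed_command(gtf_in, bed_out):
    """Build a shell command converting GTF to BED with BEDOPS gtf2bed.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    # gtf2bed reads GTF from stdin and writes BED to stdout, so the
    # input and output files are attached using shell redirection.
    # shlex.quote protects paths containing spaces or shell characters.
    return "gtf2bed < %s > %s" % (shlex.quote(gtf_in),
                                  shlex.quote(bed_out))

print(gtf2bed_command("annotation.gtf", "annotation.bed"))
```

Quoting the paths keeps the command safe to pass to a shell when file names contain spaces.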

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetBAMFiles(_name, *args, **kws)

Create BAM files from Fastqs using STAR

Runs STAR to generate BAM files from Fastq files. The BAMs are then sorted and indexed using samtools.

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(fastqs, star_index, out_dir, subset_size=None, nthreads=None, reads=None, include_samples=None, fastq_attrs=None, verbose=False)

Initialise the GetBAMFiles task

Parameters:
  • fastqs (list) – list of Fastq files to generate BAM files from

  • star_index (str) – path to STAR index to use

  • out_dir (str) – path to directory to write final BAM files to

  • subset_size (int) – specify size of a random subset of reads to use in BAM file generation

  • nthreads (int) – number of cores for STAR to use

  • reads (list) – optional, list of read numbers to include (e.g. [1,2], [2] etc)

  • include_samples (list) – optional, list of sample names to include

  • fastq_attrs (IlluminaFastq) – optional, class to use for extracting information from Fastq file names

  • verbose (bool) – if True then print additional information from the task

Outputs:

bam_files: list of sorted BAM files
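
The align, sort and index sequence described above can be sketched as three command lines. This is a minimal sketch under stated assumptions, not the task's actual implementation: `bam_commands`, the `prefix` argument and the output file naming are hypothetical, and only widely used STAR and samtools options are shown.

```python
import os

def bam_commands(fastqs, star_index, out_dir, prefix, nthreads=1):
    """Return argument lists for aligning, sorting and indexing.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    # With '--outSAMtype BAM Unsorted' STAR writes its alignments to
    # '<prefix>Aligned.out.bam'
    unsorted_bam = prefix + "Aligned.out.bam"
    # Assumed naming for the final sorted BAM file
    sorted_bam = os.path.join(out_dir,
                              os.path.basename(prefix) + "sorted.bam")
    star = ["STAR",
            "--runThreadN", str(nthreads),
            "--genomeDir", star_index,
            "--readFilesIn"] + list(fastqs) + [
            "--outSAMtype", "BAM", "Unsorted",
            "--outFileNamePrefix", prefix]
    if any(fq.endswith(".gz") for fq in fastqs):
        # Gzipped Fastqs must be decompressed on the fly
        star += ["--readFilesCommand", "zcat"]
    sort = ["samtools", "sort", "-@", str(nthreads),
            "-o", sorted_bam, unsorted_bam]
    index = ["samtools", "index", sorted_bam]
    return [star, sort, index]

for cmd in bam_commands(["AB1_R1.fastq.gz"], "/data/star/hg38",
                        "bams", "work/AB1_", nthreads=4):
    print(" ".join(cmd))
```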

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetReferenceDataset(_name, *args, **kws)

Acquire reference data for an organism from mapping

Generic lookup task which attempts to locate the matching reference dataset from a mapping/dictionary.

init(organism, references, force_reference=None)

Initialise the GetReferenceDataset task

Parameters:
  • organism (str) – name of the organism

  • references (mapping) – mapping with organism names as keys and reference datasets as corresponding values

  • force_reference (str) – if specified then return the supplied value instead of determining from the organism

Outputs:
reference_dataset: reference dataset (set to None if no dataset could be located)
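
The lookup behaviour described above can be sketched with a plain dictionary. `lookup_reference` is a hypothetical stand-in rather than the task's own code, and the paths are made up for illustration; any normalisation of organism names that the real task may perform is not shown.

```python
def lookup_reference(organism, references, force_reference=None):
    """Return the reference dataset for an organism, or None.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    if force_reference:
        # An explicitly supplied value takes precedence over the mapping
        return force_reference
    # Mapping.get returns None when the organism has no entry
    return references.get(organism)

# Made-up example mapping of organism names to STAR indexes
star_indexes = {"human": "/data/star/hg38", "mouse": "/data/star/mm10"}
print(lookup_reference("human", star_indexes))
print(lookup_reference("zebrafish", star_indexes))
print(lookup_reference("human", star_indexes, force_reference="/tmp/custom"))
```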

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSequenceDataFastqs(_name, *args, **kws)

Set up Fastqs with sequence (i.e. biological) data

init(project, out_dir, read_range, samples, fastq_attrs, fastqs=None)

Initialise the GetSequenceDataFastqs task

Parameters:
  • project (AnalysisProject) – project to get Fastqs for

  • out_dir (str) – path to directory to write final Fastq files to

  • read_range (dict) – mapping of read names to tuples of subsequence ranges

  • samples (list) – list of samples with sequence data

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names

  • fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)

Outputs:
fastqs (list): list of Fastqs with biological data

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.GetSequenceDataSamples(_name, *args, **kws)

Identify samples with sequence (i.e. biological) data

init(project, fastq_attrs)

Initialise the GetSequenceDataSamples task

Parameters:
Outputs:
seq_data_samples (list): list of samples with biological data

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.QCPipeline

Run the QC pipeline on one or more projects

Pipeline to run QC on multiple projects.

Example usage:

>>> qc = QCPipeline()
>>> qc.add_project(AnalysisProject("AB","./AB"))
>>> qc.add_project(AnalysisProject("CDE","./CDE"))
>>> qc.run()

add_project(project, protocol, qc_dir=None, organism=None, fastq_dir=None, report_html=None, multiqc=False, sample_pattern=None, log_dir=None, convert_gtf=True, verify_fastqs=False, split_fastqs_by_lane=False)

Add a project to the QC pipeline

Parameters:
  • project (AnalysisProject) – project to run QC for

  • protocol (QCProtocol) – QC protocol to use

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • organism (str) – organism(s) for project (defaults to organism defined in project metadata)

  • fastq_dir (str) – directory holding Fastq files (defaults to primary fastq_dir in project)

  • multiqc (bool) – if True then also run MultiQC (default is not to run MultiQC)

  • sample_pattern (str) – glob-style pattern to match a subset of projects and samples (not implemented)

  • log_dir (str) – directory to write log files to (defaults to ‘logs’ subdirectory of the QC directory)

  • convert_gtf (bool) – if True (the default) then convert GTF files to BED for ‘infer_experiment.py’; otherwise only use the explicitly defined BED files

  • verify_fastqs (bool) – if True then verify Fastq integrity as part of the pipeline (default: False, skip verification)

  • split_fastqs_by_lane (bool) – if True then split input Fastqs into lanes and run QC per lane (default: False, don’t split QC by lanes)
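
The documented defaults for ‘qc_dir’ and ‘log_dir’ can be sketched as follows. `resolve_qc_dirs` is a hypothetical helper written only to illustrate the fallback order; it is not part of auto_process_ngs and the paths are made up.

```python
import os

def resolve_qc_dirs(project_dir, qc_dir=None, log_dir=None):
    """Apply the documented defaults for the QC and log directories.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    if qc_dir is None:
        # Default: 'qc' subdirectory of the project directory
        qc_dir = os.path.join(project_dir, "qc")
    if log_dir is None:
        # Default: 'logs' subdirectory of the QC directory
        log_dir = os.path.join(qc_dir, "logs")
    return qc_dir, log_dir

print(resolve_qc_dirs("/data/AB"))
print(resolve_qc_dirs("/data/AB", qc_dir="/scratch/AB_qc"))
```

Note that the log directory default follows the QC directory, so overriding ‘qc_dir’ alone also moves the logs.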

add_task(task, requires=(), **kws)

Override base class method

Automatically set log dir when tasks are added

property default_log_dir

Return current value of default log dir

run(nthreads=None, fastq_screens=None, star_indexes=None, annotation_bed_files=None, annotation_gtf_files=None, fastq_subset=None, cellranger_chemistry='auto', cellranger_force_cells=None, cellranger_transcriptomes=None, cellranger_premrna_references=None, cellranger_atac_references=None, cellranger_arc_references=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, cellranger_exe=None, cellranger_extra_projects=None, cellranger_reference_dataset=None, cellranger_out_dir=None, force_star_index=None, force_gtf_annotation=None, working_dir=None, log_file=None, batch_size=None, batch_limit=None, max_jobs=1, max_slots=None, poll_interval=5, runners=None, default_runner=None, enable_conda=False, conda=None, conda_env_dir=None, envmodules=None, shorten_zip_paths=False, legacy_screens=False, verbose=False)

Run the tasks in the pipeline

Parameters:
  • nthreads (int) – number of threads/processors to use for QC jobs (defaults to number of slots set in job runners)

  • fastq_screens (dict) – mapping of screen IDs to FastqScreen conf files

  • star_indexes (dict) – mapping of organism IDs to directories with STAR indexes

  • annotation_bed_files (dict) – mapping of organism IDs to BED files with annotation data

  • annotation_gtf_files (dict) – mapping of organism IDs to GTF files with annotation data

  • fastq_subset (int) – explicitly specify the subset size to use when subsetting Fastqs

  • cellranger_chemistry (str) – explicitly specify the assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)

  • cellranger_force_cells (int) – explicitly specify number of cells for ‘cellranger’ and ‘cellranger-atac’ (set to ‘None’ to use the cell detection algorithm; ignored for ‘cellranger-arc’)

  • cellranger_transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger

  • cellranger_premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger

  • cellranger_atac_references (mapping) – mapping of organism names to ATAC-seq reference genome data for cellranger-atac

  • cellranger_arc_references (mapping) – mapping of organism names to multiome reference datasets for cellranger-arc

  • cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)

  • cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)

  • cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)

  • cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)

  • cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (default: None)

  • cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)

  • cellranger_exe (str) – optional, explicitly specify the cellranger executable to use for single library analysis (default: cellranger executable is determined automatically)

  • cellranger_extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline

  • cellranger_reference_dataset (str) – optional, explicitly specify the path to the reference dataset to use for single library analysis (default: reference dataset is determined automatically)

  • cellranger_out_dir (str) – specify directory to put full cellranger outputs into (default: project directory)

  • force_star_index (str) – explicitly specify STAR index to use (default: index is determined automatically)

  • force_gtf_annotation (str) – explicitly specify GTF annotation to use (default: annotation file is determined automatically)

  • working_dir (str) – optional path to a working directory (defaults to temporary directory in the current directory)

  • log_dir (str) – path of directory where log files will be written to

  • batch_size (int) – if set then run commands in each task in batches, with each batch running this many commands at a time (default is to run one command per job)

  • batch_limit (int) – if set then run commands in each task in batches, with the batch size set dynamically so as not to exceed this limit (default is to use fixed batch sizes)

  • max_jobs (int) – optional maximum number of concurrent jobs in scheduler (defaults to 1)

  • max_slots (int) – optional maximum number of ‘slots’ (i.e. concurrent threads or maximum number of CPUs) available to the scheduler (defaults to no limit)

  • poll_interval (float) – optional polling interval (seconds) to set in scheduler (defaults to 5s)

  • runners (dict) – mapping of names to JobRunner instances; valid names are ‘fastqc_runner’, ‘fastq_screen_runner’, ‘star_runner’, ‘rseqc_runner’, ‘qualimap_runner’, ‘cellranger_count_runner’, ‘cellranger_multi_runner’, ‘report_runner’, ‘verify_runner’, and ‘default’

  • enable_conda (bool) – if True then enable use of conda environments to satisfy task dependencies

  • conda (str) – path to conda

  • conda_env_dir (str) – path to non-default directory for conda environments

  • envmodules (mapping) – mapping of names to environment module file lists; valid names are ‘fastqc’, ‘fastq_screen’, ‘fastq_strand’, ‘cellranger’, ‘report_qc’

  • default_runner (JobRunner) – optional default job runner to use

  • shorten_zip_paths (bool) – if True then rewrite the file paths in the ZIP QC reports to shorten them for systems without long filename support

  • legacy_screens (bool) – if True then use ‘legacy’ naming convention for FastqScreen outputs

  • verbose (bool) – if True then report additional information for diagnostics
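
The difference between ‘batch_size’ and ‘batch_limit’ can be sketched as below. This is an illustration only: `make_batches` is a hypothetical helper, and it assumes that ‘batch_limit’ caps the number of batches per task and takes precedence over ‘batch_size’ when both are given. The pipeline's own batching logic may differ.

```python
import math

def make_batches(commands, batch_size=None, batch_limit=None):
    """Group commands into batches.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    commands = list(commands)
    if not commands:
        return []
    if batch_limit is not None:
        # Assumption: batch_limit is a cap on the number of batches,
        # so pick the smallest batch size that stays within the cap
        batch_size = math.ceil(len(commands) / batch_limit)
    elif batch_size is None:
        # Default: one command per job
        batch_size = 1
    return [commands[i:i + batch_size]
            for i in range(0, len(commands), batch_size)]

cmds = ["cmd%d" % i for i in range(10)]
print([len(b) for b in make_batches(cmds, batch_size=4)])
print([len(b) for b in make_batches(cmds, batch_limit=4)])
```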

set_default_log_dir(log_dir)

Set the default log directory for tasks

class auto_process_ngs.qc.pipeline.ReportQC(_name, *args, **kws)

Generate the QC report

init(project, qc_dir, report_html=None, fastq_dir=None, force=False, zip_outputs=True, shorten_zip_paths=False)

Initialise the ReportQC task.

Parameters:
  • project (AnalysisProject) – project to generate QC report for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • report_html (str) – set the name of the output HTML file for the report

  • fastq_dir (str) – directory holding Fastq files (defaults to current fastq_dir in project)

  • force (bool) – if True then force HTML report to be generated even if QC outputs fail verification (default: don’t write report)

  • zip_outputs (bool) – if True then also generate a ZIP archive of the QC reports

  • shorten_zip_paths (bool) – if True then rewrites file paths in ZIP archive to short versions (default: don’t shorten paths)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.RunMultiQC(_name, *args, **kws)

Run MultiQC

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, fastq_dir=None)

Initialise the RunMultiQC task.

Parameters:
  • project (AnalysisProject) – project to generate QC report for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • fastq_dir (str) – directory holding Fastq files (defaults to current fastq_dir in project)

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.Set10xCellCount(_name, *args, **kws)

Update the number of cells in the project metadata from ‘cellranger* count’ or ‘cellranger multi’ output

init(project, qc_dir=None, tenx_pipeline='cellranger', source='count')

Initialise the Set10xCellCount task.

Parameters:
  • project (AnalysisProject) – project to update the number of cells for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • tenx_pipeline (str) – one of ‘cellranger’ (the default), ‘cellranger-atac’ or ‘cellranger-arc’

  • source (str) – either ‘count’ (the default) or ‘multi’
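
As an illustration of where a cell count might come from, the sketch below parses text shaped like the ‘metrics_summary.csv’ file written by ‘cellranger count’, which reports an ‘Estimated Number of Cells’ column with thousands separators. `estimated_cells` is a hypothetical helper and the CSV content is made up; the files and fields that the task actually reads (particularly for ‘cellranger multi’) are not shown here.

```python
import csv
import io

def estimated_cells(metrics_csv_text):
    """Extract the estimated cell count from metrics CSV text.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    reader = csv.DictReader(io.StringIO(metrics_csv_text))
    # The summary holds a single data row below the header
    row = next(reader)
    # Values use thousands separators (e.g. "5,432"), so strip commas
    return int(row["Estimated Number of Cells"].replace(",", ""))

# Made-up example content
example = ('Estimated Number of Cells,Mean Reads per Cell\n'
           '"5,432","48,210"\n')
print(estimated_cells(example))
```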

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SetupQCDirs(_name, *args, **kws)

Set up the directories for the QC run

init(project, qc_dir, log_dir=None, protocol=None)

Initialise the SetupQCDirs task

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • log_dir (str) – directory for log files (defaults to ‘logs’ subdirectory of the QC directory)

  • protocol (QCProtocol) – QC protocol being used

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.SplitFastqsByLane(_name, *args, **kws)

Split reads into multiple Fastqs according to lane

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, out_dir)

Initialise the SplitFastqsByLane task

Parameters:
  • project (AnalysisProject) – project with source Fastqs to split by lane

  • out_dir (str) – path to directory where split Fastqs will be written

Outputs:
fastqs (list): list of paths to output Fastqs split by lanes
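
The idea of splitting by lane can be sketched using the lane number carried in Illumina read headers, where it is the fourth colon-separated field (‘@instrument:run:flowcell:lane:tile:x:y ...’). `split_by_lane` is a hypothetical helper operating on in-memory lines with made-up reads; it is not the task's implementation, which works on Fastq files.

```python
from collections import defaultdict

def split_by_lane(fastq_lines):
    """Group Fastq records (four lines each) by lane number.

    Hypothetical helper for illustration; not part of auto_process_ngs.
    """
    lanes = defaultdict(list)
    for i in range(0, len(fastq_lines), 4):
        record = fastq_lines[i:i + 4]
        # Header: @instrument:run:flowcell:lane:tile:x:y ...
        lane = int(record[0].split(":")[3])
        lanes[lane].extend(record)
    return dict(lanes)

# Made-up example reads from two lanes
reads = [
    "@M01:12:000000000-ABCDE:1:1101:15589:1332 1:N:0:1", "ACGT", "+", "IIII",
    "@M01:12:000000000-ABCDE:2:1101:15590:1333 1:N:0:1", "TTGA", "+", "IIII",
    "@M01:12:000000000-ABCDE:1:1101:15591:1334 1:N:0:1", "GGCC", "+", "IIII",
]
for lane, lines in sorted(split_by_lane(reads).items()):
    print("Lane %d: %d reads" % (lane, len(lines) // 4))
```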

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.UpdateQCMetadata(_name, *args, **kws)

Update the metadata stored for this QC run

init(project, qc_dir, metadata, legacy_screens=False)

Initialise the UpdateQCMetadata task

Parameters:
  • project (AnalysisProject) – project to run QC for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • metadata (dict) – mapping of metadata items to values

  • legacy_screens (bool) – if True then ‘legacy’ naming convention was used for FastqScreen outputs

setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.VerifyFastqs(_name, *args, **kws)

Check Fastqs are valid

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, fastq_attrs=None)

Initialise the VerifyFastqs task

Parameters:
setup()

Set up commands to be performed by the task

Must be implemented by the subclass

class auto_process_ngs.qc.pipeline.VerifyQC(_name, *args, **kws)

Verify outputs from the QC pipeline

finish()

Perform actions on task completion

Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.

Must be implemented by the subclass

init(project, qc_dir, protocol, fastqs)

Initialise the VerifyQC task.

Parameters:
  • project (AnalysisProject) – project to verify QC outputs for

  • qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)

  • protocol (QCProtocol) – QC protocol to verify against

  • fastqs (list) – Fastqs to include in the verification

setup()

Set up commands to be performed by the task

Must be implemented by the subclass