auto_process_ngs.qc.pipeline
Pipeline components for running the QC pipeline.
Pipeline classes:
QCPipeline
Pipeline task classes:
SetupQCDirs
SplitFastqsByLane
GetSequenceDataSamples
GetSequenceDataFastqs
UpdateQCMetadata
VerifyFastqs
Set10xCellCount
GetReferenceDataset
GetBAMFiles
ConvertGTFToBed
VerifyQC
ReportQC
- class auto_process_ngs.qc.pipeline.ConvertGTFToBed(_name, *args, **kws)
Convert a GTF file to a BED file using BEDOPS ‘gtf2bed’
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(gtf_in, bed_out)
Initialise the ConvertGTFToBed task
- Parameters:
gtf_in (str) – path to the input GTF file
bed_out (str) – path to the output BED file
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
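For orientation, BEDOPS ‘gtf2bed’ works as a filter: it reads GTF on standard input and writes BED to standard output. The sketch below builds such a shell command from the two task parameters. The helper name `gtf2bed_command` is hypothetical, and the exact command line that the task constructs is not shown in this reference.

```python
import shlex


def gtf2bed_command(gtf_in, bed_out):
    """Build a shell command that pipes a GTF file through gtf2bed.

    Hypothetical helper for illustration only; it is not part of
    auto_process_ngs.
    """
    # gtf2bed reads GTF on stdin and writes BED on stdout, so the
    # input and output files are attached with shell redirection
    return "gtf2bed < {} > {}".format(shlex.quote(gtf_in),
                                      shlex.quote(bed_out))


cmd = gtf2bed_command("annotation.gtf", "annotation.bed")
```

Quoting the paths with `shlex.quote` keeps the command safe when file names contain spaces or shell metacharacters.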
- class auto_process_ngs.qc.pipeline.GetBAMFiles(_name, *args, **kws)
Create BAM files from Fastqs using STAR
Runs STAR to generate BAM files from Fastq files. The BAMs are then sorted and indexed using samtools.
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(fastqs, star_index, out_dir, subset_size=None, nthreads=None, reads=None, include_samples=None, fastq_attrs=None, verbose=False)
Initialise the GetBAMFiles task
- Parameters:
fastqs (list) – list of Fastq files to generate BAM files from
star_index (str) – path to STAR index to use
out_dir (str) – path to directory to write final BAM files to
subset_size (int) – specify size of a random subset of reads to use in BAM file generation
nthreads (int) – number of cores for STAR to use
reads (list) – optional, list of read numbers to include (e.g. [1,2], [2] etc)
include_samples (list) – optional, list of sample names to include
fastq_attrs (IlluminaFastq) – optional, class to use for extracting information from Fastq file names
verbose (bool) – if True then print additional information from the task
- Outputs:
bam_files: list of sorted BAM files
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
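The `reads` and `include_samples` parameters narrow down which Fastqs are used. The following is a minimal sketch of that kind of filtering, assuming Illumina-style file names of the form `<sample>_S<n>[_L<lane>]_R<read>_001.fastq[.gz]`. In the task itself the name parsing is delegated to the `fastq_attrs` class; the regular expression and the function `select_fastqs` here are stand-ins invented for illustration.

```python
import os
import re

# Assumed Illumina-style name:
#   <sample>_S<n>[_L<lane>]_R<read>_001.fastq[.gz]
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+?)_S\d+(?:_L\d{3})?"
    r"_(?P<kind>[RI])(?P<number>\d)_001\.fastq(?:\.gz)?$"
)


def select_fastqs(fastqs, reads=None, include_samples=None):
    """Return the Fastqs matching the requested reads and samples.

    Hypothetical helper; not the implementation used by GetBAMFiles.
    """
    selected = []
    for fq in fastqs:
        match = FASTQ_NAME.match(os.path.basename(fq))
        if match is None or match.group("kind") != "R":
            # Skip index reads and names that don't fit the pattern
            continue
        if reads is not None and int(match.group("number")) not in reads:
            continue
        if (include_samples is not None
                and match.group("sample") not in include_samples):
            continue
        selected.append(fq)
    return selected


fastqs = [
    "AB1_S1_R1_001.fastq.gz",
    "AB1_S1_R2_001.fastq.gz",
    "AB2_S2_L001_R1_001.fastq.gz",
    "AB1_S1_I1_001.fastq.gz",
]
r2_only = select_fastqs(fastqs, reads=[2])
ab2_only = select_fastqs(fastqs, include_samples=["AB2"])
```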
- class auto_process_ngs.qc.pipeline.GetReferenceDataset(_name, *args, **kws)
Acquire reference data for an organism from a mapping
Generic lookup task which attempts to locate the matching reference dataset from a mapping/dictionary.
- init(organism, references, force_reference=None)
Initialise the GetReferenceDataset task
- Parameters:
organism (str) – name of the organism
references (mapping) – mapping with organism names as keys and reference datasets as corresponding values
force_reference (str) – if specified then return the supplied value instead of determining from the organism
- Outputs:
reference_dataset: reference dataset (set to None if no dataset could be located)
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
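The lookup described above can be sketched as a plain function. This is an illustration under stated assumptions, not the task's implementation: the function name `get_reference_dataset` is hypothetical, and the organism name is used as the key exactly as given, whereas the real task may normalise it first.

```python
def get_reference_dataset(organism, references, force_reference=None):
    """Look up a reference dataset for an organism.

    Hypothetical stand-in for the lookup performed by the task.
    """
    if force_reference is not None:
        # An explicitly supplied value overrides the lookup
        return force_reference
    if organism is None or references is None:
        return None
    # Returns None when the organism has no entry in the mapping
    return references.get(organism)


star_indexes = {"human": "/data/star/hg38", "mouse": "/data/star/mm10"}
found = get_reference_dataset("mouse", star_indexes)
missing = get_reference_dataset("zebrafish", star_indexes)
forced = get_reference_dataset("mouse", star_indexes,
                               force_reference="/data/star/custom")
```

The paths in `star_indexes` are placeholders chosen for the example.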
- class auto_process_ngs.qc.pipeline.GetSequenceDataFastqs(_name, *args, **kws)
Set up Fastqs with sequence (i.e. biological) data
- init(project, out_dir, read_range, samples, fastq_attrs, fastqs=None)
Initialise the GetSequenceDataFastqs task
- Parameters:
project (AnalysisProject) – project to get Fastqs for
out_dir (str) – path to directory to write final Fastq files to
read_range (dict) – mapping of read names to tuples of subsequence ranges
samples (list) – list of samples with sequence data
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
fastqs (list) – optional, list of Fastq files (overrides Fastqs in project)
- Outputs:
fastqs (list): list of Fastqs with biological data
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.pipeline.GetSequenceDataSamples(_name, *args, **kws)
Identify samples with sequence (i.e. biological) data
- init(project, fastq_attrs)
Initialise the GetSequenceDataSamples task
- Parameters:
project (AnalysisProject) – project to get samples for
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
- Outputs:
seq_data_samples (list): list of samples with biological data
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.pipeline.QCPipeline
Run the QC pipeline on one or more projects
Pipeline to run QC on multiple projects.
Example usage:
>>> qc = QCPipeline()
>>> qc.add_project(AnalysisProject("AB","./AB"))
>>> qc.add_project(AnalysisProject("CDE","./CDE"))
>>> qc.run()
- add_project(project, protocol, qc_dir=None, organism=None, fastq_dir=None, report_html=None, multiqc=False, sample_pattern=None, log_dir=None, convert_gtf=True, verify_fastqs=False, split_fastqs_by_lane=False)
Add a project to the QC pipeline
- Parameters:
project (AnalysisProject) – project to run QC for
protocol (QCProtocol) – QC protocol to use
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
organism (str) – organism(s) for project (defaults to organism defined in project metadata)
fastq_dir (str) – directory holding Fastq files (defaults to primary fastq_dir in project)
multiqc (bool) – if True then also run MultiQC (default is not to run MultiQC)
sample_pattern (str) – glob-style pattern to match a subset of projects and samples (not implemented)
log_dir (str) – directory to write log files to (defaults to ‘logs’ subdirectory of the QC directory)
convert_gtf (bool) – if True then convert GTF files to BED for ‘infer_experiment.py’ (default; otherwise only use the explicitly defined BED files)
verify_fastqs (bool) – if True then verify Fastq integrity as part of the pipeline (default: False, skip verification)
split_fastqs_by_lane (bool) – if True then split input Fastqs into lanes and run QC as per-lane (default: False, don’t split QC by lanes)
- add_task(task, requires=(), **kws)
Override base class method
Automatically set log dir when tasks are added
- property default_log_dir
Return current value of default log dir
- run(nthreads=None, fastq_screens=None, star_indexes=None, annotation_bed_files=None, annotation_gtf_files=None, fastq_subset=None, cellranger_chemistry='auto', cellranger_force_cells=None, cellranger_transcriptomes=None, cellranger_premrna_references=None, cellranger_atac_references=None, cellranger_arc_references=None, cellranger_jobmode='local', cellranger_maxjobs=None, cellranger_mempercore=None, cellranger_jobinterval=None, cellranger_localcores=None, cellranger_localmem=None, cellranger_exe=None, cellranger_extra_projects=None, cellranger_reference_dataset=None, cellranger_out_dir=None, force_star_index=None, force_gtf_annotation=None, working_dir=None, log_file=None, batch_size=None, batch_limit=None, max_jobs=1, max_slots=None, poll_interval=5, runners=None, default_runner=None, enable_conda=False, conda=None, conda_env_dir=None, envmodules=None, shorten_zip_paths=False, legacy_screens=False, verbose=False)
Run the tasks in the pipeline
- Parameters:
nthreads (int) – number of threads/processors to use for QC jobs (defaults to number of slots set in job runners)
fastq_screens (dict) – mapping of screen IDs to FastqScreen conf files
star_indexes (dict) – mapping of organism IDs to directories with STAR indexes
annotation_bed_files (dict) – mapping of organism IDs to BED files with annotation data
annotation_gtf_files (dict) – mapping of organism IDs to GTF files with annotation data
fastq_subset (int) – explicitly specify the subset size to use when subsetting Fastqs
cellranger_chemistry (str) – explicitly specify the assay configuration (set to ‘auto’ to let cellranger determine this automatically; ignored if not scRNA-seq)
cellranger_force_cells (int) – explicitly specify number of cells for ‘cellranger’ and ‘cellranger-atac’ (set to ‘None’ to use the cell detection algorithm; ignored for ‘cellranger-arc’)
cellranger_transcriptomes (mapping) – mapping of organism names to reference transcriptome data for cellranger
cellranger_premrna_references (mapping) – mapping of organism names to “pre-mRNA” reference data for cellranger
cellranger_atac_references (mapping) – mapping of organism names to ATAC-seq reference genome data for cellranger-atac
cellranger_arc_references (mapping) – mapping of organism names to multiome reference datasets for cellranger-arc
cellranger_jobmode (str) – specify the job mode to pass to cellranger (default: “local”)
cellranger_maxjobs (int) – specify the maximum number of jobs to pass to cellranger (default: None)
cellranger_mempercore (int) – specify the memory per core (in Gb) to pass to cellranger (default: None)
cellranger_jobinterval (int) – specify the interval between launching jobs (in ms) to pass to cellranger (default: None)
cellranger_localcores (int) – maximum number of cores cellranger can request in jobmode ‘local’ (default: None)
cellranger_localmem (int) – maximum memory cellranger can request in jobmode ‘local’ (default: None)
cellranger_exe (str) – optional, explicitly specify the cellranger executable to use for single library analysis (default: cellranger executable is determined automatically)
cellranger_extra_projects (list) – optional list of extra AnalysisProjects to include Fastqs from when running cellranger pipeline
cellranger_reference_dataset (str) – optional, explicitly specify the path to the reference dataset to use for single library analysis (default: reference dataset is determined automatically)
cellranger_out_dir (str) – specify directory to put full cellranger outputs into (default: project directory)
force_star_index (str) – explicitly specify STAR index to use (default: index is determined automatically)
force_gtf_annotation (str) – explicitly specify GTF annotation to use (default: annotation file is determined automatically)
working_dir (str) – optional path to a working directory (defaults to temporary directory in the current directory)
log_dir (str) – path of directory where log files will be written to
batch_size (int) – if set then run commands in each task in batches, with each batch running this many commands at a time (default is to run one command per job)
batch_limit (int) – if set then run commands in each task in batches, with the batch size set dynamically so as not to exceed this limit (default is to use fixed batch sizes)
max_jobs (int) – optional maximum number of concurrent jobs in scheduler (defaults to 1)
max_slots (int) – optional maximum number of ‘slots’ (i.e. concurrent threads or maximum number of CPUs) available to the scheduler (defaults to no limit)
poll_interval (float) – optional polling interval (seconds) to set in scheduler (defaults to 5s)
runners (dict) – mapping of names to JobRunner instances; valid names are ‘fastqc_runner’, ‘fastq_screen_runner’, ‘star_runner’, ‘rseqc_runner’, ‘qualimap_runner’, ‘cellranger_count_runner’, ‘cellranger_multi_runner’, ‘report_runner’, ‘verify_runner’, and ‘default’
enable_conda (bool) – if True then enable use of conda environments to satisfy task dependencies
conda (str) – path to conda
conda_env_dir (str) – path to non-default directory for conda environments
envmodules (mapping) – mapping of names to environment module file lists; valid names are ‘fastqc’, ‘fastq_screen’, ‘fastq_strand’, ‘cellranger’, ‘report_qc’
default_runner (JobRunner) – optional default job runner to use
shorten_zip_paths (bool) – if True then rewrite the file paths in the ZIP QC reports to shorten them for systems without long filename support
legacy_screens (bool) – if True then use ‘legacy’ naming convention for FastqScreen outputs
verbose (bool) – if True then report additional information for diagnostics
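The difference between `batch_size` and `batch_limit` can be illustrated with a small sketch. It assumes that `batch_limit` caps the number of batches, so the batch size is derived from the number of commands; that reading of the parameter description, and the helper `make_batches` itself, are assumptions rather than the pipeline's actual code.

```python
import math


def make_batches(commands, batch_size=None, batch_limit=None):
    """Group commands into batches.

    Hypothetical helper: 'batch_size' gives fixed-size batches, while
    'batch_limit' (assumed here to cap the number of batches) sets the
    batch size dynamically from the number of commands.
    """
    commands = list(commands)
    if not commands:
        return []
    if batch_size is None:
        if batch_limit is None:
            # Default behaviour: one command per job
            batch_size = 1
        else:
            # Smallest batch size that keeps the batch count in limit
            batch_size = math.ceil(len(commands) / batch_limit)
    return [commands[i:i + batch_size]
            for i in range(0, len(commands), batch_size)]


cmds = ["cmd{}".format(i) for i in range(7)]
fixed = make_batches(cmds, batch_size=3)
limited = make_batches(cmds, batch_limit=2)
```

With seven commands, `batch_size=3` gives batches of 3, 3 and 1 commands, while `batch_limit=2` gives two batches of 4 and 3 commands.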
- set_default_log_dir(log_dir)
Set the default log directory for tasks
- class auto_process_ngs.qc.pipeline.ReportQC(_name, *args, **kws)
Generate the QC report
- init(project, qc_dir, report_html=None, fastq_dir=None, force=False, zip_outputs=True, shorten_zip_paths=False)
Initialise the ReportQC task.
- Parameters:
project (AnalysisProject) – project to generate QC report for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
report_html (str) – set the name of the output HTML file for the report
fastq_dir (str) – directory holding Fastq files (defaults to current fastq_dir in project)
force (bool) – if True then force HTML report to be generated even if QC outputs fail verification (default: don’t write report)
zip_outputs (bool) – if True then also generate a ZIP archive of the QC reports
shorten_zip_paths (bool) – if True then rewrite file paths in the ZIP archive to short versions (default: don’t shorten paths)
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.pipeline.RunMultiQC(_name, *args, **kws)
Run MultiQC
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(project, qc_dir, fastq_dir=None)
Initialise the RunMultiQC task.
- Parameters:
project (AnalysisProject) – project to generate QC report for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
fastq_dir (str) – directory holding Fastq files (defaults to current fastq_dir in project)
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.pipeline.Set10xCellCount(_name, *args, **kws)
Update the number of cells in the project metadata from ‘cellranger* count’ or ‘cellranger multi’ output
- init(project, qc_dir=None, tenx_pipeline='cellranger', source='count')
Initialise the Set10xCellCount task.
- Parameters:
project (AnalysisProject) – project to update the number of cells for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
tenx_pipeline (str) – one of ‘cellranger’ (the default), ‘cellranger-atac’ or ‘cellranger-arc’
source (str) – either ‘count’ (the default) or ‘multi’
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
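As a rough illustration of where a cell count can come from, the sketch below reads the "Estimated Number of Cells" column from the text of a ‘cellranger count’ metrics summary CSV. Both the column name and the use of this file are assumptions made for the example: other 10x pipelines name their metrics differently, and this reference does not state which files the task actually reads. The function `estimated_cells` is hypothetical.

```python
import csv
import io


def estimated_cells(metrics_csv_text,
                    column="Estimated Number of Cells"):
    """Extract the cell count from metrics summary CSV text.

    Hypothetical helper; the column name is an assumption based on
    'cellranger count' output.
    """
    rows = list(csv.DictReader(io.StringIO(metrics_csv_text)))
    if not rows:
        return None
    value = rows[0].get(column)
    if value is None:
        return None
    # Values are written with thousands separators, e.g. "1,234"
    return int(value.replace(",", ""))


example = ('Estimated Number of Cells,Mean Reads per Cell\n'
           '"1,234","56,789"\n')
n_cells = estimated_cells(example)
```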
- class auto_process_ngs.qc.pipeline.SetupQCDirs(_name, *args, **kws)
Set up the directories for the QC run
- init(project, qc_dir, log_dir=None, protocol=None)
Initialise the SetupQCDirs task
- Parameters:
project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
log_dir (str) – directory for log files (defaults to ‘logs’ subdirectory of the QC directory)
protocol (QCProtocol) – QC protocol being used
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.pipeline.SplitFastqsByLane(_name, *args, **kws)
Split reads into multiple Fastqs according to lane
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(project, out_dir)
Initialise the SplitFastqsByLane task
- Parameters:
project (AnalysisProject) – project with source Fastqs to split by lane
out_dir (str) – path to directory where split Fastqs will be written
- Outputs:
fastqs (list): list of paths to output Fastqs split by lanes
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
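Splitting by lane relies on the lane number recorded in each read header. The sketch below groups Fastq records by lane, assuming Illumina-style headers of the form `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...`, where the lane is the fourth colon-separated field. The function `split_reads_by_lane` is hypothetical and works on in-memory lines rather than files; it is not the task's implementation.

```python
from collections import defaultdict


def split_reads_by_lane(lines):
    """Group four-line Fastq records by lane number.

    Hypothetical helper; assumes Illumina-style read headers.
    """
    lines = [line.rstrip("\n") for line in lines]
    by_lane = defaultdict(list)
    for i in range(0, len(lines), 4):
        record = tuple(lines[i:i + 4])
        if len(record) < 4:
            raise ValueError("Truncated Fastq record")
        # Drop the part after the first space, then take field 4
        read_id = record[0].split(" ")[0]
        lane = int(read_id.split(":")[3])
        by_lane[lane].append(record)
    return dict(by_lane)


fastq_lines = [
    "@INSTR:1:FLOWCELL:1:1101:1000:2000 1:N:0:ACGT", "ACGT", "+", "IIII",
    "@INSTR:1:FLOWCELL:2:1101:1000:2001 1:N:0:ACGT", "TTGA", "+", "IIII",
    "@INSTR:1:FLOWCELL:1:1101:1000:2002 1:N:0:ACGT", "GGCC", "+", "IIII",
]
lanes = split_reads_by_lane(fastq_lines)
```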
- class auto_process_ngs.qc.pipeline.UpdateQCMetadata(_name, *args, **kws)
Update the metadata stored for this QC run
- init(project, qc_dir, metadata, legacy_screens=False)
Initialise the UpdateQCMetadata task
- Parameters:
project (AnalysisProject) – project to run QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
metadata (dict) – mapping of metadata items to values
legacy_screens (bool) – if True then ‘legacy’ naming convention was used for FastqScreen outputs
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
- class auto_process_ngs.qc.pipeline.VerifyFastqs(_name, *args, **kws)
Check Fastqs are valid
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(project, fastq_attrs=None)
Initialise the VerifyFastqs task
- Parameters:
project (AnalysisProject) – project with Fastqs to check
fastq_attrs (BaseFastqAttrs) – class to use for extracting data from Fastq names
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass
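The specific checks performed by the task are not listed here. As a sketch of what a basic structural check on a Fastq can look like, the hypothetical function below confirms that the lines form complete four-line records, that header and separator lines start with `@` and `+`, and that sequence and quality strings have equal length.

```python
def fastq_is_valid(lines):
    """Perform basic structural checks on Fastq lines.

    Hypothetical helper; not the checks used by VerifyFastqs.
    """
    lines = [line.rstrip("\r\n") for line in lines]
    if len(lines) % 4 != 0:
        # Incomplete final record, e.g. a truncated file
        return False
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        if len(seq) != len(qual):
            return False
    return True


good = ["@read1", "ACGT", "+", "IIII"]
truncated = ["@read1", "ACGT", "+"]
mismatched = ["@read1", "ACGT", "+", "III"]
results = [fastq_is_valid(x) for x in (good, truncated, mismatched)]
```

Note that this sketch treats an empty input as valid, which a real verification step may prefer to reject.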
- class auto_process_ngs.qc.pipeline.VerifyQC(_name, *args, **kws)
Verify outputs from the QC pipeline
- finish()
Perform actions on task completion
Performs any actions that are required on completion of the task, such as moving or copying data, and setting the values of any output parameters.
Must be implemented by the subclass
- init(project, qc_dir, protocol, fastqs)
Initialise the VerifyQC task.
- Parameters:
project (AnalysisProject) – project to verify QC for
qc_dir (str) – directory for QC outputs (defaults to subdirectory ‘qc’ of project directory)
protocol (QCProtocol) – QC protocol to verify against
fastqs (list) – Fastqs to include in the verification
- setup()
Set up commands to be performed by the task
Must be implemented by the subclass