auto_process_ngs.qc.verification
Utilities for verifying QC pipeline outputs.
Provides the following classes:
QCVerifier: enables verification of QC outputs against protocols
Provides the following functions:
parse_qc_module_spec: process QC module specification string
filter_fastqs: filter list of Fastqs based on read IDs
filter_10x_pipelines: filter list of 10xGenomics pipeline tuples
verify_project: check the QC outputs for a project
- class auto_process_ngs.qc.verification.QCVerifier(qc_dir, fastq_attrs=None)
Class to perform verification of QC outputs
The QCVerifier enables the QC outputs from a directory to be checked against arbitrary QC protocols via its verify method. For example:
>>> project = AnalysisProject("/data/projects/PJB")
>>> verifier = QCVerifier(project.qc_dir)
>>> verifier.verify(project.fastqs,"standardPE")
True
- Parameters:
qc_dir (str) – path to directory to examine
fastq_attrs (BaseFastqAttrs) – (optional) class for extracting data from Fastq names
- filter_fastqs(reads, fastqs)
Filter list of Fastqs and return names matching reads
Wrap external ‘filter_fastqs’ function
- identify_seq_data(samples)
Identify samples with sequence (biological) data
- Parameters:
samples (list) – list of all sample names
- Returns:
subset of sample names with sequence data.
- Return type:
list
- verify(protocol, fastqs, organism=None, fastq_screens=None, star_index=None, annotation_bed=None, annotation_gtf=None, cellranger_version=None, cellranger_refdata=None, cellranger_use_multi_config=None, seq_data_samples=None)
Verify QC outputs for Fastqs against specified protocol
- Parameters:
protocol (QCProtocol) – QC protocol to verify against
fastqs (list) – list of Fastqs to verify outputs for
organism (str) – organism associated with outputs
fastq_screens (list) – list of panel names to verify FastqScreen outputs against
star_index (str) – path to STAR index
annotation_bed (str) – path to BED annotation file
annotation_gtf (str) – path to GTF annotation file
cellranger_version (str) – specific version of 10x package to check for
cellranger_refdata (str) – specific 10x reference dataset to check for
cellranger_use_multi_config (bool) – if True then cellranger count verification will attempt to use data (GEX samples and reference dataset) from the ‘10x_multi_config.csv’ file
seq_data_samples (list) – list of sample names with sequence (i.e. biological) data
- Returns:
True if all expected outputs are present, False otherwise.
- Return type:
Boolean
- verify_10x_pipeline(name, pipeline, samples)
Internal: check for and verify outputs for 10x package
- Parameters:
name (str) – name the QC data is stored under
pipeline (tuple) – tuple specifying pipeline(s) to verify
samples (list) – list of sample names to verify
- Returns:
True if at least one set of valid outputs exists for the specified pipeline and sample list, False otherwise.
- Return type:
Boolean
- verify_qc_module(name, fastqs=None, samples=None, seq_data_fastqs=None, seq_data_samples=None, seq_data_reads=None, qc_reads=None, organism=None, fastq_screens=None, star_index=None, annotation_bed=None, annotation_gtf=None, cellranger_version=None, cellranger_refdata=None, cellranger_use_multi_config=None, **extra_params)
Verify QC outputs for specific QC module
- Parameters:
name (str) – QC module name
fastqs (list) – list of Fastqs
samples (list) – list of sample names
seq_data_fastqs (list) – list of Fastqs with sequence (i.e. biological) data
seq_data_samples (list) – list of sample names with sequence (i.e. biological) data
seq_data_reads (list) – list of reads containing sequence data
qc_reads (list) – list of reads to perform general QC on
organism (str) – organism associated with outputs
fastq_screens (list) – list of panel names to verify FastqScreen outputs against
star_index (str) – path to STAR index
annotation_bed (str) – path to BED annotation file
annotation_gtf (str) – path to GTF annotation file
cellranger_version (str) – specific version of 10x package to check for
cellranger_refdata (str) – specific 10x reference dataset to check for
cellranger_use_multi_config (bool) – if True then cellranger count verification will attempt to use data (GEX samples and reference dataset) from the ‘10x_multi_config.csv’ file
extra_params (mapping) – any additional parameters not required for verification
- Returns:
True if all outputs are present, False if one or more are missing.
- Return type:
Boolean
- Raises:
Exception – if the specified QC module name is not recognised.
- auto_process_ngs.qc.verification.filter_10x_pipelines(p, pipelines)
Filter list of 10x pipelines
Pipelines are described using tuples of the form:
(NAME,VERSION,REFERENCE)
for example:
('cellranger','6.1.2','refdata-gex-2020')
Only pipelines matching the specified name, version and reference data will be included in the returned list.
Where the supplied version or reference dataset name is either None or ‘*’, it will match any version or reference dataset respectively.
- Parameters:
p (tuple) – tuple specifying pipeline(s) to match against
pipelines (list) – list of pipeline tuples to filter
- Returns:
list of matching 10x pipeline tuples.
- Return type:
list
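The matching rule described above can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the library's own code: the function name filter_pipelines_sketch and all example pipeline tuples other than the one quoted above are made up for the illustration.

```python
def filter_pipelines_sketch(p, pipelines):
    """Hypothetical sketch of the (NAME,VERSION,REFERENCE) matching rule."""
    name, version, reference = p
    matches = []
    for pipeline in pipelines:
        # The pipeline name must always match exactly
        if pipeline[0] != name:
            continue
        # None or '*' acts as a wildcard for the version
        if version not in (None, "*") and pipeline[1] != version:
            continue
        # None or '*' acts as a wildcard for the reference dataset
        if reference not in (None, "*") and pipeline[2] != reference:
            continue
        matches.append(pipeline)
    return matches


# Example data: only the first tuple comes from the text above,
# the others are invented for illustration
pipelines = [
    ("cellranger", "6.1.2", "refdata-gex-2020"),
    ("cellranger", "7.0.0", "refdata-gex-2020"),
    ("cellranger-arc", "2.0.0", "refdata-arc-2020"),
]

# Any version of 'cellranger' run against one reference dataset
any_version = filter_pipelines_sketch(
    ("cellranger", "*", "refdata-gex-2020"), pipelines)

# One specific version, any reference dataset
exact_version = filter_pipelines_sketch(
    ("cellranger", "6.1.2", None), pipelines)
```

In this sketch the input order is preserved in the returned list; whether the real function guarantees any ordering is not stated here.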
- auto_process_ngs.qc.verification.filter_fastqs(reads, fastqs, fastq_attrs=<class 'auto_process_ngs.analysis.AnalysisFastq'>)
Filter list of Fastqs and return names matching reads
- Parameters:
reads (list) – list of reads to filter (‘r1’, ‘i2’ etc.; ‘*’ matches all reads, ‘r*’ matches all data reads, ‘i*’ matches all index reads)
fastqs (list) – list of Fastq files or names to filter
fastq_attrs (BaseFastqAttrs) – class for extracting attribute data from Fastq names
- Returns:
matching Fastq names (i.e. no leading path or trailing extensions)
- Return type:
list
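As a rough illustration of the read patterns, the following standalone sketch mimics the documented behaviour without using the library. It assumes Illumina-style Fastq names in which the read appears as _R1_, _I1_ and so on before a trailing numeric field; the real function delegates name parsing to the fastq_attrs class, so the regular expression, the helper strip_fastq_name and the function filter_fastqs_sketch are all hypothetical.

```python
import os
import re

# Assumption: read identifier appears as _R1_/_I1_ before a trailing number
READ_PATTERN = re.compile(r"_([RI])(\d)_\d+$")


def strip_fastq_name(fastq):
    """Drop the leading path and trailing .fastq/.fq(.gz) extensions."""
    name = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    return name


def filter_fastqs_sketch(reads, fastqs):
    """Hypothetical sketch: return Fastq names matching any read pattern."""
    matched = []
    for fastq in fastqs:
        name = strip_fastq_name(fastq)
        found = READ_PATTERN.search(name)
        if found is None:
            continue  # name does not follow the assumed convention
        read_type = found.group(1).lower()   # 'r' or 'i'
        number = found.group(2)              # '1', '2', ...
        for read in reads:
            # '*' matches everything, 'r*'/'i*' match a read type,
            # 'r1'/'i2' etc match one specific read
            if read in ("*", read_type + "*", read_type + number):
                matched.append(name)
                break
    return matched


# Made-up file names for illustration
fastqs = [
    "/data/PJB1_S1_R1_001.fastq.gz",
    "/data/PJB1_S1_R2_001.fastq.gz",
    "/data/PJB1_S1_I1_001.fastq.gz",
]
r1_only = filter_fastqs_sketch(["r1"], fastqs)
data_reads = filter_fastqs_sketch(["r*"], fastqs)
index_reads = filter_fastqs_sketch(["i*"], fastqs)
```

The sketch keeps names in input order; no claim is made about the ordering used by the real function.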
- auto_process_ngs.qc.verification.parse_qc_module_spec(module_spec)
Parse QC module spec into name and parameters
Parse a QC module specification of the form
NAME
or
NAME(KEY=VALUE;...)
and return the module name and any additional parameters in the form of a dictionary. For example:
>>> parse_qc_module_spec('NAME')
('NAME', {})
>>> parse_qc_module_spec('NAME(K1=V1;K2=V2)')
('NAME', { 'K1':'V1', 'K2':'V2' })
By default values are returned as strings (with surrounding single or double quotes removed); however basic type conversion is also applied to certain values:
True/true and False/false are returned as the appropriate boolean value
- Parameters:
module_spec (str) – QC module specification
- Returns:
tuple of the form (name,params) where ‘name’ is the QC module name and ‘params’ is a dictionary with the extracted key-value pairs.
- Return type:
Tuple
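The specification format and the type conversions described above can be sketched as a small standalone parser. This is a minimal illustration under assumptions, not the library's implementation: the name parse_spec_sketch is hypothetical, and the choice to leave a quoted 'true' as a string is a guess at a reasonable behaviour rather than something documented here.

```python
def parse_spec_sketch(module_spec):
    """Hypothetical sketch: split 'NAME(KEY=VALUE;...)' into (name, params)."""
    name, params = module_spec, {}
    if module_spec.endswith(")") and "(" in module_spec:
        # Separate the module name from the parenthesised parameter list
        name, _, body = module_spec[:-1].partition("(")
        for item in body.split(";"):
            if not item:
                continue  # tolerate empty items, e.g. a trailing ';'
            key, _, value = item.partition("=")
            value = value.strip()
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                # Remove surrounding single or double quotes
                # (assumption: quoted values are never type-converted)
                value = value[1:-1]
            elif value in ("True", "true"):
                value = True
            elif value in ("False", "false"):
                value = False
            params[key.strip()] = value
    return name, params


plain = parse_spec_sketch("NAME")
with_params = parse_spec_sketch("NAME(K1=V1;K2=V2)")
converted = parse_spec_sketch("NAME(flag=true;label='x')")
```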
- auto_process_ngs.qc.verification.verify_project(project, qc_dir=None, qc_protocol=None, fastqs=None)
Check the QC outputs are correct for a project
- Parameters:
project (AnalysisProject) – project to verify QC for
qc_dir (str) – path to the QC output dir; relative path will be treated as a subdirectory of the project being checked.
qc_protocol (str) – QC protocol name or specification to verify against (optional)
fastqs (list) – list of Fastqs to include (optional, defaults to Fastqs in the project)
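The handling of a relative qc_dir described above might look like the following sketch. It only illustrates the path rule (relative paths are taken as subdirectories of the project); the function name resolve_qc_dir_sketch and the project_dir argument are assumptions for the example, and what the real function does when qc_dir is None is not covered.

```python
import os


def resolve_qc_dir_sketch(project_dir, qc_dir):
    """Hypothetical sketch: resolve qc_dir against the project directory."""
    if os.path.isabs(qc_dir):
        # Absolute paths are used as given
        return qc_dir
    # Relative paths are treated as subdirectories of the project
    return os.path.join(project_dir, qc_dir)


# Made-up paths for illustration
project_dir = os.path.abspath(os.path.join("projects", "PJB"))
relative = resolve_qc_dir_sketch(project_dir, "qc_alt")
absolute = resolve_qc_dir_sketch(project_dir, os.path.abspath("other_qc"))
```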