auto_process_ngs.qc.verification

Utilities for verifying QC pipeline outputs.

Provides the following classes:

  • QCVerifier: enables verification of QC outputs against protocols

Provides the following functions:

  • parse_qc_module_spec: process QC module specification string

  • filter_fastqs: filter list of Fastqs based on read IDs

  • filter_10x_pipelines: filter list of 10xGenomics pipeline tuples

  • verify_project: check the QC outputs for a project

class auto_process_ngs.qc.verification.QCVerifier(qc_dir, fastq_attrs=None)

Class to perform verification of QC outputs

The QCVerifier enables the QC outputs from a directory to be checked against arbitrary QC protocols via its verify method.

For example:

>>> project = AnalysisProject("/data/projects/PJB")
>>> verifier = QCVerifier(project.qc_dir)
>>> verifier.verify("standardPE",project.fastqs)
True

Parameters:
  • qc_dir (str) – path to directory to examine

  • fastq_attrs (BaseFastqAttrs) – (optional) class for extracting data from Fastq names

filter_fastqs(reads, fastqs)

Filter list of Fastqs and return names matching reads

Wraps the module-level ‘filter_fastqs’ function.

identify_seq_data(samples)

Identify samples with sequence (biological) data

Parameters:

samples (list) – list of all sample names

Returns:

subset of sample names with sequence data.

Return type:

List

verify(protocol, fastqs, organism=None, fastq_screens=None, star_index=None, annotation_bed=None, annotation_gtf=None, cellranger_version=None, cellranger_refdata=None, cellranger_use_multi_config=None, seq_data_samples=None)

Verify QC outputs for Fastqs against specified protocol

Parameters:
  • protocol (QCProtocol) – QC protocol to verify against

  • fastqs (list) – list of Fastqs to verify outputs for

  • organism (str) – organism associated with outputs

  • fastq_screens (list) – list of panel names to verify FastqScreen outputs against

  • star_index (str) – path to STAR index

  • annotation_bed (str) – path to BED annotation file

  • annotation_gtf (str) – path to GTF annotation file

  • cellranger_version (str) – specific version of 10x package to check for

  • cellranger_refdata (str) – specific 10x reference dataset to check for

  • cellranger_use_multi_config (bool) – if True then cellranger count verification will attempt to use data (GEX samples and reference dataset) from the ‘10x_multi_config.csv’ file

  • seq_data_samples (list) – list of sample names with sequence (i.e. biological) data

Returns:

True if all expected outputs are present, False otherwise.

Return type:

Boolean

verify_10x_pipeline(name, pipeline, samples)

Internal: check for and verify outputs for 10x package

Parameters:
  • name (str) – name the QC data is stored under

  • pipeline (tuple) – tuple specifying pipeline(s) to verify

  • samples (list) – list of sample names to verify

Returns:

True if at least one set of valid outputs exists for the specified pipeline and sample list, False otherwise.

Return type:

Boolean

verify_qc_module(name, fastqs=None, samples=None, seq_data_fastqs=None, seq_data_samples=None, seq_data_reads=None, qc_reads=None, organism=None, fastq_screens=None, star_index=None, annotation_bed=None, annotation_gtf=None, cellranger_version=None, cellranger_refdata=None, cellranger_use_multi_config=None, **extra_params)

Verify QC outputs for specific QC module

Parameters:
  • name (str) – QC module name

  • fastqs (list) – list of Fastqs

  • samples (list) – list of sample names

  • seq_data_fastqs (list) – list of Fastqs with sequence (i.e. biological) data

  • seq_data_samples (list) – list of sample names with sequence (i.e. biological) data

  • seq_data_reads (list) – list of reads containing sequence data

  • qc_reads (list) – list of reads to perform general QC on

  • organism (str) – organism associated with outputs

  • fastq_screens (list) – list of panel names to verify FastqScreen outputs against

  • star_index (str) – path to STAR index

  • annotation_bed (str) – path to BED annotation file

  • annotation_gtf (str) – path to GTF annotation file

  • cellranger_version (str) – specific version of 10x package to check for

  • cellranger_refdata (str) – specific 10x reference dataset to check for

  • cellranger_use_multi_config (bool) – if True then cellranger count verification will attempt to use data (GEX samples and reference dataset) from the ‘10x_multi_config.csv’ file

  • extra_params (mapping) – any additional parameters not required for verification

Returns:

True if all outputs are present, False if one or more are missing.

Return type:

Boolean

Raises:

Exception – if the specified QC module name is not recognised.

auto_process_ngs.qc.verification.filter_10x_pipelines(p, pipelines)

Filter list of 10x pipelines

Pipelines are described using tuples of the form:

(NAME,VERSION,REFERENCE)

for example:

('cellranger','6.1.2','refdata-gex-2020')

Only pipelines matching the specified name, version and reference data will be included in the returned list.

If the supplied version or reference dataset name is either None or ‘*’, it matches any version or any reference dataset, respectively.

Parameters:
  • p (tuple) – tuple specifying pipeline(s) to match against

  • pipelines (list) – list of pipeline tuples to filter

Returns:

list of matching 10x pipeline tuples.

Return type:

List
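The matching rules described above can be sketched in a few lines. The helper below, match_10x_pipelines, is a hypothetical stand-in written for illustration only and is not the library's implementation; it assumes plain string equality for the name, version and reference fields, and the sample pipeline tuples are made-up data.

```python
def match_10x_pipelines(p, pipelines):
    """Return the tuples in 'pipelines' that match the pattern tuple 'p'.

    Illustrative sketch only (hypothetical helper, not the library code).
    """
    name, version, reference = p
    matches = []
    for candidate in pipelines:
        c_name, c_version, c_reference = candidate
        # The pipeline name must always match exactly
        if c_name != name:
            continue
        # None or '*' acts as a wildcard for the version
        if version not in (None, "*") and c_version != version:
            continue
        # None or '*' acts as a wildcard for the reference dataset
        if reference not in (None, "*") and c_reference != reference:
            continue
        matches.append(candidate)
    return matches


# Made-up example data
pipelines = [
    ("cellranger", "6.1.2", "refdata-gex-2020"),
    ("cellranger", "7.0.0", "refdata-gex-2020"),
    ("cellranger-atac", "2.0.0", "refdata-atac-2020"),
]

# Wildcards for version and reference: both 'cellranger' entries match
any_version = match_10x_pipelines(("cellranger", "*", None), pipelines)

# Fully specified tuple: only the first entry matches
exact = match_10x_pipelines(
    ("cellranger", "6.1.2", "refdata-gex-2020"), pipelines)
```

Treating None and ‘*’ identically keeps the calling code simple: a caller that has no version or reference constraint can pass either value.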

auto_process_ngs.qc.verification.filter_fastqs(reads, fastqs, fastq_attrs=<class 'auto_process_ngs.analysis.AnalysisFastq'>)

Filter list of Fastqs and return names matching reads

Parameters:
  • reads (list) – list of reads to filter on (‘r1’, ‘i2’ etc.; ‘*’ matches all reads, ‘r*’ matches all data reads, ‘i*’ matches all index reads)

  • fastqs (list) – list of Fastq files or names to filter

  • fastq_attrs (BaseFastqAttrs) – class for extracting attribute data from Fastq names

Returns:

matching Fastq names (i.e. no leading path or trailing extensions).

Return type:

List
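To make the read patterns concrete, the following is a minimal standalone sketch of this kind of filtering. It is not the library's implementation: filter_fastq_names and the READ_ID regular expression are hypothetical, and the sketch assumes Illumina-style file names in which the read appears as ‘_R1’ or ‘_I1’ (optionally followed by a numeric suffix such as ‘_001’) instead of using a fastq_attrs class. Names without a recognisable read identifier are skipped, and the result is sorted; both are choices made for this sketch.

```python
import fnmatch
import os
import re

# Assumed naming convention: ..._R1_001 or ..._I2 at the end of the name
READ_ID = re.compile(r"_([RI]\d)(?:_\d+)?$")


def filter_fastq_names(reads, fastqs):
    """Return sorted Fastq names whose read matches one of 'reads'.

    Illustrative sketch only (hypothetical helper, not the library code).
    """
    patterns = [r.lower() for r in reads]
    matched = set()
    for fq in fastqs:
        # Drop the leading path and any .gz/.fastq/.fq extensions
        name = os.path.basename(fq)
        for ext in (".gz", ".fastq", ".fq"):
            if name.endswith(ext):
                name = name[:-len(ext)]
        found = READ_ID.search(name)
        if found is None:
            # No read identifier recognised: skip this name
            continue
        read = found.group(1).lower()  # e.g. 'r1' or 'i2'
        # '*', 'r*' and 'i*' are handled as shell-style wildcards
        if any(fnmatch.fnmatchcase(read, pat) for pat in patterns):
            matched.add(name)
    return sorted(matched)


# Made-up example file names
fastqs = [
    "/data/PJB1_S1_R1_001.fastq.gz",
    "/data/PJB1_S1_R2_001.fastq.gz",
    "/data/PJB1_S1_I1_001.fastq.gz",
]
data_reads = filter_fastq_names(["r*"], fastqs)
index_reads = filter_fastq_names(["i*"], fastqs)
```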

auto_process_ngs.qc.verification.parse_qc_module_spec(module_spec)

Parse QC module spec into name and parameters

Parse a QC module specification of the form NAME or NAME(KEY=VALUE;...) and return the module name and any additional parameters in the form of a dictionary.

For example:

>>> parse_qc_module_spec('NAME')
('NAME', {})
>>> parse_qc_module_spec('NAME(K1=V1;K2=V2)')
('NAME', { 'K1':'V1', 'K2':'V2' })

By default values are returned as strings (with surrounding single or double quotes removed); however basic type conversion is also applied to certain values:

  • True/true and False/false are returned as the appropriate boolean value

Parameters:

module_spec (str) – QC module specification

Returns:

tuple of the form (name,params) where 'name' is the QC module name and 'params' is a dictionary with the extracted key-value pairs.

Return type:

Tuple
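The parsing behaviour described above can be approximated with a short standalone function. This is a sketch under stated assumptions, not the library's code: parse_spec_sketch is a hypothetical name, it assumes well-formed input, and it assumes that a quoted value is always kept as a string (so that only unquoted True/true and False/false are converted to booleans).

```python
def parse_spec_sketch(module_spec):
    """Split 'NAME' or 'NAME(KEY=VALUE;...)' into (name, params).

    Illustrative sketch only (hypothetical helper, not the library code).
    """
    name, has_params, rest = module_spec.partition("(")
    params = {}
    if has_params:
        # Drop the closing bracket, if present
        if rest.endswith(")"):
            rest = rest[:-1]
        for item in rest.split(";"):
            if not item:
                continue
            key, _, value = item.partition("=")
            value = value.strip()
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                # Quoted value: strip the quotes and keep it as a string
                value = value[1:-1]
            elif value in ("True", "true"):
                value = True
            elif value in ("False", "false"):
                value = False
            params[key.strip()] = value
    return name, params


plain = parse_spec_sketch("NAME")
with_params = parse_spec_sketch("NAME(K1=V1;K2=V2)")
typed = parse_spec_sketch("NAME(flag=true;label='x y')")
```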

auto_process_ngs.qc.verification.verify_project(project, qc_dir=None, qc_protocol=None, fastqs=None)

Check the QC outputs are correct for a project

Parameters:
  • project (AnalysisProject) – project to verify QC for

  • qc_dir (str) – path to the QC output directory; a relative path is treated as a subdirectory of the project being checked.

  • qc_protocol (str) – QC protocol name or specification to verify against (optional)

  • fastqs (list) – list of Fastqs to include (optional, defaults to Fastqs in the project)
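The treatment of a relative qc_dir can be illustrated with a small sketch. The resolve_qc_dir helper and the example paths below are hypothetical and are not part of the library; the sketch only shows the rule stated for the qc_dir parameter and does not cover the case where qc_dir is None.

```python
import os


def resolve_qc_dir(project_dir, qc_dir):
    """Return the QC directory path for a project.

    Illustrative sketch only (hypothetical helper, not the library code).
    """
    if os.path.isabs(qc_dir):
        # Absolute paths are used unchanged
        return qc_dir
    # Relative paths are interpreted relative to the project directory
    return os.path.join(project_dir, qc_dir)


relative = resolve_qc_dir("/data/projects/PJB", "qc")
absolute = resolve_qc_dir("/data/projects/PJB", "/scratch/qc")
```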