`auto_process_ngs.qc.utils`

Provides utility classes and functions for analysis project QC.

Provides the following functions:

verify_qc: verify the QC run for a project
report_qc: generate report for the QC run for a project
get_bam_basename: return the BAM file basename from a Fastq filename
get_bam_samplename: return the sample name from a BAM filename
get_seq_data_samples: identify samples with biological (sequencing) data
filter_fastqs: filter list of Fastqs based on read IDs
set_cell_count_for_project: sets total number of cells for a project
read_versions_file: extract software info from ‘versions’ file

auto_process_ngs.qc.utils.filter_fastqs(reads, fastqs, fastq_attrs=<class 'auto_process_ngs.analysis.AnalysisFastq'>)

Filter list of Fastqs and return names matching reads

Parameters:

reads (list) – list of reads to filter (‘r1’, ‘i2’ etc: ‘*’ matches all reads, ‘r*’ matches all data reads, ‘i*’ matches all index reads)
fastqs (list) – list of Fastq files or names to filter
fastq_attrs (BaseFastqAttrs) – class for extracting attribute data from Fastq names

Returns:

matching Fastq names (i.e. no leading path or trailing extensions)

Return type:

List
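
The read patterns described above can be illustrated with a minimal sketch. This is not the library's implementation: it assumes Illumina-style Fastq names (e.g. ‘SM1_S1_L001_R1_001.fastq.gz’) and uses a hypothetical regular expression in place of the ‘fastq_attrs’ class; the function and helper names are invented for illustration.

```python
import fnmatch
import os
import re

# Assumption: the read ID is a field such as _R1, _R2 or _I1, optionally
# followed by a numeric set field (e.g. _001) at the end of the name
READ_ID = re.compile(r"_([RI][0-9])(?:_[0-9]+)?$", re.IGNORECASE)


def strip_path_and_extensions(fastq):
    """Return the Fastq name without leading path or trailing extensions."""
    name = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    return name


def filter_fastqs_sketch(reads, fastqs):
    """Hypothetical stand-in for filter_fastqs (input order is preserved)."""
    matched = []
    for fq in fastqs:
        name = strip_path_and_extensions(fq)
        m = READ_ID.search(name)
        if m is None:
            continue  # no recognisable read ID in this name
        read_id = m.group(1).lower()
        # 'r1' matches exactly; '*', 'r*' and 'i*' act as wildcards
        if any(fnmatch.fnmatchcase(read_id, r.lower()) for r in reads):
            if name not in matched:
                matched.append(name)
    return matched


fastqs = [
    "/data/SM1_S1_L001_R1_001.fastq.gz",
    "/data/SM1_S1_L001_R2_001.fastq.gz",
    "/data/SM1_S1_L001_I1_001.fastq.gz",
]
print(filter_fastqs_sketch(["r*"], fastqs))
```

With the real function the same call would be expected to take the form filter_fastqs(['r*'], fastqs), with the name parsing delegated to the ‘fastq_attrs’ class.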

auto_process_ngs.qc.utils.get_bam_basename(fastq, fastq_attrs=None)

Return basename for BAM file from Fastq filename

Uses the ‘bam_basename’ method of the supplied ‘BaseFastqAttrs’ subclass (defaults to ‘AnalysisFastq’ if none is specified) to generate the BAM file basename.

Typically the BAM basename will be the Fastq basename with the read ID removed, for example the Fastq filename ‘SM1_S1_L001_R1_001.fastq.gz’ will result in the BAM basename of ‘SM1_S1_L001_001’.

Parameters:

fastq (str) – Fastq filename; can include leading path and extensions (both will be ignored)
fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names (defaults to ‘AnalysisFastq’)

Returns:

basename for BAM file.

Return type:

String
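
The typical transformation described above (Fastq basename with the read ID removed) can be sketched as follows. This is an illustration only, assuming Illumina-style names; the real function delegates to the ‘bam_basename’ method of the ‘fastq_attrs’ class, and the function name and regular expression here are hypothetical.

```python
import os
import re


def bam_basename_sketch(fastq):
    """Hypothetical stand-in: drop path, extensions and read ID."""
    name = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    # Assumption: the read ID is a field such as _R1 or _I1 that is either
    # last in the name or followed only by a numeric set field (e.g. _001)
    return re.sub(r"_[RI][0-9](?=_[0-9]+$|$)", "", name)


print(bam_basename_sketch("/path/to/SM1_S1_L001_R1_001.fastq.gz"))
```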

auto_process_ngs.qc.utils.get_bam_samplename(bam, fastq_attrs=None)

Return sample name extracted from BAM filename

Parameters:

bam (str) – BAM filename
fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names

Returns:

sample name extracted from BAM filename.

Return type:

String

auto_process_ngs.qc.utils.get_seq_data_samples(project_dir, fastq_attrs=None)

Identify samples with biological (sequencing) data

If a list of samples is explicitly supplied in the project metadata (via the ‘biological_samples’ item) then this list is returned; otherwise if configuration files are identified for ‘cellranger multi’ then the list of biological samples is taken from these files.

If neither of these is found (or if the ‘cellranger multi’ config files don’t define any biological samples) then the default is to return the names of all the samples in the project.

Parameters:

project_dir (str) – path to the project directory
fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names (defaults to ‘AnalysisFastq’)

Returns:

list with subset of samples with biological data

Return type:

List
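
The order of precedence described above can be summarised in a short sketch. The function and its arguments are hypothetical: the real function takes a project directory and locates the metadata and ‘cellranger multi’ configuration files itself, whereas here the already-extracted sample lists are passed in directly.

```python
def select_biological_samples(all_samples,
                              metadata_samples=None,
                              multi_config_samples=None):
    """Hypothetical illustration of the documented precedence."""
    # 1. Samples explicitly listed in the project metadata take priority
    if metadata_samples:
        return list(metadata_samples)
    # 2. Otherwise use samples defined in 'cellranger multi' config files
    if multi_config_samples:
        return list(multi_config_samples)
    # 3. Otherwise default to all the samples in the project
    return list(all_samples)


print(select_biological_samples(["SM1", "SM2", "SM3"],
                                multi_config_samples=["SM1", "SM2"]))
```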

auto_process_ngs.qc.utils.read_versions_file(f, pkgs=None)

Extract software info from ‘versions’ file

‘versions’ files (typically named _versions) should consist of one or more lines of text, with each line comprising a software package name and a version number, separated by a tab character.

Returns a dictionary where package names are keys, and the corresponding values are lists of versions.

If an existing dictionary is supplied via the ‘pkgs’ argument then any package information is added to this dictionary; otherwise an empty dictionary is created and populated.

Parameters:

f (str) – path to ‘versions’ file
pkgs (dict) – optional, dictionary to extend with information from ‘versions’ file
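
The file format and the effect of the ‘pkgs’ argument can be illustrated with a minimal sketch. This is not the library's code: the function name is hypothetical, and the handling of blank or malformed lines and of repeated versions are assumptions that the description above does not specify.

```python
import os
import tempfile


def read_versions_sketch(path, pkgs=None):
    """Hypothetical stand-in: parse tab-separated 'package<TAB>version' lines."""
    if pkgs is None:
        pkgs = {}
    with open(path, "rt") as fp:
        for line in fp:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # assumption: blank lines are ignored
            try:
                pkg, version = line.split("\t", 1)
            except ValueError:
                continue  # assumption: lines without a tab are skipped
            versions = pkgs.setdefault(pkg, [])
            version = version.strip()
            if version not in versions:  # assumption: no duplicate versions
                versions.append(version)
    return pkgs


# Read one file, then extend the same dictionary from a second file
with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "_versions")
    second = os.path.join(d, "_versions.2")
    with open(first, "wt") as fp:
        fp.write("fastqc\t0.11.9\nfastq_screen\t0.14.0\n")
    with open(second, "wt") as fp:
        fp.write("fastqc\t0.12.1\n")
    pkgs = read_versions_sketch(first)
    pkgs = read_versions_sketch(second, pkgs=pkgs)
    print(pkgs)
```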

auto_process_ngs.qc.utils.report_qc(project, qc_dir=None, fastq_dir=None, qc_protocol=None, report_html=None, zip_outputs=True, multiqc=False, out_dir=None, force=False, runner=None, log_dir=None, suppress_warning=False)

Generate report for the QC run for a project

Parameters:

project (AnalysisProject) – analysis project to report the QC for
qc_dir (str) – optional, specify the subdir with the QC outputs being reported
fastq_dir (str) – optional, specify a non-default directory with Fastq files being verified
qc_protocol (str) – optional, QC protocol to verify against
report_html (str) – optional, path to the name of the output QC report
zip_outputs (bool) – if True then also generate ZIP archive with the report and QC outputs
multiqc (bool) – if True then also generate MultiQC report
out_dir (str) – optional, path to the output directory to write the reports to (defaults to the project directory, ignored if report HTML file name is explicitly provided)
force (bool) – if True then force generation of QC report even if verification fails
runner (JobRunner) – optional, job runner to use for running the reporting
log_dir (str) – optional, specify a directory to write logs to
suppress_warning (bool) – if True then don’t show the warning message even when there are missing metrics (default: show the warning if there are missing metrics)

Returns:

exit code from reporting job (zero indicates success, non-zero indicates a problem).

Return type:

Integer

auto_process_ngs.qc.utils.set_cell_count_for_project(project_dir, qc_dir=None, tenx_pipeline='cellranger', source='count')

Set the total number of cells for a project

Depending on the specified ‘source’, sums the number of cells for each sample in a project as determined from either ‘cellranger* count’ or ‘cellranger multi’.

Depending on the 10x Genomics package and analysis type, the cell count for individual samples is extracted from the ‘metrics_summary.csv’ file for scRNA-seq (i.e. ‘cellranger count’ or ‘cellranger multi’), or from the ‘summary.csv’ file for scATAC (i.e. ‘cellranger-atac count’).

The final count is written to the ‘number_of_cells’ metadata item for the project.

Parameters:

project_dir (str) – path to the project directory
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
tenx_pipeline (str) – specify the 10x Genomics package, one of: ‘cellranger’, ‘cellranger-atac’, ‘cellranger-arc’ (default is ‘cellranger’)
source (str) – either ‘count’ or ‘multi’ (default is ‘count’)

Returns:

exit code; non-zero values indicate that problems were encountered.

Return type:

Integer
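
The per-sample summation described above can be sketched as follows. This is an illustration rather than the library's implementation: the helper names are hypothetical, the CSV content is passed in as text instead of being located under the QC directory, and the column name ‘Estimated Number of Cells’ is an assumption about the ‘metrics_summary.csv’ layout (a different column may be needed for other pipelines or versions).

```python
import csv
import io


def cells_from_summary(csv_text, column="Estimated Number of Cells"):
    """Hypothetical helper: read the cell count from one summary CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    row = next(reader)  # assumption: a header line plus one data line
    # Counts may be formatted with thousands separators, e.g. "1,234"
    return int(row[column].replace(",", ""))


def total_cells(summaries):
    """Sum the cell counts over a mapping of sample name to CSV text."""
    return sum(cells_from_summary(text) for text in summaries.values())


summaries = {
    "SM1": 'Estimated Number of Cells,Mean Reads per Cell\n"1,234","50,000"\n',
    "SM2": 'Estimated Number of Cells,Mean Reads per Cell\n766,"48,000"\n',
}
print(total_cells(summaries))
```

In the real function the resulting total is what would be written to the ‘number_of_cells’ metadata item.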

auto_process_ngs.qc.utils.verify_qc(project, qc_dir=None, fastq_dir=None, qc_protocol=None, runner=None, log_dir=None)

Verify the QC run for a project

Parameters:

project (AnalysisProject) – analysis project to verify the QC for
qc_dir (str) – optional, specify the subdir with the QC outputs being verified
fastq_dir (str) – optional, specify a non-default directory with Fastq files being verified
qc_protocol (str) – optional, QC protocol to verify against
runner (JobRunner) – optional, job runner to use for running the verification
log_dir (str) – optional, specify a directory to write logs to

Returns:

True if QC passes verification, otherwise False.

Return type:

Boolean
