auto_process_ngs.qc.utils

Provides utility classes and functions for analysis project QC.

Provides the following functions:

  • verify_qc: verify the QC run for a project

  • report_qc: generate report for the QC run for a project

  • get_bam_basename: return the BAM file basename from a Fastq filename

  • get_bam_samplename: return the sample name from a BAM filename

  • get_seq_data_samples: identify samples with biological (sequencing) data

  • filter_fastqs: filter list of Fastqs based on read IDs

  • set_cell_count_for_project: sets total number of cells for a project

  • read_versions_file: extract software info from ‘versions’ file

auto_process_ngs.qc.utils.filter_fastqs(reads, fastqs, fastq_attrs=<class 'auto_process_ngs.analysis.AnalysisFastq'>)

Filter list of Fastqs and return names matching reads

Parameters:
  • reads (list) – list of reads to filter (‘r1’, ‘i2’ etc: ‘*’ matches all reads, ‘r*’ matches all data reads, ‘i*’ matches all index reads)

  • fastqs (list) – list of Fastq files or names to filter

  • fastq_attrs (BaseFastqAttrs) – class for extracting attribute data from Fastq names

Returns:

matching Fastq names (i.e. no leading path or trailing extensions)

Return type:

List
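The read patterns described above can be illustrated with a small standalone sketch. It does not call the library and is not its implementation: the helper names `read_matches` and `filter_fastq_names` are hypothetical, and the read ID is taken from an Illumina-style name with a simple regular expression rather than via a ‘BaseFastqAttrs’ class. Sorting the result is a choice made for this sketch only.

```python
import os
import re


def read_matches(read_id, patterns):
    """Return True if a read ID such as 'r1' or 'i2' matches any pattern."""
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern == "*":
            # '*' matches every read
            return True
        if pattern.endswith("*"):
            # 'r*' matches all data reads, 'i*' all index reads
            if read_id.startswith(pattern[:-1]):
                return True
        elif read_id == pattern:
            return True
    return False


def filter_fastq_names(reads, fastqs):
    """Return sorted Fastq names (no path or extensions) matching 'reads'."""
    matched = set()
    for fq in fastqs:
        # Drop the leading path and everything after the first '.'
        name = os.path.basename(fq).split(".")[0]
        # Assumption: names end with '_R1', '_I2', '_R1_001' and so on
        m = re.search(r"_([RI])([0-9])(?:_[0-9]+)?$", name)
        if m is None:
            continue
        read_id = (m.group(1) + m.group(2)).lower()
        if read_matches(read_id, reads):
            matched.add(name)
    return sorted(matched)
```

For example, passing `["r*"]` with a mix of R1, R2 and I1 files would keep only the R1 and R2 names in this sketch.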

auto_process_ngs.qc.utils.get_bam_basename(fastq, fastq_attrs=None)

Return basename for BAM file from Fastq filename

Uses the ‘bam_basename’ method of the supplied ‘BaseFastqAttrs’ subclass (defaults to ‘AnalysisFastq’ if none is specified) to generate the BAM file basename.

Typically the BAM basename will be the Fastq basename with the read ID removed, for example the Fastq filename ‘SM1_S1_L001_R1_001.fastq.gz’ will result in the BAM basename of ‘SM1_S1_L001_001’.

Parameters:
  • fastq (str) – Fastq filename; can include leading path and extensions (both will be ignored)

  • fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names (defaults to ‘AnalysisFastq’)

Returns:

basename for BAM file.

Return type:

String
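The mapping in the example above (Fastq name to BAM basename) can be sketched without the library as follows. This is only an illustration under the assumption of Illumina-style names; the real function delegates to the ‘bam_basename’ method of the ‘BaseFastqAttrs’ subclass, and `bam_basename_sketch` is a hypothetical name.

```python
import os
import re


def bam_basename_sketch(fastq):
    """Strip path, extensions and read ID from a Fastq filename."""
    name = os.path.basename(fastq)
    # Remove trailing extensions in order: '.gz' first, then '.fastq'/'.fq'
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    # Remove a read ID such as '_R1' or '_I2' when it is the last field
    # or is followed only by a numeric set field such as '_001'
    return re.sub(r"_[RI][0-9](?=_[0-9]+$|$)", "", name)
```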

auto_process_ngs.qc.utils.get_bam_samplename(bam, fastq_attrs=None)

Return sample name extracted from BAM filename

Parameters:
  • bam (str) – BAM filename

  • fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names

Returns:

sample name extracted from BAM filename.

Return type:

String

auto_process_ngs.qc.utils.get_seq_data_samples(project_dir, fastq_attrs=None)

Identify samples with biological (sequencing) data

If a list of samples is explicitly supplied in the project metadata (via the ‘biological_samples’ item) then this list is returned; otherwise, if configuration files are identified for ‘cellranger multi’, the list of biological samples is taken from these files.

If neither of these is found (or if the ‘cellranger multi’ config files don’t define any biological samples) then the default is to return the names of all the samples in the project.

Parameters:
  • project_dir (str) – path to the project directory

  • fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names (defaults to ‘AnalysisFastq’)

Returns:

list with the subset of samples with biological data

Return type:

List
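The order of precedence described above can be summarised in a short sketch. The function `choose_biological_samples` and its arguments are hypothetical: it takes plain lists that have already been read from the project, whereas the real function works from the project directory itself.

```python
def choose_biological_samples(all_samples,
                              metadata_samples=None,
                              multi_config_samples=None):
    """Pick the list of biological samples following the documented order."""
    # 1. An explicit list from the project metadata takes priority
    if metadata_samples:
        return list(metadata_samples)
    # 2. Otherwise use samples defined in 'cellranger multi' config files
    if multi_config_samples:
        return list(multi_config_samples)
    # 3. Fall back to all the samples in the project
    return list(all_samples)
```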

auto_process_ngs.qc.utils.read_versions_file(f, pkgs=None)

Extract software info from ‘versions’ file

‘versions’ files (typically named _versions) should consist of one or more lines of text, with each line comprising a software package name and a version number, separated by a tab character.

Returns a dictionary where package names are keys, and the corresponding values are lists of versions.

If an existing dictionary is supplied via the ‘pkgs’ argument then any package information is added to this dictionary; otherwise an empty dictionary is created and populated.

Parameters:
  • f (str) – path to ‘versions’ file

  • pkgs (dict) – optional, dictionary to extend with information from ‘versions’ file
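The file format and return value described above can be illustrated with a standalone sketch that parses lines of text rather than opening a file. The name `parse_versions` is hypothetical, and two behaviours are assumptions of this sketch rather than documented facts: lines without a tab are skipped, and a version already recorded for a package is not added twice.

```python
def parse_versions(lines, pkgs=None):
    """Collect 'package<TAB>version' lines into a dict of version lists."""
    if pkgs is None:
        # No dictionary supplied: create an empty one to populate
        pkgs = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            # Assumption: ignore blank or malformed lines
            continue
        pkg, version = line.split("\t", 1)
        versions = pkgs.setdefault(pkg, [])
        if version not in versions:
            # Assumption: avoid recording the same version twice
            versions.append(version)
    return pkgs
```

Passing the dictionary returned from one call as ‘pkgs’ in the next accumulates versions across several inputs.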

auto_process_ngs.qc.utils.report_qc(project, qc_dir=None, fastq_dir=None, qc_protocol=None, report_html=None, zip_outputs=True, multiqc=False, out_dir=None, force=False, runner=None, log_dir=None, suppress_warning=False)

Generate report for the QC run for a project

Parameters:
  • project (AnalysisProject) – analysis project to report the QC for

  • qc_dir (str) – optional, specify the subdir with the QC outputs being reported

  • fastq_dir (str) – optional, specify a non-default directory with Fastq files being verified

  • qc_protocol (str) – optional, QC protocol to verify against

  • report_html (str) – optional, path and name for the output QC report HTML file

  • zip_outputs (bool) – if True then also generate ZIP archive with the report and QC outputs

  • multiqc (bool) – if True then also generate MultiQC report

  • out_dir (str) – optional, path to the output directory to write the reports to (defaults to the project directory, ignored if report HTML file name is explicitly provided)

  • force (bool) – if True then force generation of QC report even if verification fails

  • runner (JobRunner) – optional, job runner to use for running the reporting

  • log_dir (str) – optional, specify a directory to write logs to

  • suppress_warning (bool) – if True then don’t show the warning message even when there are missing metrics (default: show the warning if there are missing metrics)

Returns:

exit code from reporting job (zero indicates success, non-zero indicates a problem).

Return type:

Integer

auto_process_ngs.qc.utils.set_cell_count_for_project(project_dir, qc_dir=None, tenx_pipeline='cellranger', source='count')

Set the total number of cells for a project

Depending on the specified ‘source’, sums the number of cells for each sample in a project as determined from either ‘cellranger* count’ or ‘cellranger multi’.

Depending on the 10x Genomics package and analysis type, the cell count for individual samples is extracted from the ‘metrics_summary.csv’ file for scRNA-seq (i.e. ‘cellranger count’ or ‘cellranger multi’), or from the ‘summary.csv’ file for scATAC (i.e. ‘cellranger-atac count’).

The final count is written to the ‘number_of_cells’ metadata item for the project.

Parameters:
  • project_dir (str) – path to the project directory

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • tenx_pipeline (str) – specify the 10x Genomics package, one of: ‘cellranger’, ‘cellranger-atac’, ‘cellranger-arc’ (default is ‘cellranger’)

  • source (str) – either ‘count’ or ‘multi’ (default is ‘count’)

Returns:

exit code, non-zero values indicate problems were encountered.

Return type:

Integer
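The summing step can be sketched for the ‘cellranger count’ case as follows. This is a simplified illustration rather than the library’s code: it assumes a single-row ‘metrics_summary.csv’ with a column named ‘Estimated Number of Cells’ whose value may contain thousands separators, and the function names are hypothetical. The ‘cellranger multi’ and ‘cellranger-atac’ files are laid out differently and are not handled here.

```python
import csv
import io


def cells_from_metrics_summary(csv_text,
                               column="Estimated Number of Cells"):
    """Extract the cell count from the text of a metrics summary CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    row = next(reader)
    # Values such as "1,234" need the thousands separator removed
    return int(row[column].replace(",", ""))


def total_cells(per_sample_csv):
    """Sum the cell counts over a mapping of sample name to CSV text."""
    return sum(cells_from_metrics_summary(text)
               for text in per_sample_csv.values())
```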

auto_process_ngs.qc.utils.verify_qc(project, qc_dir=None, fastq_dir=None, qc_protocol=None, runner=None, log_dir=None)

Verify the QC run for a project

Parameters:
  • project (AnalysisProject) – analysis project to verify the QC for

  • qc_dir (str) – optional, specify the subdir with the QC outputs being verified

  • fastq_dir (str) – optional, specify a non-default directory with Fastq files being verified

  • qc_protocol (str) – optional, QC protocol to verify against

  • runner (JobRunner) – optional, job runner to use for running the verification

  • log_dir (str) – optional, specify a directory to write logs to

Returns:

True if QC passes verification, otherwise False.

Return type:

Boolean