auto_process_ngs.qc.utils
Provides utility classes and functions for analysis project QC.
Provides the following functions:
verify_qc: verify the QC run for a project
report_qc: generate report for the QC run for a project
get_bam_basename: return the BAM file basename from a Fastq filename
get_bam_samplename: return the sample name from a BAM filename
get_seq_data_samples: identify samples with biological (sequencing) data
filter_fastqs: filter list of Fastqs based on read IDs
set_cell_count_for_project: sets total number of cells for a project
read_versions_file: extract software info from ‘versions’ file
- auto_process_ngs.qc.utils.filter_fastqs(reads, fastqs, fastq_attrs=<class 'auto_process_ngs.analysis.AnalysisFastq'>)
Filter list of Fastqs and return names matching reads
- Parameters:
reads (list) – list of reads to filter (‘r1’, ‘i2’, etc.; ‘*’ matches all reads, ‘r*’ matches all data reads, ‘i*’ matches all index reads)
fastqs (list) – list of Fastq files or names to filter
fastq_attrs (BaseFastqAttrs) – class for extracting attribute data from Fastq names
- Returns:
matching Fastq names (i.e. no leading path or trailing extensions)
- Return type:
List
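The read patterns described above can be illustrated with a minimal standalone sketch. It does not call the library: `sketch_filter_fastqs` and its helpers are hypothetical names, and the regular expression only recognises standard Illumina-style names (the real function delegates name parsing to the `fastq_attrs` class). Names without a recognisable read ID are skipped here, which is an assumption of this sketch.

```python
import fnmatch
import os
import re

# Assumption: the read ID appears as '_R1', '_I2' etc, optionally
# followed by a numeric field such as '_001', at the end of the name
READ_ID = re.compile(r"_([RI])([0-9])(?:_[0-9]+)?$")

def strip_path_and_extensions(fastq):
    """Remove leading path and '.gz', '.fastq' or '.fq' extensions."""
    name = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    return name

def sketch_filter_fastqs(reads, fastqs):
    """Return Fastq names whose read ID matches any pattern in 'reads'."""
    matched = []
    for fastq in fastqs:
        name = strip_path_and_extensions(fastq)
        found = READ_ID.search(name)
        if found is None:
            # Sketch-only choice: ignore names with no read ID
            continue
        read_id = (found.group(1) + found.group(2)).lower()
        # '*' matches everything, 'r*' any data read, 'i*' any index read
        if any(fnmatch.fnmatchcase(read_id, p.lower()) for p in reads):
            matched.append(name)
    return matched

example_fastqs = [
    "/data/SM1_S1_L001_R1_001.fastq.gz",
    "/data/SM1_S1_L001_R2_001.fastq.gz",
    "/data/SM1_S1_L001_I1_001.fastq.gz",
]
data_reads = sketch_filter_fastqs(["r*"], example_fastqs)
```

In this sketch the result keeps the order of the input list; whether the library does the same is not stated above.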
- auto_process_ngs.qc.utils.get_bam_basename(fastq, fastq_attrs=None)
Return basename for BAM file from Fastq filename
Uses the ‘bam_basename’ method of the supplied ‘BaseFastqAttrs’ subclass (defaults to ‘AnalysisFastq’ if none is specified) to generate the BAM file basename.
Typically the BAM basename will be the Fastq basename with the read ID removed, for example the Fastq filename ‘SM1_S1_L001_R1_001.fastq.gz’ will result in the BAM basename of ‘SM1_S1_L001_001’.
- Parameters:
fastq (str) – Fastq filename; can include leading path and extensions (both will be ignored)
fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names (defaults to ‘AnalysisFastq’)
- Returns:
basename for BAM file.
- Return type:
String
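As a rough illustration of the typical case described above (read ID removed from the Fastq basename), the following standalone sketch uses a regular expression. The function name `sketch_bam_basename` is hypothetical and the pattern is an assumption that only covers standard Illumina-style names; the library itself relies on the ‘bam_basename’ method of the supplied ‘BaseFastqAttrs’ subclass.

```python
import os
import re

def sketch_bam_basename(fastq):
    """Approximate the BAM basename for an Illumina-style Fastq name."""
    # Leading path and extensions are ignored
    name = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    # Drop the '_R1'/'_R2'/'_I1' field when it is the last field, or
    # when it is followed only by a trailing numeric field ('_001')
    return re.sub(r"_[RI][0-9](?=_[0-9]+$|$)", "", name)

bam_name = sketch_bam_basename("/data/SM1_S1_L001_R1_001.fastq.gz")
```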
- auto_process_ngs.qc.utils.get_bam_samplename(bam, fastq_attrs=None)
Return sample name extracted from BAM filename
- Parameters:
bam (str) – BAM filename
fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names
- Returns:
sample name extracted from BAM filename.
- Return type:
String
- auto_process_ngs.qc.utils.get_seq_data_samples(project_dir, fastq_attrs=None)
Identify samples with biological (sequencing) data
If a list of samples is explicitly supplied in the project metadata (via the ‘biological_samples’ item) then this list is returned; otherwise, if configuration files are identified for ‘cellranger multi’, the list of biological samples is taken from these files.
If neither of these is found (or if the ‘cellranger multi’ config files don’t define any biological samples) then the default is to return the names of all the samples in the project.
- Parameters:
project_dir (str) – path to the project directory
fastq_attrs (BaseFastqAttrs) – class for extracting data from Fastq names (defaults to ‘AnalysisFastq’)
- Returns:
list with subset of samples with biological data
- Return type:
List
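The order of precedence described above can be summarised in a short standalone sketch. The function `sketch_select_samples` and its arguments are hypothetical: they stand in for values that the library would obtain from the project metadata, the ‘cellranger multi’ config files and the project itself, and the sketch does not show how those values are read.

```python
def sketch_select_samples(all_samples, metadata_samples=None,
                          multi_config_samples=None):
    """Pick the biological samples using the documented precedence."""
    # 1. Samples explicitly listed in the project metadata win
    if metadata_samples:
        return list(metadata_samples)
    # 2. Otherwise use samples defined in 'cellranger multi' configs
    #    (an empty list means the configs defined no biological samples)
    if multi_config_samples:
        return list(multi_config_samples)
    # 3. Default: every sample in the project
    return list(all_samples)

project_samples = ["SM1", "SM2", "HTO1"]
from_metadata = sketch_select_samples(project_samples,
                                      metadata_samples=["SM1", "SM2"])
from_default = sketch_select_samples(project_samples,
                                     multi_config_samples=[])
```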
- auto_process_ngs.qc.utils.read_versions_file(f, pkgs=None)
Extract software info from ‘versions’ file
‘versions’ files (typically named _versions) should consist of one or more lines of text, with each line comprising a software package name and a version number, separated by a tab character.
Returns a dictionary where package names are keys, and the corresponding values are lists of versions.
If an existing dictionary is supplied via the ‘pkgs’ argument then any package information is added to this dictionary; otherwise an empty dictionary is created and populated.
- Parameters:
f (str) – path to ‘versions’ file
pkgs (dict) – optional, dictionary to extend with information from ‘versions’ file
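The file format and the dictionary-extending behaviour described above can be sketched without the library as follows. `sketch_read_versions` is a hypothetical name; skipping blank or tab-less lines and not repeating a version already recorded for a package are assumptions of this sketch, not documented behaviour. The package names and version strings in the example are made up.

```python
import os
import tempfile

def sketch_read_versions(path, pkgs=None):
    """Read tab-separated 'package<TAB>version' lines into a dict."""
    if pkgs is None:
        # No dictionary supplied: create and populate a new one
        pkgs = {}
    with open(path, "rt") as fp:
        for line in fp:
            line = line.rstrip("\n")
            if "\t" not in line:
                # Assumption: ignore blank or malformed lines
                continue
            name, version = line.split("\t", 1)
            versions = pkgs.setdefault(name, [])
            if version not in versions:
                # Assumption: record each version only once
                versions.append(version)
    return pkgs

# Example: extend an existing dictionary from a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    versions_file = os.path.join(tmp_dir, "_versions")
    with open(versions_file, "wt") as fp:
        fp.write("pkg_a\t1.0\npkg_b\t2.3\n")
    versions = sketch_read_versions(versions_file,
                                    pkgs={"pkg_a": ["0.9"]})
```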
- auto_process_ngs.qc.utils.report_qc(project, qc_dir=None, fastq_dir=None, qc_protocol=None, report_html=None, zip_outputs=True, multiqc=False, out_dir=None, force=False, runner=None, log_dir=None, suppress_warning=False)
Generate report for the QC run for a project
- Parameters:
project (AnalysisProject) – analysis project to report the QC for
qc_dir (str) – optional, specify the subdir with the QC outputs being reported
fastq_dir (str) – optional, specify a non-default directory with Fastq files being verified
qc_protocol (str) – optional, QC protocol to verify against
report_html (str) – optional, path for the output QC report HTML file
zip_outputs (bool) – if True then also generate ZIP archive with the report and QC outputs
multiqc (bool) – if True then also generate MultiQC report
out_dir (str) – optional, path to the output directory to write the reports to (defaults to the project directory, ignored if report HTML file name is explicitly provided)
force (bool) – if True then force generation of QC report even if verification fails
runner (JobRunner) – optional, job runner to use for running the reporting
log_dir (str) – optional, specify a directory to write logs to
suppress_warning (bool) – if True then don’t show the warning message even when there are missing metrics (default: show the warning if there are missing metrics)
- Returns:
exit code from reporting job (zero indicates success, non-zero indicates a problem).
- Return type:
Integer
- auto_process_ngs.qc.utils.set_cell_count_for_project(project_dir, qc_dir=None, tenx_pipeline='cellranger', source='count')
Set the total number of cells for a project
Depending on the specified ‘source’, sums the number of cells for each sample in a project as determined from either ‘cellranger* count’ or ‘cellranger multi’.
Depending on the 10x Genomics package and analysis type, the cell count for individual samples is extracted from the ‘metrics_summary.csv’ file for scRNA-seq (i.e. ‘cellranger count’ or ‘cellranger multi’), or from the ‘summary.csv’ file for scATAC (i.e. ‘cellranger-atac count’).
The final count is written to the ‘number_of_cells’ metadata item for the project.
- Parameters:
project_dir (str) – path to the project directory
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
tenx_pipeline (str) – specify the 10x Genomics package, one of: ‘cellranger’, ‘cellranger-atac’, ‘cellranger-arc’ (default is ‘cellranger’)
source (str) – either ‘count’ or ‘multi’ (default is ‘count’)
- Returns:
exit code; non-zero values indicate problems were encountered.
- Return type:
Integer
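The summing step for the ‘count’ source can be sketched as below. This is an illustration only: the function names are hypothetical, the column name ‘Estimated Number of Cells’ and the comma thousands separators are assumptions about the layout of ‘metrics_summary.csv’, and the CSV contents are made-up values. Locating the files and writing the ‘number_of_cells’ metadata item are not shown.

```python
import csv
import io

# Assumed column name in a 'cellranger count' metrics summary
CELLS_COLUMN = "Estimated Number of Cells"

def sketch_cells_from_metrics(csv_text, column=CELLS_COLUMN):
    """Extract the cell count from the text of one metrics summary."""
    # The summary is assumed to be a header line plus one data line
    row = next(csv.DictReader(io.StringIO(csv_text)))
    # Values are assumed to use thousands separators, e.g. "2,302"
    return int(row[column].replace(",", ""))

def sketch_total_cells(metrics_by_sample):
    """Sum the per-sample cell counts for a project."""
    return sum(sketch_cells_from_metrics(text)
               for text in metrics_by_sample.values())

# Made-up example data for two samples
example_metrics = {
    "SM1": 'Estimated Number of Cells,Mean Reads per Cell\n"2,302","28,463"\n',
    "SM2": 'Estimated Number of Cells,Mean Reads per Cell\n"1,500","30,001"\n',
}
total_cells = sketch_total_cells(example_metrics)
```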
- auto_process_ngs.qc.utils.verify_qc(project, qc_dir=None, fastq_dir=None, qc_protocol=None, runner=None, log_dir=None)
Verify the QC run for a project
- Parameters:
project (AnalysisProject) – analysis project to verify the QC for
qc_dir (str) – optional, specify the subdir with the QC outputs being verified
fastq_dir (str) – optional, specify a non-default directory with Fastq files being verified
qc_protocol (str) – optional, QC protocol to verify against
runner (JobRunner) – optional, job runner to use for running the verification
log_dir (str) – optional, specify a directory to write logs to
- Returns:
True if QC passes verification, otherwise False.
- Return type:
Boolean