auto_process_ngs.qc.outputs
Provides utility classes and functions for QC outputs.
Provides the following classes:
QCOutputs: detect and characterise QC outputs
ExtraOutputs: helper class for reading ‘extra_outputs.tsv’ file
Provides the following functions:
fastq_screen_output: get names for fastq_screen outputs
fastqc_output: get names for FastQC outputs
fastq_strand_output: get name for fastq_strand.py output
picard_collect_insert_size_metrics_output: get names for Picard CollectInsertSizeMetrics output
rseqc_genebody_coverage_output: get names for RSeQC geneBody_coverage.py output
qualimap_rnaseq_output: get names for Qualimap ‘rnaseq’ output
cellranger_count_output: get names for cellranger count output
cellranger_atac_count_output: get names for cellranger-atac count output
cellranger_arc_count_output: get names for cellranger-arc count output
cellranger_multi_output: get names for cellranger multi output
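check_fastq_screen_outputs: fetch Fastqs without FastqScreen outputs
check_fastqc_outputs: fetch Fastqs without FastQC outputs
check_fastq_strand_outputs: fetch Fastqs without fastq_strand.py outputs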
check_cellranger_count_outputs: fetch sample names without cellranger count outputs
check_cellranger_atac_count_outputs: fetch sample names without cellranger-atac count outputs
check_cellranger_arc_count_outputs: fetch sample names without cellranger-arc count outputs
- class auto_process_ngs.qc.outputs.ExtraOutputs(tsv_file)
Class for handling files specifying external QC outputs
Reads data from the supplied tab-delimited (TSV) file specifying one or more external QC output files.
Each line in the file should have up to three items separated by tabs:
file or directory (relative to the qc dir)
text description (used in HTML)
optionally, comma-separated list of additional files or directories to include in the final ZIP archive (relative to the qc dir)
Blank lines and lines starting with the ‘#’ comment character are ignored.
The data from each line of the file is then available via the ‘outputs’ attribute, which provides a list of ‘AttributeDictionary’ objects with the following properties:
‘file_path’: relative path to the output file
‘description’: associated description
‘additional_files’: list of the associated files
- Parameters:
tsv_file (str) – path to the input TSV file
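The file format described above can be illustrated with a minimal sketch. The function `parse_extra_outputs` below is a hypothetical stand-in written for this example, not the library's implementation; it returns plain dictionaries rather than ‘AttributeDictionary’ objects, and it assumes a missing description is treated as an empty string.

```python
import io

def parse_extra_outputs(lines):
    """Hypothetical parser for the 'extra_outputs.tsv' layout (illustration only)."""
    outputs = []
    for line in lines:
        line = line.rstrip("\n")
        # Blank lines and lines starting with '#' are ignored
        if not line.strip() or line.startswith("#"):
            continue
        items = line.split("\t")
        # Optional third item: comma-separated additional files/directories
        additional = []
        if len(items) > 2 and items[2].strip():
            additional = [f.strip() for f in items[2].split(",")]
        outputs.append({
            "file_path": items[0],
            # Assumption: a missing description becomes an empty string
            "description": items[1] if len(items) > 1 else "",
            "additional_files": additional,
        })
    return outputs

# Made-up file content used only for this example
example = io.StringIO(
    "# external QC outputs\n"
    "\n"
    "extra/report.html\tCustom report\textra/data,extra/plots\n"
    "extra/summary.txt\tSummary table\n"
)
extra_outputs = parse_extra_outputs(example)
```

With the real class, the equivalent data would be read from the ‘outputs’ attribute of an ExtraOutputs instance.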
- class auto_process_ngs.qc.outputs.QCOutputs(qc_dir, fastq_attrs=None)
Class to detect and characterise QC outputs
On instantiation this class scans the supplied directory to identify and classify various artefacts which are typically produced by the QC pipeline.
The following attributes are available:
fastqs: sorted list of Fastq names
reads: list of reads (e.g. ‘r1’, ‘r2’, ‘i1’ etc)
samples: sorted list of sample names extracted from Fastqs
seq_data_samples: sorted list of samples with biological data (rather than e.g. feature barcodes)
bams: sorted list of BAM file names
organisms: sorted list of organism names
fastq_screens: sorted list of screen names
cellranger_references: sorted list of reference datasets used with 10x pipelines
cellranger_probe_sets: sorted list of probe set files used with 10x pipelines
multiplexed_samples: sorted list of sample names for multiplexed samples (e.g. 10x CellPlex)
outputs: list of QC output categories detected (see below for valid values)
output_files: list of absolute paths to QC output files
software: dictionary with information on the QC software packages
stats: AttributeDictionary with useful stats from across the project
config_files: list of QC configuration files found in the QC directory (see below for valid values)
The following are valid values for the ‘outputs’ property:
‘fastqc_[r1…]’
‘screens_[r1…]’
‘strandedness’
‘sequence_lengths’
‘picard_insert_size_metrics’
‘rseqc_genebody_coverage’
‘rseqc_infer_experiment’
‘qualimap_rnaseq’
‘icell8_stats’
‘icell8_report’
‘cellranger_count’
‘cellranger_multi’
‘cellranger-atac_count’
‘cellranger-arc_count’
‘multiqc’
‘extra_outputs’
The following are valid values for the ‘config_files’ property:
fastq_strand.conf
10x_multi_config.csv
10x_multi_config.<SAMPLE>.csv
libraries.<SAMPLE>.csv
- Parameters:
qc_dir (str) – path to directory to examine
fastq_attrs (BaseFastqAttrs) – (optional) class for extracting data from Fastq names
- data(name)
Return the ‘raw’ data associated with a QC output
- Parameters:
name (str) – name identifier for a QC output (e.g. ‘fastqc’)
- Returns:
AttributeDictionary containing the raw data associated with the named QC output.
- Raises:
KeyError – if the name doesn’t match a stored QC output.
- auto_process_ngs.qc.outputs.cellranger_arc_count_output(project, sample_name=None, prefix='cellranger_count')
Generate list of ‘cellranger-arc count’ outputs
Given an AnalysisProject, the outputs from ‘cellranger-arc count’ will look like:
{PREFIX}/{SAMPLE_n}/outs/summary.csv
{PREFIX}/{SAMPLE_n}/outs/web_summary.html
for each SAMPLE_n in the project.
If a sample name is supplied then outputs are limited to those for that sample
- Parameters:
project (AnalysisProject) – project to generate output names for
sample_name (str) – sample to limit outputs to
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
cellranger-arc count outputs (without leading paths)
- Return type:
tuple
- auto_process_ngs.qc.outputs.cellranger_atac_count_output(project, sample_name=None, prefix='cellranger_count')
Generate list of ‘cellranger-atac count’ outputs
Given an AnalysisProject, the outputs from ‘cellranger-atac count’ will look like:
{PREFIX}/{SAMPLE_n}/outs/summary.csv
{PREFIX}/{SAMPLE_n}/outs/web_summary.html
for each SAMPLE_n in the project.
If a sample name is supplied then outputs are limited to those for that sample
- Parameters:
project (AnalysisProject) – project to generate output names for
sample_name (str) – sample to limit outputs to
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
cellranger-atac count outputs (without leading paths)
- Return type:
tuple
- auto_process_ngs.qc.outputs.cellranger_count_output(project, sample_name=None, prefix='cellranger_count')
Generate list of ‘cellranger count’ outputs
Given an AnalysisProject, the outputs from ‘cellranger count’ will look like:
{PREFIX}/{SAMPLE_n}/outs/metrics_summary.csv
{PREFIX}/{SAMPLE_n}/outs/web_summary.html
for each SAMPLE_n in the project.
If a sample name is supplied then outputs are limited to those for that sample
- Parameters:
project (AnalysisProject) – project to generate output names for
sample_name (str) – sample to limit outputs to
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
cellranger count outputs (without leading paths)
- Return type:
tuple
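The naming scheme above can be sketched as follows. `sketch_cellranger_count_output` is a hypothetical helper written for this example; it takes a plain list of sample names instead of an AnalysisProject, so it only illustrates how the paths are composed and is not the library function itself.

```python
def sketch_cellranger_count_output(sample_names, sample_name=None,
                                   prefix="cellranger_count"):
    """Hypothetical illustration of the 'cellranger count' output names."""
    outputs = []
    for name in sample_names:
        # Limit to a single sample if one was requested
        if sample_name is not None and name != sample_name:
            continue
        for filename in ("metrics_summary.csv", "web_summary.html"):
            outputs.append(f"{prefix}/{name}/outs/{filename}")
    return tuple(outputs)

# Sample names here are invented for the example
all_outputs = sketch_cellranger_count_output(["SMP1", "SMP2"])
smp2_outputs = sketch_cellranger_count_output(["SMP1", "SMP2"],
                                              sample_name="SMP2")
```

The ‘cellranger-atac count’ and ‘cellranger-arc count’ variants follow the same pattern, with summary.csv in place of metrics_summary.csv.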
- auto_process_ngs.qc.outputs.cellranger_multi_output(project, config_csv, sample_name=None, prefix='cellranger_multi')
Generate list of ‘cellranger multi’ outputs
Given an AnalysisProject, the outputs from ‘cellranger multi’ will look like:
{PREFIX}/outs/multi/multiplexing_analysis/tag_calls_summary.csv
and
{PREFIX}/outs/per_sample_outs/{SAMPLE_n}/metrics_summary.csv
{PREFIX}/outs/per_sample_outs/{SAMPLE_n}/web_summary.html
for each multiplexed SAMPLE_n defined in the config.csv file (nb these are not equivalent to the ‘samples’ defined by the Fastq files in the project).
If a sample name is supplied then outputs are limited to those for that sample; if the supplied config.csv file isn’t found then no outputs will be returned.
- Parameters:
project (AnalysisProject) – project to generate output names for
config_csv (str) – path to the cellranger multi config.csv file
sample_name (str) – multiplexed sample to limit outputs to (optional)
prefix (str) – directory for outputs (optional, defaults to “cellranger_multi”)
- Returns:
cellranger multi outputs (without leading paths)
- Return type:
tuple
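A rough sketch of the layout described above is given below. `sketch_cellranger_multi_output` is hypothetical: it takes the multiplexed sample names directly (reading them from the config.csv file is not shown), uses `None` to stand for a config file that wasn't found, and assumes that the tag calls summary is listed even when a single sample is requested. The order of the returned names is also an assumption.

```python
def sketch_cellranger_multi_output(multiplexed_samples, sample_name=None,
                                   prefix="cellranger_multi"):
    """Hypothetical illustration of the 'cellranger multi' output names."""
    # Assumption: None means the config.csv file could not be found
    if multiplexed_samples is None:
        return ()
    outputs = []
    for name in multiplexed_samples:
        # Limit to a single multiplexed sample if one was requested
        if sample_name is not None and name != sample_name:
            continue
        per_sample = f"{prefix}/outs/per_sample_outs/{name}"
        outputs.append(f"{per_sample}/metrics_summary.csv")
        outputs.append(f"{per_sample}/web_summary.html")
    # Assumption: the tag calls summary is not sample-specific, so it is
    # always included regardless of the sample filter
    outputs.append(
        f"{prefix}/outs/multi/multiplexing_analysis/tag_calls_summary.csv")
    return tuple(outputs)

# Multiplexed sample names here are invented for the example
no_config = sketch_cellranger_multi_output(None)
sb_outputs = sketch_cellranger_multi_output(["SA", "SB"], sample_name="SB")
```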
- auto_process_ngs.qc.outputs.check_cellranger_arc_count_outputs(project, qc_dir=None, prefix='cellranger_count')
Return samples missing QC outputs from ‘cellranger-arc count’
Returns a list of the samples from a project for which one or more associated outputs from cellranger-arc count don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
list of sample names with missing outputs
- Return type:
list
- auto_process_ngs.qc.outputs.check_cellranger_atac_count_outputs(project, qc_dir=None, prefix='cellranger_count')
Return samples missing QC outputs from ‘cellranger-atac count’
Returns a list of the samples from a project for which one or more associated outputs from cellranger-atac count don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
list of sample names with missing outputs
- Return type:
list
- auto_process_ngs.qc.outputs.check_cellranger_count_outputs(project, qc_dir=None, prefix='cellranger_count')
Return samples missing QC outputs from ‘cellranger count’
Returns a list of the samples from a project for which one or more associated outputs from cellranger count don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to QC directory (if not the default QC directory for the project)
prefix (str) – directory for outputs (defaults to “cellranger_count”)
- Returns:
list of sample names with missing outputs
- Return type:
list
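The check described above amounts to predicting the expected files for each sample and testing whether they exist. A minimal sketch, assuming a plain list of sample names in place of an AnalysisProject; `sketch_missing_samples` is a hypothetical helper, not part of the module:

```python
import os
import tempfile

def sketch_missing_samples(qc_dir, sample_names, prefix="cellranger_count"):
    """Hypothetical check for samples lacking 'cellranger count' outputs."""
    missing = []
    for name in sample_names:
        expected = (
            os.path.join(prefix, name, "outs", "metrics_summary.csv"),
            os.path.join(prefix, name, "outs", "web_summary.html"),
        )
        # A sample is reported if one or more expected files are absent
        if not all(os.path.exists(os.path.join(qc_dir, f))
                   for f in expected):
            missing.append(name)
    return missing

# Build a throwaway QC directory where only SMP1 has outputs
with tempfile.TemporaryDirectory() as qc_dir:
    outs = os.path.join(qc_dir, "cellranger_count", "SMP1", "outs")
    os.makedirs(outs)
    for filename in ("metrics_summary.csv", "web_summary.html"):
        open(os.path.join(outs, filename), "w").close()
    missing = sketch_missing_samples(qc_dir, ["SMP1", "SMP2"])
```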
- auto_process_ngs.qc.outputs.check_fastq_screen_outputs(project, qc_dir, screen, fastqs=None, read_numbers=None, legacy=False)
Return Fastqs missing QC outputs from FastqScreen
Returns a list of the Fastqs from a project for which one or more associated outputs from FastqScreen don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to the QC directory (relative path is assumed to be a subdirectory of the project)
screen (str) – screen name to check
fastqs (list) – optional list of Fastqs to check against (defaults to Fastqs from the project)
read_numbers (list) – read numbers to define Fastqs to predict outputs for; if not set then all non-index reads will be included
legacy (bool) – if True then check for ‘legacy’-style names (default: False)
- Returns:
list of Fastq files with missing outputs.
- Return type:
list
- auto_process_ngs.qc.outputs.check_fastq_strand_outputs(project, qc_dir, fastq_strand_conf, fastqs=None, read_numbers=None)
Return Fastqs missing QC outputs from fastq_strand.py
Returns a list of the Fastqs from a project for which one or more associated outputs from fastq_strand.py don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to the QC directory (relative path is assumed to be a subdirectory of the project)
fastq_strand_conf (str) – path to a fastq_strand config file; strandedness QC outputs will be included unless the path is None or the config file doesn’t exist. Relative path is assumed to be a subdirectory of the project
fastqs (list) – optional list of Fastqs to check against (defaults to Fastqs from the project)
read_numbers (list) – read numbers to predict outputs for
- Returns:
list of Fastq file “pairs” with missing outputs; pairs are (R1,R2) tuples, with ‘R2’ missing if only one Fastq is used for the strandedness determination.
- Return type:
list
- auto_process_ngs.qc.outputs.check_fastqc_outputs(project, qc_dir, fastqs=None, read_numbers=None)
Return Fastqs missing QC outputs from FastQC
Returns a list of the Fastqs from a project for which one or more associated outputs from FastQC don’t exist in the specified QC directory.
- Parameters:
project (AnalysisProject) – project to check the QC outputs for
qc_dir (str) – path to the QC directory (relative path is assumed to be a subdirectory of the project)
fastqs (list) – optional list of Fastqs to check against (defaults to Fastqs from the project)
read_numbers (list) – read numbers to predict outputs for
- Returns:
list of Fastq files with missing outputs.
- Return type:
list
- auto_process_ngs.qc.outputs.fastq_screen_output(fastq, screen_name, legacy=False)
Generate name of fastq_screen output files
Given a Fastq file name and a screen name, the outputs from fastq_screen will look like:
{FASTQ}_screen_{SCREEN_NAME}.png
{FASTQ}_screen_{SCREEN_NAME}.txt
“Legacy” screen outputs look like:
{FASTQ}_{SCREEN_NAME}_screen.png
{FASTQ}_{SCREEN_NAME}_screen.txt
- Parameters:
fastq (str) – name of Fastq file
screen_name (str) – name of screen
legacy (bool) – if True then use ‘legacy’ (old-style) naming convention (default: False)
- Returns:
fastq_screen output names (without leading path)
- Return type:
tuple
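The two naming conventions can be compared with a short sketch. `sketch_fastq_screen_output` is a hypothetical helper for illustration; it assumes that {FASTQ} is the file's basename with any ‘.fastq’, ‘.fq’ and ‘.gz’ extensions removed, which may differ from how the library derives it.

```python
import os

def sketch_fastq_screen_output(fastq, screen_name, legacy=False):
    """Hypothetical illustration of the fastq_screen output names."""
    base = os.path.basename(fastq)
    # Assumption: strip compression and Fastq extensions from the basename
    for ext in (".gz", ".fastq", ".fq"):
        if base.endswith(ext):
            base = base[:-len(ext)]
    if legacy:
        # Old-style: screen name comes before '_screen'
        stem = f"{base}_{screen_name}_screen"
    else:
        stem = f"{base}_screen_{screen_name}"
    return (f"{stem}.png", f"{stem}.txt")

# File and screen names here are invented for the example
current = sketch_fastq_screen_output("/data/S1_R1.fastq.gz",
                                     "model_organisms")
old_style = sketch_fastq_screen_output("/data/S1_R1.fastq.gz",
                                       "model_organisms", legacy=True)
```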
- auto_process_ngs.qc.outputs.fastq_strand_output(fastq)
Generate name for fastq_strand.py output
Given a Fastq file name, the output from fastq_strand.py will look like:
{FASTQ}_fastq_strand.txt
- Parameters:
fastq (str) – name of Fastq file
- Returns:
fastq_strand.py output (without leading paths)
- Return type:
tuple
- auto_process_ngs.qc.outputs.fastqc_output(fastq)
Generate name of FastQC outputs
Given a Fastq file name, the outputs from FastQC will look like:
{FASTQ}_fastqc/
{FASTQ}_fastqc.html
{FASTQ}_fastqc.zip
- Parameters:
fastq (str) – name of Fastq file
- Returns:
FastQC outputs (without leading paths)
- Return type:
tuple
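As an illustration of the pattern above, the following sketch builds the three names from a Fastq path. `sketch_fastqc_output` is hypothetical, and it assumes that {FASTQ} is the basename with ‘.fastq’, ‘.fq’ and ‘.gz’ extensions removed.

```python
import os

def sketch_fastqc_output(fastq):
    """Hypothetical illustration of the FastQC output names."""
    base = os.path.basename(fastq)
    # Assumption: strip compression and Fastq extensions from the basename
    for ext in (".gz", ".fastq", ".fq"):
        if base.endswith(ext):
            base = base[:-len(ext)]
    # Directory, HTML report and ZIP archive, in that order
    return (f"{base}_fastqc", f"{base}_fastqc.html", f"{base}_fastqc.zip")

# File name here is invented for the example
fastqc_names = sketch_fastqc_output("/data/S1_R2.fq.gz")
```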
- auto_process_ngs.qc.outputs.picard_collect_insert_size_metrics_output(filen, prefix=None)
Generate names of Picard CollectInsertSizeMetrics output
Given a Fastq or BAM file name, the output from Picard’s CollectInsertSizeMetrics function will look like:
{PREFIX}/{FASTQ}.insert_size_metrics.txt
{PREFIX}/{FASTQ}.insert_size_histogram.pdf
- Parameters:
filen (str) – name of Fastq or BAM file
prefix (str) – optional directory to prepend to outputs
- Returns:
CollectInsertSizeMetrics output (without leading paths)
- Return type:
tuple
- auto_process_ngs.qc.outputs.qualimap_rnaseq_output(prefix=None)
Generate names of Qualimap ‘rnaseq’ output
The outputs from Qualimap ‘rnaseq’ are always:
{PREFIX}/qualimapReport.html
{PREFIX}/rnaseq_qc_results.txt
{PREFIX}/...
- Parameters:
prefix (str) – optional directory to prepend to outputs
- Returns:
Qualimap ‘rnaseq’ output (without leading paths)
- Return type:
tuple
- auto_process_ngs.qc.outputs.rseqc_genebody_coverage_output(name, prefix=None)
Generate names of RSeQC geneBody_coverage.py output
Given a basename, the output from geneBody_coverage.py will look like:
{PREFIX}/{NAME}.geneBodyCoverage.curves.png
{PREFIX}/{NAME}.geneBodyCoverage.r
{PREFIX}/{NAME}.geneBodyCoverage.txt
- Parameters:
name (str) – basename for output files
prefix (str) – optional directory to prepend to outputs
- Returns:
geneBody_coverage.py output (without leading paths)
- Return type:
tuple
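The effect of the optional prefix can be seen in a small sketch. `sketch_rseqc_genebody_coverage_output` is a hypothetical helper that mirrors the pattern above; it assumes that when no prefix is given the names are returned without any leading directory.

```python
import posixpath

def sketch_rseqc_genebody_coverage_output(name, prefix=None):
    """Hypothetical illustration of the geneBody_coverage.py output names."""
    outputs = [f"{name}.geneBodyCoverage.{suffix}"
               for suffix in ("curves.png", "r", "txt")]
    # Assumption: the prefix is only prepended when one is supplied
    if prefix is not None:
        outputs = [posixpath.join(prefix, f) for f in outputs]
    return tuple(outputs)

# Basename and prefix here are invented for the example
bare = sketch_rseqc_genebody_coverage_output("PJB")
prefixed = sketch_rseqc_genebody_coverage_output("PJB", prefix="rseqc/human")
```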