auto_process_ngs.qc.outputs
Provides utility classes and functions for QC outputs.
Provides the following classes:
QCOutputs: detect and characterise QC outputs
ExtraOutputs: helper class for reading the ‘extra_outputs.tsv’ file
- class auto_process_ngs.qc.outputs.ExtraOutputs(tsv_file)
Class for handling files specifying external QC outputs
Reads data from the supplied tab-delimited (TSV) file specifying one or more external QC output files.
Each line in the file should have up to three items separated by tabs:
file or directory (relative to the qc dir)
text description (used in HTML)
optionally, a comma-separated list of additional files or directories to include in the final ZIP archive (relative to the qc dir)
Blank lines and lines starting with the ‘#’ comment character are ignored.
The data from each line of the file is then available via the ‘outputs’ attribute, which provides a list of ‘AttributeDictionary’ objects with the following properties:
‘file_path’: relative path to the output file
‘description’: associated description
‘additional_files’: list of the associated files
- Parameters:
tsv_file (str) – path to the input TSV file
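The file format described above can be illustrated with a small standalone parser. This is a minimal sketch and not the library's ExtraOutputs implementation: the function name parse_extra_outputs, the use of plain dictionaries in place of ‘AttributeDictionary’ objects, the empty list for a missing third column, the ValueError for malformed lines, and the example paths are all assumptions made for illustration.

```python
import os
import tempfile


def parse_extra_outputs(tsv_file):
    """Parse a TSV file laid out as described above.

    Illustrative stand-in only (hypothetical helper, not part of
    auto_process_ngs): returns a list of plain dictionaries.
    """
    outputs = []
    with open(tsv_file, "rt") as fp:
        for line in fp:
            line = line.rstrip("\n")
            # Blank lines and '#' comment lines are ignored
            if not line.strip() or line.startswith("#"):
                continue
            items = line.split("\t")
            # Assumption of this sketch: two or three items are required
            if len(items) < 2 or len(items) > 3:
                raise ValueError("bad line in %s: %r" % (tsv_file, line))
            # Optional third item: comma-separated additional files
            # (assumption: an empty list when the item is absent)
            additional = items[2].split(",") if len(items) == 3 else []
            outputs.append({
                "file_path": items[0],
                "description": items[1],
                "additional_files": additional,
            })
    return outputs


# Hypothetical example content for an 'extra_outputs.tsv' file
example = (
    "# external QC outputs\n"
    "\n"
    "extra/report.html\tCustom report\n"
    "extra/plots/index.html\tPlots\textra/plots,extra/data.csv\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "extra_outputs.tsv")
    with open(tsv_path, "wt") as fp:
        fp.write(example)
    parsed = parse_extra_outputs(tsv_path)

for entry in parsed:
    print(entry["file_path"], entry["description"], entry["additional_files"])
```

With the real class, the equivalent data would be read from the ‘outputs’ attribute of an ExtraOutputs instance created from the TSV file path.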
- class auto_process_ngs.qc.outputs.QCOutputs(qc_dir, fastq_attrs=None)
Class to detect and characterise QC outputs
On instantiation this class scans the supplied directory to identify and classify various artefacts which are typically produced by the QC pipeline.
The following attributes are available:
fastqs: sorted list of Fastq names
reads: list of reads (e.g. ‘r1’, ‘r2’, ‘i1’ etc)
samples: sorted list of sample names extracted from Fastqs
seq_data_samples: sorted list of samples with biological data (rather than e.g. feature barcodes)
bams: sorted list of BAM file names
organisms: sorted list of organism names
fastq_screens: sorted list of screen names
cellranger_references: sorted list of reference datasets used with 10x pipelines
cellranger_probe_sets: sorted list of probe set files used with 10x pipelines
multiplexed_samples: sorted list of sample names for multiplexed samples (e.g. 10x CellPlex)
physical_samples: sorted list of physical sample names for multiplexed datasets (e.g. 10x CellPlex)
outputs: list of QC output categories detected (see below for valid values)
output_files: list of absolute paths to QC output files
software: dictionary with information on the QC software packages
stats: AttributeDictionary with useful stats from across the project
config_files: list of QC configuration files found in the QC directory (see below for valid values)
The following are valid values for the ‘outputs’ attribute:
‘fastqc_[r1…]’
‘screens_[r1…]’
‘strandedness’
‘sequence_lengths’
‘picard_insert_size_metrics’
‘rseqc_genebody_coverage’
‘rseqc_infer_experiment’
‘qualimap_rnaseq’
‘cellranger_count’
‘cellranger_multi’
‘cellranger-atac_count’
‘cellranger-arc_count’
‘multiqc’
‘extra_outputs’
The following are valid values for the ‘config_files’ attribute:
fastq_strand.conf
10x_multi_config.csv
10x_multi_config.<SAMPLE>.csv
libraries.<SAMPLE>.csv
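The configuration file names listed above can be read as simple filename patterns, with <SAMPLE> standing for a sample name. The following is a minimal sketch of that reading and not how QCOutputs itself detects these files: the wildcard patterns, the helper name find_config_files, and the sample names PJB1 and PJB2 are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# The four documented names, with <SAMPLE> replaced by a wildcard
# (assumption: any non-empty text is accepted as a sample name)
CONFIG_PATTERNS = (
    "fastq_strand.conf",
    "10x_multi_config.csv",
    "10x_multi_config.*.csv",
    "libraries.*.csv",
)


def find_config_files(file_names):
    """Return the sorted subset of names matching a config pattern.

    Hypothetical helper for illustration only.
    """
    return sorted(
        name for name in file_names
        if any(fnmatchcase(name, pattern) for pattern in CONFIG_PATTERNS)
    )


# Hypothetical contents of a QC directory
names = [
    "fastq_strand.conf",
    "10x_multi_config.PJB1.csv",
    "libraries.PJB2.csv",
    "multiqc_report.html",
    "10x_multi_config.csv",
]
config_files = find_config_files(names)
print(config_files)
```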
- Parameters:
qc_dir (str) – path to directory to examine
fastq_attrs (BaseFastqAttrs) – (optional) class for extracting data from Fastq names
- data(name)
Return the ‘raw’ data associated with a QC output
- Parameters:
name (str) – name identifier for a QC output (e.g. ‘fastqc’)
- Returns:
AttributeDictionary containing the raw data associated with the named QC output.
- Raises:
KeyError – if the name doesn’t match a stored QC output.
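Because data() raises KeyError for names that are not stored, callers that query several outputs may want to handle missing names explicitly. The following is a minimal sketch of one way to do that. It relies only on the documented data(name) method; the helper collect_raw_data and the stub class are hypothetical, and the stub exists only so the example runs without the library installed. In real use the first argument would be a QCOutputs instance.

```python
def collect_raw_data(qc, names):
    """Return a dictionary mapping each name to its raw data.

    'qc' is any object with a data(name) method that raises KeyError
    for unknown names, as documented for QCOutputs. Names with no
    stored data map to None (a choice made for this sketch).
    """
    collected = {}
    for name in names:
        try:
            collected[name] = qc.data(name)
        except KeyError:
            collected[name] = None
    return collected


class StubQCOutputs:
    """Hypothetical stand-in for QCOutputs, for illustration only."""

    def __init__(self, raw_data):
        self._raw_data = raw_data

    def data(self, name):
        # Dictionary lookup raises KeyError for unknown names
        return self._raw_data[name]


# Placeholder data; the real structure of the raw data is not shown here
stub = StubQCOutputs({"fastqc": {"placeholder": True}})
result = collect_raw_data(stub, ["fastqc", "multiqc"])
print(result)
```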