auto_process_ngs.qc.outputs

Provides utility classes and functions for QC outputs.

Provides the following classes:

  • QCOutputs: detect and characterise QC outputs

  • ExtraOutputs: helper class for reading the ‘extra_outputs.tsv’ file

class auto_process_ngs.qc.outputs.ExtraOutputs(tsv_file)

Class for handling files specifying external QC outputs

Reads data from the supplied tab-delimited (TSV) file specifying one or more external QC output files.

Each line in the file should have up to three items separated by tabs:

  • file or directory (relative to the qc dir)

  • text description (used in HTML)

  • optionally, a comma-separated list of additional files or directories to include in the final ZIP archive (relative to the qc dir)

Blank lines and lines starting with the ‘#’ comment character are ignored.

The data from each line of the file is then available via the ‘outputs’ attribute, which provides a list of ‘AttributeDictionary’ objects with the following properties:

  • ‘file_path’: relative path to the output file

  • ‘description’: associated description

  • ‘additional_files’: list of the associated files

Parameters:

tsv_file (str) – path to the input TSV file
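
The file layout described above can be illustrated with a small standalone parser. This is only a sketch of the format, not the ExtraOutputs implementation: the function name ‘parse_extra_outputs’ is hypothetical, ‘SimpleNamespace’ stands in for ‘AttributeDictionary’, the example file content is made up, and treating a missing description as an empty string is an assumption.

```python
from types import SimpleNamespace


def parse_extra_outputs(text):
    """Parse text laid out like an 'extra_outputs.tsv' file.

    Hypothetical helper for illustration only; it is not part of
    auto_process_ngs and may differ from what ExtraOutputs does.
    """
    outputs = []
    for line in text.splitlines():
        # Blank lines and lines starting with '#' are ignored
        if not line.strip() or line.startswith("#"):
            continue
        items = line.split("\t")
        file_path = items[0]
        # Assumption: a missing description becomes an empty string
        description = items[1] if len(items) > 1 else ""
        # Optional third item: comma-separated additional files
        if len(items) > 2 and items[2].strip():
            additional_files = [f.strip() for f in items[2].split(",")]
        else:
            additional_files = []
        outputs.append(
            SimpleNamespace(
                file_path=file_path,
                description=description,
                additional_files=additional_files,
            )
        )
    return outputs


# Made-up example content (paths are relative to the qc dir)
example = (
    "# Extra QC outputs\n"
    "\n"
    "custom_report/index.html\tCustom report\t"
    "custom_report/data,custom_report/plots\n"
    "summary.pdf\tSummary PDF\n"
)

for out in parse_extra_outputs(example):
    print(out.file_path, out.description, out.additional_files)
```

With the real class, the equivalent records would be read from the ‘outputs’ attribute of an ExtraOutputs instance created from the TSV file path.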

class auto_process_ngs.qc.outputs.QCOutputs(qc_dir, fastq_attrs=None)

Class to detect and characterise QC outputs

On instantiation this class scans the supplied directory to identify and classify various artefacts which are typically produced by the QC pipeline.

The following attributes are available:

  • fastqs: sorted list of Fastq names

  • reads: list of reads (e.g. ‘r1’, ‘r2’, ‘i1’ etc)

  • samples: sorted list of sample names extracted from Fastqs

  • seq_data_samples: sorted list of samples with biological data (rather than e.g. feature barcodes)

  • bams: sorted list of BAM file names

  • organisms: sorted list of organism names

  • fastq_screens: sorted list of screen names

  • cellranger_references: sorted list of reference datasets used with 10x pipelines

  • cellranger_probe_sets: sorted list of probe set files used with 10x pipelines

  • multiplexed_samples: sorted list of sample names for multiplexed samples (e.g. 10x CellPlex)

  • physical_samples: sorted list of physical sample names for multiplexed datasets (e.g. 10x CellPlex)

  • outputs: list of QC output categories detected (see below for valid values)

  • output_files: list of absolute paths to QC output files

  • software: dictionary with information on the QC software packages

  • stats: AttributeDictionary with useful stats from across the project

  • config_files: list of QC configuration files found in the QC directory (see below for valid values)

The following are valid values for the ‘outputs’ property:

  • ‘fastqc_[r1…]’

  • ‘screens_[r1…]’

  • ‘strandedness’

  • ‘sequence_lengths’

  • ‘picard_insert_size_metrics’

  • ‘rseqc_genebody_coverage’

  • ‘rseqc_infer_experiment’

  • ‘qualimap_rnaseq’

  • ‘cellranger_count’

  • ‘cellranger_multi’

  • ‘cellranger-atac_count’

  • ‘cellranger-arc_count’

  • ‘multiqc’

  • ‘extra_outputs’

The following are valid values for the ‘config_files’ property:

  • fastq_strand.conf

  • 10x_multi_config.csv

  • 10x_multi_config.<SAMPLE>.csv

  • libraries.<SAMPLE>.csv
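
As an illustration of the file name patterns listed above, the following sketch matches a name against each pattern and extracts the ‘<SAMPLE>’ part where present. The names ‘CONFIG_PATTERNS’ and ‘classify_config_file’ are hypothetical, and the regular expressions are an assumption about how the placeholders could be interpreted; this is not how QCOutputs itself detects configuration files.

```python
import re

# Hypothetical patterns derived from the documented file names;
# '<SAMPLE>' is assumed to be any non-empty string
CONFIG_PATTERNS = [
    ("fastq_strand.conf",
     re.compile(r"^fastq_strand\.conf$")),
    ("10x_multi_config.csv",
     re.compile(r"^10x_multi_config\.csv$")),
    ("10x_multi_config.<SAMPLE>.csv",
     re.compile(r"^10x_multi_config\.(?P<sample>.+)\.csv$")),
    ("libraries.<SAMPLE>.csv",
     re.compile(r"^libraries\.(?P<sample>.+)\.csv$")),
]


def classify_config_file(name):
    """Return (pattern label, sample name or None) for a file name.

    Returns (None, None) if the name matches none of the patterns.
    """
    for label, pattern in CONFIG_PATTERNS:
        match = pattern.match(name)
        if match:
            return label, match.groupdict().get("sample")
    return None, None


# Made-up file names for demonstration
for name in ("fastq_strand.conf",
             "10x_multi_config.SM1.csv",
             "libraries.SM2.csv",
             "notes.txt"):
    print(name, classify_config_file(name))
```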

Parameters:

  • qc_dir (str) – path to directory to examine

  • fastq_attrs (BaseFastqAttrs) – (optional) class for extracting data from Fastq names
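
A minimal sketch of how the attributes documented above might be consumed. To stay self-contained it uses a stand-in object with made-up values instead of a real QCOutputs instance; the function name ‘summarise_qc’ is hypothetical, and only the documented ‘samples’, ‘reads’ and ‘outputs’ attributes are assumed.

```python
from types import SimpleNamespace


def summarise_qc(qc):
    """Build a short text summary from a QCOutputs-like object.

    Hypothetical helper: it only reads the 'samples', 'reads' and
    'outputs' attributes described in the documentation.
    """
    lines = [
        "Samples: %s" % ", ".join(qc.samples),
        "Reads: %s" % ", ".join(qc.reads),
        "Outputs: %s" % ", ".join(qc.outputs),
    ]
    return "\n".join(lines)


# Stand-in with made-up values; with the package installed this would
# instead be an instance created from a QC directory path
fake_qc = SimpleNamespace(
    samples=["SM1", "SM2"],
    reads=["r1", "r2"],
    outputs=["fastqc_r1", "fastqc_r2", "multiqc"],
)

print(summarise_qc(fake_qc))
```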

data(name)

Return the ‘raw’ data associated with a QC output

Parameters:

name (str) – name identifier for a QC output (e.g. ‘fastqc’)

Returns:

AttributeDictionary containing the raw data associated with the named QC output.

Raises:

KeyError – if the name doesn’t match a stored QC output.
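
Because ‘data’ raises KeyError for unknown names, callers may want to guard the lookup. The sketch below shows one way to do that; ‘FakeQCOutputs’ is a hypothetical stand-in that mimics only the documented behaviour of ‘data’, ‘get_qc_data’ is a hypothetical helper, and the stored values are placeholders.

```python
class FakeQCOutputs:
    """Hypothetical stand-in mimicking the documented 'data' method."""

    def __init__(self, data):
        self._data = data

    def data(self, name):
        # Raises KeyError if the name doesn't match a stored output
        return self._data[name]


def get_qc_data(qc, name, default=None):
    """Return the raw data for 'name', or 'default' if it is missing."""
    try:
        return qc.data(name)
    except KeyError:
        return default


# Placeholder content, not real QC data
qc = FakeQCOutputs({"fastqc": {"note": "placeholder"}})

print(get_qc_data(qc, "fastqc"))
print(get_qc_data(qc, "not_a_qc_output"))
```

Returning a default keeps reporting code simple when an output is optional; callers that need to distinguish a missing output from an empty one should catch KeyError directly instead.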