auto_process_ngs.qc.outputs

Provides utility classes and functions for QC outputs.

Provides the following classes:

  • QCOutputs: detect and characterise QC outputs

  • ExtraOutputs: helper class for reading ‘extra_outputs.tsv’ file

Provides the following functions:

  • fastq_screen_output: get names for fastq_screen outputs

  • fastqc_output: get names for FastQC outputs

  • fastq_strand_output: get name for fastq_strand.py output

  • picard_collect_insert_size_metrics_output: get names for Picard CollectInsertSizeMetrics output

  • rseqc_genebody_coverage_output: get names for RSeQC geneBody_coverage.py output

  • qualimap_rnaseq_output: get names for Qualimap ‘rnaseq’ output

  • cellranger_count_output: get names for cellranger count output

  • cellranger_atac_count_output: get names for cellranger-atac count output

  • cellranger_arc_count_output: get names for cellranger-arc count output

  • cellranger_multi_output: get names for cellranger multi output

  • check_fastq_strand_outputs: fetch Fastqs without fastq_strand.py outputs

  • check_cellranger_count_outputs: fetch sample names without cellranger count outputs

  • check_cellranger_atac_count_outputs: fetch sample names without cellranger-atac count outputs

  • check_cellranger_arc_count_outputs: fetch sample names without cellranger-arc count outputs

class auto_process_ngs.qc.outputs.ExtraOutputs(tsv_file)

Class for handling files specifying external QC outputs

Reads data from the supplied tab-delimited (TSV) file specifying one or more external QC output files.

Each line in the file should have up to three items separated by tabs:

  • file or directory (relative to the qc dir)

  • text description (used in HTML)

  • optionally, comma-separated list of additional files or directories to include in the final ZIP archive (relative to the qc dir)

Blank lines and lines starting with the ‘#’ comment character are ignored.

The data from each line of the file is then available via the ‘outputs’ attribute, which provides a list of ‘AttributeDictionary’ objects with the following properties:

  • ‘file_path’: relative path to the output file

  • ‘description’: associated description

  • ‘additional_files’: list of the associated files

Parameters:

tsv_file (str) – path to the input TSV file
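
The file format described above can be illustrated with a small standalone sketch. It does not use the ExtraOutputs class itself: the function name `parse_extra_outputs`, the use of plain dictionaries in place of AttributeDictionary objects, the error raised for malformed lines, and the example file names are all assumptions made for illustration.

```python
def parse_extra_outputs(text):
    """Parse 'extra_outputs.tsv'-style text (hypothetical helper).

    Returns a list of dicts with the keys 'file_path',
    'description' and 'additional_files'.
    """
    outputs = []
    for line in text.splitlines():
        # Blank lines and '#' comment lines are ignored
        if not line.strip() or line.startswith("#"):
            continue
        items = line.split("\t")
        # Assumption: a line needs a file and a description,
        # plus an optional third item
        if len(items) < 2 or len(items) > 3:
            raise ValueError(f"Expected 2 or 3 tab-separated items: {line!r}")
        # Third item is a comma-separated list of extra files
        additional = items[2].split(",") if len(items) == 3 else []
        outputs.append({
            "file_path": items[0],
            "description": items[1],
            "additional_files": additional,
        })
    return outputs


# Made-up example content
example = (
    "# Extra QC outputs\n"
    "\n"
    "report/index.html\tCustom report\treport/data,report/img\n"
    "notes.txt\tAnalyst notes\n"
)
for record in parse_extra_outputs(example):
    print(record["file_path"], record["additional_files"])
```

All paths in such a file are interpreted relative to the qc dir, so the sketch leaves them untouched rather than resolving them.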

class auto_process_ngs.qc.outputs.QCOutputs(qc_dir, fastq_attrs=None)

Class to detect and characterise QC outputs

On instantiation this class scans the supplied directory to identify and classify various artefacts which are typically produced by the QC pipeline.

The following attributes are available:

  • fastqs: sorted list of Fastq names

  • reads: list of reads (e.g. ‘r1’, ‘r2’, ‘i1’ etc)

  • samples: sorted list of sample names extracted from Fastqs

  • seq_data_samples: sorted list of samples with biological data (rather than e.g. feature barcodes)

  • bams: sorted list of BAM file names

  • organisms: sorted list of organism names

  • fastq_screens: sorted list of screen names

  • cellranger_references: sorted list of reference datasets used with 10x pipelines

  • cellranger_probe_sets: sorted list of probe set files used with 10x pipelines

  • multiplexed_samples: sorted list of sample names for multiplexed samples (e.g. 10x CellPlex)

  • outputs: list of QC output categories detected (see below for valid values)

  • output_files: list of absolute paths to QC output files

  • software: dictionary with information on the QC software packages

  • stats: AttributeDictionary with useful stats from across the project

  • config_files: list of QC configuration files found in the QC directory (see below for valid values)

The following are valid values for the ‘outputs’ property:

  • ‘fastqc_[r1…]’

  • ‘screens_[r1…]’

  • ‘strandedness’

  • ‘sequence_lengths’

  • ‘picard_insert_size_metrics’

  • ‘rseqc_genebody_coverage’

  • ‘rseqc_infer_experiment’

  • ‘qualimap_rnaseq’

  • ‘icell8_stats’

  • ‘icell8_report’

  • ‘cellranger_count’

  • ‘cellranger_multi’

  • ‘cellranger-atac_count’

  • ‘cellranger-arc_count’

  • ‘multiqc’

  • ‘extra_outputs’

The following are valid values for the ‘config_files’ property:

  • fastq_strand.conf

  • 10x_multi_config.csv

  • 10x_multi_config.<SAMPLE>.csv

  • libraries.<SAMPLE>.csv

Parameters:
  • qc_dir (str) – path to directory to examine

  • fastq_attrs (BaseFastqAttrs) – (optional) class for extracting data from Fastq names

data(name)

Return the ‘raw’ data associated with a QC output

Parameters:

name (str) – name identifier for a QC output (e.g. ‘fastqc’)

Returns:

AttributeDictionary containing the raw data associated with the named QC output.

Raises:

KeyError – if the name doesn’t match a stored QC output.

auto_process_ngs.qc.outputs.cellranger_arc_count_output(project, sample_name=None, prefix='cellranger_count')

Generate list of ‘cellranger-arc count’ outputs

Given an AnalysisProject, the outputs from ‘cellranger-arc count’ will look like:

  • {PREFIX}/{SAMPLE_n}/outs/summary.csv

  • {PREFIX}/{SAMPLE_n}/outs/web_summary.html

for each SAMPLE_n in the project.

If a sample name is supplied then outputs are limited to those for that sample.

Parameters:
  • project (AnalysisProject) – project to generate output names for

  • sample_name (str) – sample to limit outputs to

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

cellranger-arc count outputs (without leading paths)

Return type:

tuple

auto_process_ngs.qc.outputs.cellranger_atac_count_output(project, sample_name=None, prefix='cellranger_count')

Generate list of ‘cellranger-atac count’ outputs

Given an AnalysisProject, the outputs from ‘cellranger-atac count’ will look like:

  • {PREFIX}/{SAMPLE_n}/outs/summary.csv

  • {PREFIX}/{SAMPLE_n}/outs/web_summary.html

for each SAMPLE_n in the project.

If a sample name is supplied then outputs are limited to those for that sample.

Parameters:
  • project (AnalysisProject) – project to generate output names for

  • sample_name (str) – sample to limit outputs to

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

cellranger-atac count outputs (without leading paths)

Return type:

tuple

auto_process_ngs.qc.outputs.cellranger_count_output(project, sample_name=None, prefix='cellranger_count')

Generate list of ‘cellranger count’ outputs

Given an AnalysisProject, the outputs from ‘cellranger count’ will look like:

  • {PREFIX}/{SAMPLE_n}/outs/metrics_summary.csv

  • {PREFIX}/{SAMPLE_n}/outs/web_summary.html

for each SAMPLE_n in the project.

If a sample name is supplied then outputs are limited to those for that sample.

Parameters:
  • project (AnalysisProject) – project to generate output names for

  • sample_name (str) – sample to limit outputs to

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

cellranger count outputs (without leading paths)

Return type:

tuple
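
The naming patterns listed for the three ‘count’ variants above can be sketched as follows. This is an illustration only, not the library code: it takes a plain list of sample names instead of an AnalysisProject, and the names `SUMMARY_FILES` and `sketch_count_outputs` and the `package` argument are hypothetical.

```python
# Output file names per 10x package, as given in the listings above
SUMMARY_FILES = {
    "cellranger": ("metrics_summary.csv", "web_summary.html"),
    "cellranger-atac": ("summary.csv", "web_summary.html"),
    "cellranger-arc": ("summary.csv", "web_summary.html"),
}


def sketch_count_outputs(sample_names, package="cellranger",
                         sample_name=None, prefix="cellranger_count"):
    """Build {PREFIX}/{SAMPLE_n}/outs/... names (hypothetical helper)."""
    outputs = []
    for name in sample_names:
        # Limit to a single sample if one was requested
        if sample_name is not None and name != sample_name:
            continue
        for filename in SUMMARY_FILES[package]:
            outputs.append(f"{prefix}/{name}/outs/{filename}")
    return tuple(outputs)


# Made-up sample names
print(sketch_count_outputs(["SM1", "SM2"], sample_name="SM2"))
print(sketch_count_outputs(["SM1"], package="cellranger-arc"))
```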

auto_process_ngs.qc.outputs.cellranger_multi_output(project, config_csv, sample_name=None, prefix='cellranger_multi')

Generate list of ‘cellranger multi’ outputs

Given an AnalysisProject, the outputs from ‘cellranger multi’ will look like:

  • {PREFIX}/outs/multi/multiplexing_analysis/tag_calls_summary.csv

and

  • {PREFIX}/outs/per_sample_outs/{SAMPLE_n}/metrics_summary.csv

  • {PREFIX}/outs/per_sample_outs/{SAMPLE_n}/web_summary.html

for each multiplexed SAMPLE_n defined in the config.csv file (nb these are not equivalent to the ‘samples’ defined by the Fastq files in the project).

If a sample name is supplied then outputs are limited to those for that sample; if the supplied config.csv file isn’t found then no outputs will be returned.

Parameters:
  • project (AnalysisProject) – project to generate output names for

  • config_csv (str) – path to the cellranger multi config.csv file

  • sample_name (str) – multiplexed sample to limit outputs to (optional)

  • prefix (str) – directory for outputs (optional, defaults to “cellranger_multi”)

Returns:

cellranger multi outputs (without leading paths)

Return type:

tuple

auto_process_ngs.qc.outputs.check_cellranger_arc_count_outputs(project, qc_dir=None, prefix='cellranger_count')

Return samples missing QC outputs from ‘cellranger-arc count’

Returns a list of the samples from a project for which one or more associated outputs from cellranger-arc count don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

list of sample names with missing outputs

Return type:

List

auto_process_ngs.qc.outputs.check_cellranger_atac_count_outputs(project, qc_dir=None, prefix='cellranger_count')

Return samples missing QC outputs from ‘cellranger-atac count’

Returns a list of the samples from a project for which one or more associated outputs from cellranger-atac count don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

list of sample names with missing outputs

Return type:

List

auto_process_ngs.qc.outputs.check_cellranger_count_outputs(project, qc_dir=None, prefix='cellranger_count')

Return samples missing QC outputs from ‘cellranger count’

Returns a list of the samples from a project for which one or more associated outputs from cellranger count don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to QC directory (if not the default QC directory for the project)

  • prefix (str) – directory for outputs (defaults to “cellranger_count”)

Returns:

list of sample names with missing outputs

Return type:

List
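
The idea behind the ‘check’ functions above (report every sample for which at least one expected output is absent from the QC directory) can be sketched as below. The sketch is a simplified stand-in rather than the library's implementation: `samples_missing_outputs` is a hypothetical name, it works from a list of sample names rather than an AnalysisProject, and it only covers the two ‘cellranger count’ files.

```python
import os
import tempfile


def samples_missing_outputs(qc_dir, sample_names, prefix="cellranger_count"):
    """Return samples lacking one or more expected files (hypothetical)."""
    missing = []
    for name in sample_names:
        expected = [
            os.path.join(qc_dir, prefix, name, "outs", filename)
            for filename in ("metrics_summary.csv", "web_summary.html")
        ]
        # A sample is reported if any expected file is absent
        if not all(os.path.exists(path) for path in expected):
            missing.append(name)
    return missing


# Build a throwaway QC directory with outputs for SM1 only
with tempfile.TemporaryDirectory() as qc_dir:
    outs = os.path.join(qc_dir, "cellranger_count", "SM1", "outs")
    os.makedirs(outs)
    for filename in ("metrics_summary.csv", "web_summary.html"):
        open(os.path.join(outs, filename), "w").close()
    print(samples_missing_outputs(qc_dir, ["SM1", "SM2"]))  # ['SM2']
```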

auto_process_ngs.qc.outputs.check_fastq_screen_outputs(project, qc_dir, screen, fastqs=None, read_numbers=None, legacy=False)

Return Fastqs missing QC outputs from FastqScreen

Returns a list of the Fastqs from a project for which one or more associated outputs from FastqScreen don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to the QC directory (relative path is assumed to be a subdirectory of the project)

  • screen (str) – screen name to check

  • fastqs (list) – optional list of Fastqs to check against (defaults to Fastqs from the project)

  • read_numbers (list) – read numbers to define Fastqs to predict outputs for; if not set then all non-index reads will be included

  • legacy (bool) – if True then check for ‘legacy’-style names (default: False)

Returns:

list of Fastq files with missing outputs.

Return type:

List

auto_process_ngs.qc.outputs.check_fastq_strand_outputs(project, qc_dir, fastq_strand_conf, fastqs=None, read_numbers=None)

Return Fastqs missing QC outputs from fastq_strand.py

Returns a list of the Fastqs from a project for which one or more associated outputs from fastq_strand.py don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to the QC directory (relative path is assumed to be a subdirectory of the project)

  • fastq_strand_conf (str) – path to a fastq_strand config file; strandedness QC outputs will be included unless the path is None or the config file doesn’t exist. Relative path is assumed to be a subdirectory of the project

  • fastqs (list) – optional list of Fastqs to check against (defaults to Fastqs from the project)

  • read_numbers (list) – read numbers to predict outputs for

Returns:

list of Fastq file “pairs” with missing outputs; pairs are (R1,R2) tuples, with ‘R2’ missing if only one Fastq is used for the strandedness determination.

Return type:

List

auto_process_ngs.qc.outputs.check_fastqc_outputs(project, qc_dir, fastqs=None, read_numbers=None)

Return Fastqs missing QC outputs from FastQC

Returns a list of the Fastqs from a project for which one or more associated outputs from FastQC don’t exist in the specified QC directory.

Parameters:
  • project (AnalysisProject) – project to check the QC outputs for

  • qc_dir (str) – path to the QC directory (relative path is assumed to be a subdirectory of the project)

  • fastqs (list) – optional list of Fastqs to check against (defaults to Fastqs from the project)

  • read_numbers (list) – read numbers to predict outputs for

Returns:

list of Fastq files with missing outputs.

Return type:

List

auto_process_ngs.qc.outputs.fastq_screen_output(fastq, screen_name, legacy=False)

Generate name of fastq_screen output files

Given a Fastq file name and a screen name, the outputs from fastq_screen will look like:

  • {FASTQ}_screen_{SCREEN_NAME}.png

  • {FASTQ}_screen_{SCREEN_NAME}.txt

“Legacy” screen outputs look like:

  • {FASTQ}_{SCREEN_NAME}_screen.png

  • {FASTQ}_{SCREEN_NAME}_screen.txt

Parameters:
  • fastq (str) – name of Fastq file

  • screen_name (str) – name of screen

  • legacy (bool) – if True then use ‘legacy’ (old-style) naming convention (default: False)

Returns:

fastq_screen output names (without leading path)

Return type:

tuple
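
The difference between the current and “legacy” naming conventions can be sketched as below. This is not the library function: `sketch_fastq_screen_output` and `strip_fastq_extensions` are hypothetical names, and the way {FASTQ} is derived (dropping any leading directory and the ‘.gz’, ‘.fastq’ or ‘.fq’ extensions) is an assumption.

```python
import os


def strip_fastq_extensions(fastq):
    """Assumed derivation of {FASTQ}: basename minus Fastq extensions."""
    name = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    return name


def sketch_fastq_screen_output(fastq, screen_name, legacy=False):
    """Build (png, txt) output names for a screen (hypothetical helper)."""
    base = strip_fastq_extensions(fastq)
    if legacy:
        # Old style: screen name comes before '_screen'
        stem = f"{base}_{screen_name}_screen"
    else:
        stem = f"{base}_screen_{screen_name}"
    return (f"{stem}.png", f"{stem}.txt")


# Made-up Fastq and screen names
print(sketch_fastq_screen_output("PJB_S1_R1_001.fastq.gz", "model_organisms"))
print(sketch_fastq_screen_output("PJB_S1_R1_001.fastq.gz", "model_organisms",
                                 legacy=True))
```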

auto_process_ngs.qc.outputs.fastq_strand_output(fastq)

Generate name for fastq_strand.py output

Given a Fastq file name, the output from fastq_strand.py will look like:

  • {FASTQ}_fastq_strand.txt

Parameters:

fastq (str) – name of Fastq file

Returns:

fastq_strand.py output (without leading paths)

Return type:

tuple

auto_process_ngs.qc.outputs.fastqc_output(fastq)

Generate name of FastQC outputs

Given a Fastq file name, the outputs from FastQC will look like:

  • {FASTQ}_fastqc/

  • {FASTQ}_fastqc.html

  • {FASTQ}_fastqc.zip

Parameters:

fastq (str) – name of Fastq file

Returns:

FastQC outputs (without leading paths)

Return type:

tuple
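
The three FastQC names listed above can be sketched in the same way. As before, this is an illustration under assumptions: `sketch_fastqc_output` is a hypothetical name, and {FASTQ} is assumed to be the file's basename with the ‘.gz’, ‘.fastq’ or ‘.fq’ extensions removed.

```python
import os


def sketch_fastqc_output(fastq):
    """Build (directory, html, zip) names for FastQC (hypothetical helper)."""
    # Assumed derivation of {FASTQ}: drop directory and Fastq extensions
    base = os.path.basename(fastq)
    for ext in (".gz", ".fastq", ".fq"):
        if base.endswith(ext):
            base = base[:-len(ext)]
    stem = f"{base}_fastqc"
    # Output directory, HTML report and ZIP archive, in the listed order
    return (stem, f"{stem}.html", f"{stem}.zip")


# Made-up Fastq name
print(sketch_fastqc_output("PJB_S1_R1_001.fastq.gz"))
```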

auto_process_ngs.qc.outputs.picard_collect_insert_size_metrics_output(filen, prefix=None)

Generate names of Picard CollectInsertSizeMetrics output

Given a Fastq or BAM file name, the output from Picard’s CollectInsertSizeMetrics function will look like:

  • {PREFIX}/{FASTQ}.insert_size_metrics.txt

  • {PREFIX}/{FASTQ}.insert_size_histogram.pdf

Parameters:
  • filen (str) – name of Fastq or BAM file

  • prefix (str) – optional directory to prepend to outputs

Returns:

CollectInsertSizeMetrics output (without leading paths)

Return type:

tuple

auto_process_ngs.qc.outputs.qualimap_rnaseq_output(prefix=None)

Generate names of Qualimap ‘rnaseq’ output

The outputs from Qualimap ‘rnaseq’ are always:

  • {PREFIX}/qualimapReport.html

  • {PREFIX}/rnaseq_qc_results.txt

  • {PREFIX}/...

Parameters:

prefix (str) – optional directory to prepend to outputs

Returns:

Qualimap ‘rnaseq’ output (without leading paths)

Return type:

tuple

auto_process_ngs.qc.outputs.rseqc_genebody_coverage_output(name, prefix=None)

Generate names of RSeQC geneBody_coverage.py output

Given a basename, the output from geneBody_coverage.py will look like:

  • {PREFIX}/{NAME}.geneBodyCoverage.curves.png

  • {PREFIX}/{NAME}.geneBodyCoverage.r

  • {PREFIX}/{NAME}.geneBodyCoverage.txt

Parameters:
  • name (str) – basename for output files

  • prefix (str) – optional directory to prepend to outputs

Returns:

geneBody_coverage.py output (without leading paths)

Return type:

tuple
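
A minimal sketch of the naming scheme above, showing the effect of the optional prefix. The name `sketch_genebody_coverage_output` and the example basename are hypothetical, and omitting the directory entirely when no prefix is given is an assumption about how `prefix=None` behaves.

```python
def sketch_genebody_coverage_output(name, prefix=None):
    """Build geneBody_coverage.py output names (hypothetical helper)."""
    suffixes = (
        "geneBodyCoverage.curves.png",
        "geneBodyCoverage.r",
        "geneBodyCoverage.txt",
    )
    outputs = [f"{name}.{suffix}" for suffix in suffixes]
    if prefix is not None:
        # Prepend the optional directory to each name
        outputs = [f"{prefix}/{output}" for output in outputs]
    return tuple(outputs)


# Made-up basename and prefix
print(sketch_genebody_coverage_output("PJB"))
print(sketch_genebody_coverage_output("PJB", prefix="rseqc_genebody_coverage"))
```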