auto_process_ngs.stats

stats.py

Classes and functions for collecting and reporting statistics for a run:

  • FastqStatistics: collects and reports stats on FASTQs from an Illumina sequencing run

  • FastqStats: container for storing data about a FASTQ file

  • collect_fastq_data: collect data from FASTQ file in a FastqStats instance

class auto_process_ngs.stats.FastqStatistics(illumina_data, n_processors=1, add_to=None)

Class for collecting and reporting stats on Illumina FASTQs

Given a directory with fastq(.gz) files arranged in the same structure as the output from bcl2fastq or bcl2fastq2, collects statistics for each file and provides methods for reporting different aspects.

Example usage:

>>> from bcftbx.IlluminaData import IlluminaData
>>> from auto_process_ngs.stats import FastqStatistics
>>> data = IlluminaData('120117_BLAH_JSHJHXXX','bcl2fastq')
>>> stats = FastqStatistics(data)
>>> stats.report_basic_stats('basic_stats.out')

property lane_names

Return list of lane names (e.g. ['L1', 'L2', …])

property raw

Return the ‘raw’ statistics TabFile instance

report_basic_stats(out_file=None, fp=None)

Report the ‘basic’ statistics

For each FASTQ file, report the following information:

  • Project name

  • Sample name

  • FASTQ file name (without leading directory)

  • Size (human-readable)

  • Nreads (number of reads)

  • Paired_end (‘Y’ for paired-end, ‘N’ for single-end)

Parameters:
  • out_file (str) – name of file to write report to (used if ‘fp’ is not supplied)

  • fp (File) – File-like object open for writing (defaults to stdout if ‘out_file’ also not supplied)
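
The precedence between 'out_file' and 'fp' described above (shared by all the report methods) can be sketched as follows. This is a minimal illustration and not the library's own code: open_report_stream is a hypothetical helper name, and the file name and the line written are placeholders.

```python
import io
import sys
from contextlib import contextmanager

@contextmanager
def open_report_stream(out_file=None, fp=None):
    """Yield a writable stream following the documented precedence.

    Hypothetical helper: 'fp' is used if supplied, then 'out_file',
    otherwise stdout.
    """
    if fp is not None:
        # The caller owns this stream, so it is not closed here
        yield fp
    elif out_file is not None:
        # Open the named file and close it when the block exits
        with open(out_file, "wt") as stream:
            yield stream
    else:
        yield sys.stdout

# Capture a report in memory; 'out_file' is ignored because 'fp' is given
buffer = io.StringIO()
with open_report_stream(out_file="ignored.txt", fp=buffer) as stream:
    stream.write("example line\n")
report_text = buffer.getvalue()
```

Passing a file-like object this way is convenient for tests or for appending several reports to one open file.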

report_full_stats(out_file=None, fp=None)

Report all statistics gathered for all FASTQs

Essentially a dump of all the data.

Parameters:
  • out_file (str) – name of file to write report to (used if ‘fp’ is not supplied)

  • fp (File) – File-like object open for writing (defaults to stdout if ‘out_file’ also not supplied)

report_per_lane_sample_stats(out_file=None, fp=None, samplesheet=None)

Report of reads per sample in each lane

Reports the number of reads for each sample in each lane plus the total reads for each lane.

Example output:

    Lane 1
    Total reads = 182851745
    - KatyDobbs/KD-K1               79888058    43.7%
    - KatyDobbs/KD-K3               97854292    53.5%
    - Undetermined_indices/lane1    5109395     2.8%
    …

Parameters:
  • out_file (str) – name of file to write report to (used if ‘fp’ is not supplied)

  • fp (File) – File-like object open for writing (defaults to stdout if ‘out_file’ also not supplied)

  • samplesheet (str) – optional sample sheet file to get additional data from
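
The arithmetic behind the per-lane report can be sketched as below, using the read counts from the example output. This is an illustration only: format_lane_report is a hypothetical function, and the tab-separated layout is an assumption rather than the library's confirmed output format.

```python
def format_lane_report(lane, sample_reads):
    """Return report lines for one lane.

    Hypothetical helper. 'sample_reads' maps a 'project/sample'
    label to the number of reads for that sample in the lane.
    """
    total = sum(sample_reads.values())
    lines = ["Lane %d" % lane, "Total reads = %d" % total]
    for name, nreads in sample_reads.items():
        # Guard against an empty lane to avoid dividing by zero
        percent = 100.0 * nreads / total if total else 0.0
        lines.append("- %s\t%d\t%.1f%%" % (name, nreads, percent))
    return lines

# Read counts taken from the example output above
lane1 = {
    "KatyDobbs/KD-K1": 79888058,
    "KatyDobbs/KD-K3": 97854292,
    "Undetermined_indices/lane1": 5109395,
}
report = format_lane_report(1, lane1)
print("\n".join(report))
```

The lane total is the sum of the per-sample counts, and each percentage is that sample's share of the total.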

report_per_lane_summary_stats(out_file=None, fp=None)

Report summary of total and unassigned reads per-lane

Parameters:
  • out_file (str) – name of file to write report to (used if ‘fp’ is not supplied)

  • fp (File) – File-like object open for writing (defaults to stdout if ‘out_file’ also not supplied)
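
A per-lane summary of total and unassigned reads could be derived along these lines. This is a sketch under assumptions: summarise_lane is a hypothetical function, and it assumes unassigned reads are stored under labels starting with "Undetermined", as in the per-sample example output.

```python
def summarise_lane(sample_reads, undetermined_prefix="Undetermined"):
    """Return (total, assigned, unassigned, percent_unassigned).

    Hypothetical helper. 'sample_reads' maps 'project/sample' labels
    to read counts for a single lane.
    """
    total = sum(sample_reads.values())
    # Reads whose label marks them as not assigned to any sample
    unassigned = sum(n for name, n in sample_reads.items()
                     if name.startswith(undetermined_prefix))
    assigned = total - unassigned
    percent = 100.0 * unassigned / total if total else 0.0
    return total, assigned, unassigned, percent

lane1 = {
    "KatyDobbs/KD-K1": 79888058,
    "KatyDobbs/KD-K3": 97854292,
    "Undetermined_indices/lane1": 5109395,
}
total, assigned, unassigned, percent = summarise_lane(lane1)
```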

class auto_process_ngs.stats.FastqStats(fastq, project, sample)

Container for storing data about a FASTQ file

This is a convenience wrapper that holds together the data for a FASTQ file (full path, associated project and sample names, number of reads and file size).

property lanes

Lane numbers associated with the FASTQ file

property name

FASTQ file name without leading directory

property read_number

Read number extracted from the FASTQ name
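
How the file name, lane and read number might be derived from a FASTQ path is sketched below. It assumes bcl2fastq-style names carrying `_L<lane>_` and `_R<read>_` fields. parse_fastq_name and the example path are hypothetical, and the library may obtain these values differently.

```python
import os
import re

def parse_fastq_name(fastq):
    """Return (name, lane, read_number) parsed from a FASTQ path.

    Hypothetical helper. 'lane' and 'read_number' are None when
    the corresponding field is absent from the file name.
    """
    # Strip the leading directory, as the 'name' property does
    name = os.path.basename(fastq)
    # Lane field: '_L' plus three digits, followed by '_' or '.'
    lane = re.search(r"_L(\d{3})(?=[_.])", name)
    # Read field: '_R' plus one digit, followed by '_' or '.'
    read = re.search(r"_R(\d)(?=[_.])", name)
    return (name,
            int(lane.group(1)) if lane else None,
            int(read.group(1)) if read else None)

info = parse_fastq_name("/data/run/KD-K1_S1_L001_R2_001.fastq.gz")
```

A name without a lane field (for example, output generated without lane splitting) yields None for the lane in this sketch.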

auto_process_ngs.stats.collect_fastq_data(fqstats)

Collect data from FASTQ file in a FastqStats instance

Given a FastqStats instance, collects and sets the following properties derived from the corresponding FASTQ file stored in that instance:

  • nreads: total number of reads

  • fsize: file size

  • reads_by_lane: (R1 FASTQs only) dictionary where keys are lane numbers and values are read counts

Note that if the FASTQ file is an R2 (or higher) file then the reads per lane will not be set.

Parameters:
  • fqstats (FastqStats) – FastqStats instance

Returns:

The input FastqStats instance, with the appropriate properties updated.

Return type:

FastqStats
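
The three values collected by this function can be computed with a single pass over the file, as in the sketch below. It is not the library's implementation: count_reads is a hypothetical function, and it assumes four-line FASTQ records with Illumina-style headers in which the lane is the fourth colon-separated field. Unlike the documented function, it always counts reads per lane and does not skip R2 (or higher) files.

```python
import gzip
import os
import tempfile
from collections import Counter

def count_reads(fastq):
    """Return (nreads, fsize, reads_by_lane) for a FASTQ file.

    Hypothetical helper; handles plain and gzipped files.
    """
    opener = gzip.open if fastq.endswith(".gz") else open
    reads_by_lane = Counter()
    nreads = 0
    with opener(fastq, "rt") as fp:
        for i, line in enumerate(fp):
            if i % 4 == 0:
                # Header line: '@INSTRUMENT:RUN:FLOWCELL:LANE:...'
                nreads += 1
                lane = int(line.split(":")[3])
                reads_by_lane[lane] += 1
    return nreads, os.path.getsize(fastq), dict(reads_by_lane)

# Demonstration with three made-up records (two in lane 1, one in lane 2)
records = (
    "@M1:1:FC:1:1101:10:20 1:N:0:ACGT\nACGT\n+\nIIII\n"
    "@M1:1:FC:1:1101:11:21 1:N:0:ACGT\nTTTT\n+\nIIII\n"
    "@M1:1:FC:2:1101:12:22 1:N:0:ACGT\nGGGG\n+\nIIII\n"
)
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_S1_R1_001.fastq")
    with open(path, "wt") as out:
        out.write(records)
    nreads, fsize, reads_by_lane = count_reads(path)
```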