auto_process_ngs.qc.seqlens

class auto_process_ngs.qc.seqlens.SeqLens(json_file=None, data=None, fastq=None)

Wrapper class for handling sequence length data

The SeqLens class wraps the sequence length data produced by the ‘get_sequence_lengths’ function.

property data

Return the raw data dictionary

property dist

Return the sequence length distribution

This is a dictionary where the keys are sequence lengths and the values are the corresponding number of sequences
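To illustrate how a distribution of this shape relates to the summary properties (such as the read count, minimum, maximum and mean), the following minimal sketch derives those values from a plain dictionary. The helper `summarise_dist` is hypothetical and is not part of `auto_process_ngs`; it also assumes that the keys are integers, which may not hold for data loaded directly from JSON.

```python
def summarise_dist(dist):
    """Summarise a {length: count} dictionary (hypothetical helper)."""
    nreads = sum(dist.values())
    if nreads == 0:
        # No reads: nothing meaningful to report
        return {"nreads": 0, "min_length": None,
                "max_length": None, "mean": None}
    # Only lengths with at least one read contribute to min/max
    lengths = [length for length, count in dist.items() if count > 0]
    # Mean length is the total number of bases divided by the read count
    total_bases = sum(length * count for length, count in dist.items())
    return {
        "nreads": nreads,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean": total_bases / nreads,
    }


# Two reads of length 50, one of length 75 and one of length 100
example = summarise_dist({50: 2, 75: 1, 100: 1})
```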

property fastq

Return the path to the associated Fastq file

property frac_masked

Return the fraction of masked reads

property frac_padded

Return the fraction of padded reads

property masked_dist

Return the masked sequence length distribution

This is a dictionary where the keys are sequence lengths and the values are the corresponding number of masked sequences

property max_length

Return the maximum sequence length

property mean

Return the mean sequence length

property min_length

Return the minimum sequence length

property nmasked

Return the number of masked reads in the Fastq

property npadded

Return the number of padded reads in the Fastq

property nreads

Return the number of reads in the Fastq

property range

Return the range of sequence lengths as a string

auto_process_ngs.qc.seqlens.get_sequence_lengths(fastq, outfile=None, show_progress=False, limit=None)

Get sequence lengths and masking statistics for a Fastq file

Returns a dictionary with the following keys:

  • fastq: the Fastq file that metrics were calculated from

  • nreads: total number of reads processed

  • nreads_masked: number of reads that are completely masked (i.e. consist only of ‘N’s)

  • nreads_padded: number of partially masked reads (i.e. contain trailing ‘N’s)

  • frac_reads_masked: fraction of the processed reads which are masked

  • frac_reads_padded: fraction of the processed reads which are padded

  • min_length: minimum read length

  • max_length: maximum read length

  • mean_length: mean read length

  • median_length: median read length

  • seq_lengths_dist: distribution of lengths for all reads

  • seq_lengths_masked_dist: distribution of lengths for masked reads

  • seq_lengths_padded_dist: distribution of lengths for padded reads

The distributions are each themselves dictionaries where the keys are read lengths and the values are the number of reads with the matching length; note that only lengths with a non-zero number of reads are included (zeroes are implied for all other lengths).
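The shape of these distributions can be sketched with `collections.Counter`. This is an illustration only, not the library's implementation: the function name `length_distributions` is hypothetical, and the classification rules (a read made up entirely of ‘N’s is masked; any other read ending in ‘N’ is padded) are an assumption based on the key descriptions above.

```python
from collections import Counter


def length_distributions(sequences):
    """Build {length: count} dictionaries for all, masked and padded reads.

    Hypothetical helper; the masked/padded rules are assumptions.
    """
    all_dist = Counter()
    masked_dist = Counter()
    padded_dist = Counter()
    for seq in sequences:
        length = len(seq)
        all_dist[length] += 1
        if seq and set(seq) == {"N"}:
            # Completely masked: the read consists only of 'N's
            masked_dist[length] += 1
        elif seq.endswith("N"):
            # Partially masked: the read has trailing 'N's
            padded_dist[length] += 1
    # Plain dictionaries only hold lengths that were actually seen
    return dict(all_dist), dict(masked_dist), dict(padded_dist)


all_dist, masked_dist, padded_dist = length_distributions(
    ["ACGT", "NNNN", "ACNN", "ACGTA"]
)
# Lengths that never occurred are absent, so use .get() to read them as zero
count_for_length_6 = all_dist.get(6, 0)
```

Because zero counts are implied rather than stored, lookups for arbitrary lengths are safest with `dict.get(length, 0)`.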

Parameters:
  • fastq (str) – path to Fastq file

  • outfile (str) – optional, path to output JSON file

  • show_progress (bool) – if True then print a progress message to stdout every 100000 reads (default: operate silently)

  • limit (int) – if set then only process this number of reads from the head of the Fastq and return stats based on these (default: process all reads in the file)

Returns:

Dictionary containing the metrics for the Fastq.

Return type:

Dictionary
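The behaviour of the `limit` and `outfile` parameters can be pictured with the stand-alone sketch below. It does not call or reproduce `auto_process_ngs`: the names `iter_sequences`, `sketch_metrics` and `load_metrics` are hypothetical, only a subset of the documented keys is computed, and the input is assumed to be an uncompressed Fastq with four lines per record. It also assumes the JSON file holds the same dictionary that is returned; since JSON object keys are always strings, the integer lengths are restored on loading.

```python
import json
from itertools import islice


def iter_sequences(fastq_path, limit=None):
    """Yield sequences from an uncompressed, 4-line-per-record Fastq."""
    with open(fastq_path, "rt") as fp:
        # The sequence is the second line of each 4-line record
        seq_lines = islice(fp, 1, None, 4)
        # limit=None means "read every record"
        for line in islice(seq_lines, limit):
            yield line.strip()


def sketch_metrics(fastq_path, outfile=None, limit=None):
    """Compute a subset of the metrics (hypothetical stand-in)."""
    dist = {}
    for seq in iter_sequences(fastq_path, limit=limit):
        dist[len(seq)] = dist.get(len(seq), 0) + 1
    metrics = {
        "fastq": fastq_path,
        "nreads": sum(dist.values()),
        "min_length": min(dist) if dist else None,
        "max_length": max(dist) if dist else None,
        "seq_lengths_dist": dist,
    }
    if outfile is not None:
        # Integer keys are written out as strings by json.dump
        with open(outfile, "wt") as fp:
            json.dump(metrics, fp)
    return metrics


def load_metrics(json_file):
    """Reload metrics, converting distribution keys back to integers."""
    with open(json_file, "rt") as fp:
        metrics = json.load(fp)
    metrics["seq_lengths_dist"] = {
        int(length): count
        for length, count in metrics["seq_lengths_dist"].items()
    }
    return metrics
```

With a helper of this kind, `sketch_metrics("example.fastq", limit=1000)` would report statistics for the first 1000 reads only, where `example.fastq` is a placeholder path.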