`auto_process_ngs.qc.seqlens`

class auto_process_ngs.qc.seqlens.SeqLens(json_file=None, data=None, fastq=None)

Wrapper class for handling sequence length data

The SeqLens class wraps the sequence length data output by the ‘get_sequence_lengths’ function.

property data: Return the raw data dictionary

property dist

Return the sequence length distribution

This is a dictionary where the keys are sequence lengths and the values are the corresponding number of sequences
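As an illustration of how a distribution of this shape can be used, the sketch below works on a plain dictionary rather than on a SeqLens instance. The example values and the helper names `total_reads`, `mean_length` and `dense_counts` are made up for illustration and are not part of `auto_process_ngs`; it also assumes that lengths with no sequences are simply absent from the dictionary.

```python
def total_reads(dist):
    """Total number of sequences across all lengths."""
    return sum(dist.values())


def mean_length(dist):
    """Mean sequence length, weighted by the number of sequences."""
    n = total_reads(dist)
    if n == 0:
        return 0.0
    return sum(length * count for length, count in dist.items()) / n


def dense_counts(dist):
    """(length, count) pairs from the shortest to the longest length,
    with explicit zeroes for lengths missing from the dictionary."""
    if not dist:
        return []
    return [(length, dist.get(length, 0))
            for length in range(min(dist), max(dist) + 1)]


# Made-up distribution: keys are lengths, values are sequence counts
example_dist = {48: 2, 50: 6, 51: 2}

print(total_reads(example_dist))
print(mean_length(example_dist))
print(dense_counts(example_dist))
```

Filling in the zeroes this way can be convenient when plotting a histogram, since the stored dictionary only needs to hold the lengths that actually occur.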

property fastq: Return the path to the associated Fastq file

property frac_masked: Return the fraction of masked reads

property frac_padded: Return the fraction of padded reads

property masked_dist

Return the masked sequence length distribution

This is a dictionary where the keys are sequence lengths and the values are the corresponding number of masked sequences

property max_length: Return the maximum sequence length

property mean: Return the mean sequence length

property min_length: Return the minimum sequence length

property nmasked: Return the number of masked reads in the Fastq

property npadded: Return the number of padded reads in the Fastq

property nreads: Return the number of reads in the Fastq

property range: Return the range of sequence lengths as a string

auto_process_ngs.qc.seqlens.get_sequence_lengths(fastq, outfile=None, show_progress=False, limit=None)

Get sequence lengths and masking statistics for a Fastq file

Returns a dictionary with the following keys:

fastq: the Fastq file that metrics were calculated from
nreads: total number of reads processed
nreads_masked: number of reads that are completely masked (i.e. consist only of ‘N’s)
nreads_padded: number of partially masked reads (i.e. contain trailing ‘N’s)
frac_reads_masked: fraction of the processed reads which are masked
frac_reads_padded: fraction of the processed reads which are padded
min_length: minimum read length
max_length: maximum read length
mean_length: mean read length
median_length: median read length
seq_lengths_dist: distribution of lengths for all reads
seq_lengths_masked_dist: distribution of lengths for masked reads
seq_lengths_padded_dist: distribution of lengths for padded reads

The distributions are each themselves dictionaries where the keys are read lengths and the values are the number of reads with the matching length; note that only lengths with a non-zero number of reads are included (zeroes are implied for all other lengths).
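To make the meaning of these keys concrete, the following is a minimal sketch of how such metrics could be computed. It is not the implementation used by `auto_process_ngs`: the function name `sketch_sequence_lengths` is hypothetical, it reads uncompressed four-line Fastq records only, and it assumes a read counts as "padded" when it ends with 'N' but is not entirely 'N's.

```python
from collections import Counter
from itertools import islice
from statistics import median


def sketch_sequence_lengths(fastq, limit=None):
    """Illustrative only: collect length and masking stats for a Fastq."""
    dist = Counter()         # lengths of all reads
    masked_dist = Counter()  # lengths of reads made up only of 'N's
    padded_dist = Counter()  # lengths of reads with trailing 'N's
    lengths = []
    with open(fastq, "rt") as fp:
        # The sequence is the second line of each four-line record
        seqs = islice(fp, 1, None, 4)
        if limit is not None:
            # Only look at the first 'limit' reads
            seqs = islice(seqs, limit)
        for line in seqs:
            seq = line.strip()
            n = len(seq)
            lengths.append(n)
            dist[n] += 1
            if n > 0 and set(seq) == {"N"}:
                masked_dist[n] += 1
            elif seq.endswith("N"):
                padded_dist[n] += 1
    nreads = len(lengths)
    nmasked = sum(masked_dist.values())
    npadded = sum(padded_dist.values())
    return {
        "fastq": fastq,
        "nreads": nreads,
        "nreads_masked": nmasked,
        "nreads_padded": npadded,
        "frac_reads_masked": nmasked / nreads if nreads else 0.0,
        "frac_reads_padded": npadded / nreads if nreads else 0.0,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": sum(lengths) / nreads if nreads else 0.0,
        "median_length": median(lengths) if lengths else 0,
        # Counters only hold lengths that were seen, so zero counts
        # are implied for every other length
        "seq_lengths_dist": dict(dist),
        "seq_lengths_masked_dist": dict(masked_dist),
        "seq_lengths_padded_dist": dict(padded_dist),
    }
```

The sketch keeps every read length in memory in order to take the median, which is simple but may not suit very large files; the median could instead be derived from the distribution.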

Parameters:

fastq (str) – path to Fastq file
outfile (str) – optional, path to output JSON file
show_progress (bool) – if True then print message to stdout every 100000 reads indicating progress (default: operate silently)
limit (int) – if set then only process this number of reads from the head of the Fastq and return stats based on these (default: process all reads in the file)

Returns:

Dictionary containing the metrics for the Fastq.

Return type:

Dictionary
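One practical point when the metrics are written to a JSON file (the `outfile` option) and read back with Python's standard `json` module: JSON object keys are always strings, so the integer read lengths used as keys in the distributions come back as strings. The sketch below shows this round trip on made-up data; the helper `int_keyed` is hypothetical, and whether `auto_process_ngs` performs an equivalent conversion internally is not stated here.

```python
import json


def int_keyed(dist):
    """Return a copy of a distribution with its keys converted to int."""
    return {int(length): count for length, count in dist.items()}


# Made-up metrics with integer read lengths as keys
stats = {"nreads": 5, "seq_lengths_dist": {50: 3, 51: 2}}

# Serialising and reloading turns the integer keys into strings
reloaded = json.loads(json.dumps(stats))

# Convert back before doing arithmetic on the lengths
restored = int_keyed(reloaded["seq_lengths_dist"])
```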
