auto_process_ngs.qc.seqlens
- class auto_process_ngs.qc.seqlens.SeqLens(json_file=None, data=None, fastq=None)
Wrapper class for handling sequence length data
The SeqLens class wraps data handling for sequence length data output from the ‘get_sequence_lengths’ function.
- property data
Return the raw data dictionary
- property dist
Return the sequence length distribution
This is a dictionary where the keys are sequence lengths and the values are the corresponding number of sequences
- property fastq
Return the path to the associated Fastq file
- property frac_masked
Return the fraction of masked reads
- property frac_padded
Return the fraction of padded reads
- property masked_dist
Return the masked sequence length distribution
This is a dictionary where the keys are sequence lengths and the values are the corresponding number of masked sequences
- property max_length
Return the maximum sequence length
- property mean
Return the mean sequence length
- property min_length
Return the minimum sequence length
- property nmasked
Return the number of masked reads in the Fastq
- property npadded
Return the number of padded reads in the Fastq
- property nreads
Return the number of reads in the Fastq
- property range
Return the range of sequence lengths as a string
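The relationship between a length distribution dictionary and summary values such as the mean, minimum, maximum and range can be illustrated with a small standalone sketch. It does not use or reproduce the SeqLens implementation: `summarise_dist` is a hypothetical helper, and the "min-max" range string format is an assumption, since the actual format is not specified here.

```python
def summarise_dist(dist):
    """Summarise a {length: count} distribution (hypothetical helper)."""
    nreads = sum(dist.values())
    if nreads == 0:
        raise ValueError("distribution contains no reads")
    min_length = min(dist)
    max_length = max(dist)
    # Weighted mean: each length contributes once per read of that length
    mean = sum(length * count for length, count in dist.items()) / nreads
    # Assumed formatting: a single value if all reads have the same length,
    # otherwise "min-max"
    if min_length == max_length:
        length_range = str(min_length)
    else:
        length_range = f"{min_length}-{max_length}"
    return {
        "nreads": nreads,
        "min_length": min_length,
        "max_length": max_length,
        "mean": mean,
        "range": length_range,
    }


summary = summarise_dist({50: 2, 75: 1, 100: 1})
print(summary["mean"], summary["range"])
```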
- auto_process_ngs.qc.seqlens.get_sequence_lengths(fastq, outfile=None, show_progress=False, limit=None)
Get sequence lengths and masking statistics for Fastq
Returns a dictionary with the following keys:
fastq: the Fastq file that metrics were calculated from
nreads: total number of reads processed
nreads_masked: number of reads that are completely masked (i.e. consist only of ‘N’s)
nreads_padded: number of partially masked reads (i.e. contain trailing ‘N’s)
frac_reads_masked: fraction of the processed reads which are masked
frac_reads_padded: fraction of the processed reads which are padded
min_length: minimum read length
max_length: maximum read length
mean_length: mean read length
median_length: median read length
seq_lengths_dist: distribution of lengths for all reads
seq_lengths_masked_dist: distribution of lengths for masked reads
seq_lengths_padded_dist: distribution of lengths for padded reads
The distributions are each themselves dictionaries where the keys are read lengths and the values are the number of reads with the matching length; note that only lengths with a non-zero number of reads are included (zeroes are implied for all other lengths).
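Because lengths with zero reads are omitted, code that plots or tabulates a distribution may need to restore the implied zeroes. The following is a minimal sketch of one way to do that; `dense_counts` is a hypothetical helper and not part of the module. It also converts keys with `int()`, on the assumption that the dictionary may have been read back from a JSON file, where object keys are always strings.

```python
def dense_counts(dist, min_length=None, max_length=None):
    """Expand a sparse, non-empty {length: count} distribution into
    (length, count) pairs with explicit zeroes (hypothetical helper)."""
    # Keys loaded from JSON are strings, so normalise them to integers
    counts = {int(length): count for length, count in dist.items()}
    lo = min(counts) if min_length is None else min_length
    hi = max(counts) if max_length is None else max_length
    # Lengths absent from the sparse dictionary have an implied count of zero
    return [(length, counts.get(length, 0)) for length in range(lo, hi + 1)]


for length, count in dense_counts({"3": 2, "5": 1}):
    print(length, count)
```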
- Parameters:
fastq (str) – path to Fastq file
outfile (str) – optional, path to output JSON file
show_progress (bool) – if True then print message to stdout every 100000 reads indicating progress (default: operate silently)
limit (int) – if set then only process this number of reads from the head of the Fastq and return stats based on these (default: process all reads in the file)
- Returns:
a dictionary containing the metrics for the Fastq.
- Return type:
Dictionary
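To make the documented metrics concrete, here is a simplified standalone sketch of how they could be computed. It is not the implementation of get_sequence_lengths: `sketch_sequence_lengths` is a hypothetical name, and the sketch assumes an uncompressed Fastq with exactly four lines per record. It treats a read as masked if it consists only of ‘N’s and as padded if it ends in ‘N’ without being fully masked, and it returns None for length statistics when no reads are processed; both choices are assumptions.

```python
import statistics
from collections import Counter
from itertools import islice


def sketch_sequence_lengths(fastq, limit=None):
    """Collect length and masking statistics from a Fastq (illustrative only)."""
    dist, masked_dist, padded_dist = Counter(), Counter(), Counter()
    with open(fastq, "rt") as fp:
        # The sequence is the second line of each 4-line record
        seqs = islice(fp, 1, None, 4)
        if limit is not None:
            # Only process this many reads from the head of the file
            seqs = islice(seqs, limit)
        for line in seqs:
            seq = line.strip()
            dist[len(seq)] += 1
            if set(seq) == {"N"}:
                masked_dist[len(seq)] += 1      # completely masked
            elif seq.endswith("N"):
                padded_dist[len(seq)] += 1      # trailing N's only
    nreads = sum(dist.values())
    nmasked = sum(masked_dist.values())
    npadded = sum(padded_dist.values())
    return {
        "fastq": fastq,
        "nreads": nreads,
        "nreads_masked": nmasked,
        "nreads_padded": npadded,
        "frac_reads_masked": nmasked / nreads if nreads else 0.0,
        "frac_reads_padded": npadded / nreads if nreads else 0.0,
        "min_length": min(dist) if nreads else None,
        "max_length": max(dist) if nreads else None,
        "mean_length": (sum(n * c for n, c in dist.items()) / nreads
                        if nreads else None),
        # elements() repeats each length once per read of that length
        "median_length": (statistics.median(dist.elements())
                          if nreads else None),
        "seq_lengths_dist": dict(dist),
        "seq_lengths_masked_dist": dict(masked_dist),
        "seq_lengths_padded_dist": dict(padded_dist),
    }
```

Only lengths that actually occur end up in the three distribution dictionaries, which matches the sparse layout described above.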