auto_process_ngs.barcodes.splitter
Classes and functions for sorting reads from FASTQ files by the barcodes (i.e. index sequences) in the read headers:
HammingMetrics: class implementing Hamming distance calculators
HammingLookup: class for fetching Hamming distance for two strings
get_fastqs_from_dir: list Fastqs matching a specific lane
split_single_end: split reads from single-end data
split_paired_end: split reads from paired-end data
- class auto_process_ngs.barcodes.splitter.BarcodeMatcher(index_seqs, max_dist=0)
Class to match barcode sequences against reference set
- match(seq)
Find matching index sequence for supplied sequence
- Parameters:
seq (str) – sequence to match against indexes
- Returns:
matching index sequence, or None if no match was found.
- Return type:
String
- property sequences
Return list of stored index sequences
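The matching behaviour described above can be illustrated with a small stand-in class. This is a minimal sketch and not the library's implementation: the class name SketchBarcodeMatcher is hypothetical, and it assumes that an exact match is preferred, that otherwise the first stored index within max_dist mismatches is returned, and that sequences of a different length never match.

```python
class SketchBarcodeMatcher:
    """Illustrative stand-in for BarcodeMatcher (hypothetical, not library code)."""

    def __init__(self, index_seqs, max_dist=0):
        self._index_seqs = list(index_seqs)
        self._max_dist = max_dist

    @property
    def sequences(self):
        # Return a copy so callers cannot modify the stored list
        return list(self._index_seqs)

    def match(self, seq):
        # An exact match always wins
        if seq in self._index_seqs:
            return seq
        # Assumption: otherwise take the first index within max_dist mismatches
        for index_seq in self._index_seqs:
            if len(index_seq) != len(seq):
                continue  # assumption: unequal lengths never match
            dist = sum(a != b for a, b in zip(index_seq, seq))
            if dist <= self._max_dist:
                return index_seq
        return None


matcher = SketchBarcodeMatcher(["ACGTAC", "TTGCAA"], max_dist=1)
print(matcher.match("ACGTAA"))  # one mismatch against ACGTAC
print(matcher.match("GGGGGG"))  # too far from every index
```

With max_dist=0 only exact matches are accepted, which is the default shown in the class signature.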
- class auto_process_ngs.barcodes.splitter.HammingLookup(hamming_func=<bound method HammingMetrics.hamming_distance of <class 'auto_process_ngs.barcodes.splitter.HammingMetrics'>>)
Class for handling Hamming distances for multiple sequences
- dist(s1, s2)
Return the Hamming distance for two sequences
If the sequences are new then calculates, caches and returns the Hamming distance for this pair (calculated using the function supplied on instantiation); otherwise returns the cached distance calculated previously.
- Parameters:
s1 (str) – first sequence
s2 (str) – second sequence
- Returns:
Hamming distance for the two sequences.
- Return type:
Integer
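The calculate-once-then-cache behaviour of dist can be sketched as follows. The names SketchHammingLookup and the standalone hamming_distance function are hypothetical, and treating (s1, s2) and (s2, s1) as the same cache entry is an assumption made here because Hamming distance is symmetric; the library may key its cache differently.

```python
def hamming_distance(s1, s2):
    """Count mismatching positions between two equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


class SketchHammingLookup:
    """Illustrative cache around a distance function (hypothetical)."""

    def __init__(self, hamming_func=hamming_distance):
        self._hamming_func = hamming_func
        self._cache = {}

    def dist(self, s1, s2):
        # Assumption: order-independent key, since the distance is symmetric
        key = (s1, s2) if s1 <= s2 else (s2, s1)
        if key not in self._cache:
            # New pair: calculate once and remember the result
            self._cache[key] = self._hamming_func(s1, s2)
        return self._cache[key]


lookup = SketchHammingLookup()
print(lookup.dist("ACGT", "ACGA"))  # calculated on first request
print(lookup.dist("ACGT", "ACGA"))  # served from the cache
```

Caching pays off when the same small set of index sequences is compared against many reads, because most reads carry a barcode that has been seen before.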
- class auto_process_ngs.barcodes.splitter.HammingMetrics
Calculate Hamming distances between two strings
Implements a set of functions (as class methods) for calculating Hamming distances between two strings (sequences):
hamming_distance: for equal-length sequences
hamming_distance_truncate: for non-equal-length sequences
hamming_distance_with_N: for equal-length sequences where N’s automatically mismatch
For background on Hamming distance see e.g.:
http://en.wikipedia.org/wiki/Hamming_distance
- classmethod hamming_distance(s1, s2)
Get Hamming distance for equal-length sequences
- classmethod hamming_distance_truncate(s1, s2)
Get Hamming distance for non-equal-length sequences
- classmethod hamming_distance_with_N(s1, s2)
Get Hamming distance for equal-length sequences (no match for Ns)
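The three variants can be sketched as plain functions to show how they differ. This is an illustration under stated assumptions, not the library's code: it assumes the equal-length variants raise ValueError on unequal input, and that the truncating variant compares only the leading characters up to the length of the shorter sequence.

```python
def hamming_distance(s1, s2):
    """Mismatch count for equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")  # assumption
    return sum(a != b for a, b in zip(s1, s2))


def hamming_distance_truncate(s1, s2):
    """Mismatch count after truncating to the shorter length (assumption)."""
    n = min(len(s1), len(s2))
    return sum(a != b for a, b in zip(s1[:n], s2[:n]))


def hamming_distance_with_N(s1, s2):
    """Mismatch count where any position holding an N counts as a mismatch."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")  # assumption
    return sum(a != b or a == "N" or b == "N" for a, b in zip(s1, s2))


print(hamming_distance("ACNT", "ACNT"))             # identical strings
print(hamming_distance_with_N("ACNT", "ACNT"))      # the N position mismatches
print(hamming_distance_truncate("ACGTAA", "ACGA"))  # compares ACGT with ACGA
```

Counting N as a mismatch is the conservative choice for barcode matching, since an N is a base the sequencer could not call.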
- auto_process_ngs.barcodes.splitter.get_fastqs_from_dir(dirn, lane, unaligned_dir=None)
Collect Fastq files for specified lane
- Parameters:
dirn (str) – path to directory to collect Fastq files from
lane (int) – lane Fastqs must have come from
unaligned_dir (str) – subdirectory of ‘dirn’ with outputs from bcl2fastq
- Returns:
list of Fastqs (for single-end data) or of Fastq pairs (for paired-end data).
- Return type:
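As an illustration of selecting Fastqs by lane and returning either single files or pairs, the sketch below filters a list of file names. It is not how get_fastqs_from_dir is implemented: the function name fastqs_for_lane is hypothetical, it works on names rather than scanning a directory, and it assumes the standard bcl2fastq naming pattern of the form NAME_S1_L001_R1_001.fastq.gz.

```python
import re

# Assumed bcl2fastq-style name: <stem>_L<lane>_R<read>_001.fastq[.gz]
FASTQ_NAME = re.compile(
    r"^(?P<stem>.+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq(\.gz)?$"
)


def fastqs_for_lane(filenames, lane):
    """Return Fastqs for a lane: a list of names, or of (R1, R2) tuples."""
    r1, r2 = {}, {}
    for name in filenames:
        m = FASTQ_NAME.match(name)
        if m is None or int(m.group("lane")) != lane:
            continue  # not a Fastq, or from a different lane
        target = r1 if m.group("read") == "1" else r2
        target[m.group("stem")] = name
    if not r2:
        # No R2 files found: treat as single-end data
        return [r1[stem] for stem in sorted(r1)]
    # Paired-end data: pair each R1 with the R2 sharing its stem
    return [(r1[stem], r2[stem]) for stem in sorted(r1) if stem in r2]


names = [
    "PJB_S1_L001_R1_001.fastq.gz",
    "PJB_S1_L001_R2_001.fastq.gz",
    "PJB_S1_L002_R1_001.fastq.gz",
]
print(fastqs_for_lane(names, 1))
print(fastqs_for_lane(names, 2))
```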
- auto_process_ngs.barcodes.splitter.split_paired_end(matcher, fastq_pairs, base_name=None, output_dir=None)
Split reads from paired end data
For each fastq file pair in ‘fastq_pairs’, check reads against the index sequences in the BarcodeMatcher ‘matcher’ and write to an appropriate file.
- Parameters:
matcher (BarcodeMatcher) – barcode matcher instance
fastq_pairs (list) – list of Fastq pairs to split
base_name (str) – optional, base name to use for output Fastq files
output_dir (str) – optional, path to directory to write output Fastqs to
- auto_process_ngs.barcodes.splitter.split_single_end(matcher, fastqs, base_name=None, output_dir=None)
Split reads from single ended data
For each fastq file in ‘fastqs’, check reads against the index sequences in the BarcodeMatcher ‘matcher’ and write to an appropriate file.
- Parameters:
matcher (BarcodeMatcher) – barcode matcher instance
fastqs (list) – list of Fastqs to split
base_name (str) – optional, base name to use for output Fastq files
output_dir (str) – optional, path to directory to write output Fastqs to
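The splitting step for single-end data can be sketched as below. This is a simplified illustration rather than the library's code: the function sketch_split_single_end and its helpers are hypothetical, it takes a plain match callable instead of a BarcodeMatcher, it reads uncompressed Fastqs only, and the output naming (BASE.BARCODE.fastq, with unmatched reads going to "undetermined") is an assumption. It relies on the Illumina header convention in which the index sequence is the last colon-separated field.

```python
import os
from itertools import islice


def read_fastq(path):
    """Yield FASTQ records as lists of four lines."""
    with open(path) as fp:
        while True:
            record = list(islice(fp, 4))
            if not record:
                break
            yield record


def index_from_header(header):
    # Illumina headers end with ':<index sequence>'
    return header.rstrip().split(":")[-1]


def sketch_split_single_end(match, fastqs, base_name="split", output_dir="."):
    """Write each read to a per-barcode file; return the barcodes seen."""
    outputs = {}
    try:
        for fastq in fastqs:
            for record in read_fastq(fastq):
                # Assumption: unmatched reads are collected as 'undetermined'
                barcode = match(index_from_header(record[0])) or "undetermined"
                if barcode not in outputs:
                    path = os.path.join(
                        output_dir, f"{base_name}.{barcode}.fastq")
                    outputs[barcode] = open(path, "w")
                outputs[barcode].writelines(record)
    finally:
        # Close every output file even if reading fails part way through
        for fp in outputs.values():
            fp.close()
    return sorted(outputs)
```

A paired-end version would follow the same pattern, reading R1 and R2 records in lockstep and writing both mates to the outputs chosen from the barcode of the pair.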