auto_process_ngs.barcodes.splitter

Classes and functions for sorting reads from FASTQ files by the barcodes (i.e. index sequences) in the read headers:

  • BarcodeMatcher: class for matching barcode sequences against a reference set

  • HammingMetrics: class implementing Hamming distance calculators

  • HammingLookup: class for fetching Hamming distance for two strings

  • get_fastqs_from_dir: list Fastqs matching a specific lane

  • split_single_end: split reads from single-end data

  • split_paired_end: split reads from paired-end data

class auto_process_ngs.barcodes.splitter.BarcodeMatcher(index_seqs, max_dist=0)

Class to match barcode sequences against reference set

match(seq)

Find matching index sequence for supplied sequence

Parameters:

seq (str) – sequence to match against indexes

Returns:

matching index sequence, or None if no match was found.

Return type:

String

property sequences

Return list of stored index sequences
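The matching behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: SketchMatcher and hamming are hypothetical names, and the sketch assumes that max_dist is the largest number of mismatched positions tolerated and that the first stored index within that distance is returned.

```python
def hamming(s1, s2):
    """Count mismatched positions between two equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


class SketchMatcher:
    """Hypothetical stand-in for BarcodeMatcher (not the library class)."""

    def __init__(self, index_seqs, max_dist=0):
        # Assumption: indexes are stored sorted so matching order is stable
        self._index_seqs = sorted(index_seqs)
        self._max_dist = max_dist

    @property
    def sequences(self):
        """Return a list of the stored index sequences."""
        return list(self._index_seqs)

    def match(self, seq):
        """Return the first index within max_dist mismatches, else None."""
        for index_seq in self._index_seqs:
            if len(index_seq) != len(seq):
                continue  # assumption: only equal-length sequences can match
            if hamming(index_seq, seq) <= self._max_dist:
                return index_seq
        return None


matcher = SketchMatcher(["ATTGCT", "GGCATA"], max_dist=1)
exact = matcher.match("ATTGCT")    # identical to a stored index
near = matcher.match("ATTGCA")     # one mismatch from ATTGCT
missing = matcher.match("CCCCCC")  # too far from both indexes
```

With max_dist=0 only exact matches would be accepted, which corresponds to the default shown in the class signature.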

class auto_process_ngs.barcodes.splitter.HammingLookup(hamming_func=<bound method HammingMetrics.hamming_distance of <class 'auto_process_ngs.barcodes.splitter.HammingMetrics'>>)

Class for handling Hamming distances for multiple sequences

dist(s1, s2)

Return the Hamming distance for two sequences

If the sequences are new then calculates, caches and returns the Hamming distance for this pair (calculated using the function supplied on instantiation); otherwise returns the cached distance calculated previously.

Parameters:
  • s1 (str) – first sequence

  • s2 (str) – second sequence

Returns:

Hamming distance for the two sequences.

Return type:

Integer
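The calculate-then-cache behaviour of dist can be sketched as follows. SketchLookup and count_mismatches are hypothetical names used for illustration only, and keying the cache on the ordered pair (s1, s2) is an assumption; the real class may store its cache differently.

```python
class SketchLookup:
    """Hypothetical stand-in for HammingLookup (not the library class)."""

    def __init__(self, hamming_func):
        self._hamming_func = hamming_func
        self._cache = {}
        self.calls = 0  # counts real calculations, for demonstration only

    def dist(self, s1, s2):
        """Return the distance for (s1, s2), calculating it at most once."""
        key = (s1, s2)  # assumption: cache is keyed on the ordered pair
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._hamming_func(s1, s2)
        return self._cache[key]


def count_mismatches(s1, s2):
    """Simple distance function for equal-length strings."""
    return sum(a != b for a, b in zip(s1, s2))


lookup = SketchLookup(count_mismatches)
first = lookup.dist("ACGT", "ACGA")   # calculated and cached
second = lookup.dist("ACGT", "ACGA")  # served from the cache
```

Caching pays off when the same small set of index sequences is compared against many reads, since each distinct pair is only calculated once.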

class auto_process_ngs.barcodes.splitter.HammingMetrics

Calculate Hamming distances between two strings

Implements a set of functions (as class methods) for calculating Hamming distances between two strings (sequences):

  • hamming_distance: for equal-length sequences

  • hamming_distance_truncate: for non-equal-length sequences

  • hamming_distance_with_N: for equal-length sequences where N’s automatically mismatch

For background on Hamming distance see e.g.:

http://en.wikipedia.org/wiki/Hamming_distance

classmethod hamming_distance(s1, s2)

Get Hamming distance for equal-length sequences

classmethod hamming_distance_truncate(s1, s2)

Get Hamming distance for non-equal-length sequences

classmethod hamming_distance_with_N(s1, s2)

Get Hamming distance for equal-length sequences (no match for Ns)
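The three variants can be sketched as plain functions. This is an illustration under stated assumptions rather than the library's code: the function names are hypothetical, the truncating variant is assumed to compare only up to the length of the shorter sequence, and the N-aware variant is assumed to count any position containing an N as a mismatch, even when both sequences have N there.

```python
def hamming_equal(s1, s2):
    """Hamming distance for equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


def hamming_truncate(s1, s2):
    """Distance over the shared prefix length (assumed behaviour)."""
    # zip stops at the end of the shorter sequence
    return sum(a != b for a, b in zip(s1, s2))


def hamming_with_n(s1, s2):
    """Distance where any position holding an N counts as a mismatch."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b or a == "N" or b == "N" for a, b in zip(s1, s2))


d_equal = hamming_equal("ACGT", "ACGA")        # last base differs
d_trunc = hamming_truncate("ACGTAA", "ACGA")   # only first 4 bases compared
d_with_n = hamming_with_n("ANGT", "ANGT")      # the N position mismatches
```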

auto_process_ngs.barcodes.splitter.get_fastqs_from_dir(dirn, lane, unaligned_dir=None)

Collect Fastq files for specified lane

Parameters:
  • dirn (str) – path to directory to collect Fastq files from

  • lane (int) – lane Fastqs must have come from

  • unaligned_dir (str) – subdirectory of ‘dirn’ with outputs from bcl2fastq

Returns:

list of Fastqs (for single-end data) or of Fastq pairs (for paired-end data).

Return type:

List

auto_process_ngs.barcodes.splitter.split_paired_end(matcher, fastq_pairs, base_name=None, output_dir=None)

Split reads from paired end data

For each Fastq file pair in ‘fastq_pairs’, check reads against the index sequences in the BarcodeMatcher ‘matcher’ and write to an appropriate file.

Parameters:
  • matcher (BarcodeMatcher) – barcode matcher instance

  • fastq_pairs (list) – list of Fastq pairs to split

  • base_name (str) – optional, base name to use for output Fastq files

  • output_dir (str) – optional, path to directory to write output Fastqs to

auto_process_ngs.barcodes.splitter.split_single_end(matcher, fastqs, base_name=None, output_dir=None)

Split reads from single ended data

For each Fastq file in ‘fastqs’, check reads against the index sequences in the BarcodeMatcher ‘matcher’ and write to an appropriate file.

Parameters:
  • matcher (BarcodeMatcher) – barcode matcher instance

  • fastqs (list) – list of Fastqs to split

  • base_name (str) – optional, base name to use for output Fastq files

  • output_dir (str) – optional, path to directory to write output Fastqs to
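The splitting step can be illustrated with a small in-memory sketch. It does not reproduce the library's file handling or output naming: index_from_header and split_records are hypothetical helpers, the "undetermined" bin name is an assumption, and matching is exact for simplicity. The sketch relies on the Illumina (Casava 1.8 and later) header layout, in which the index sequence is the last colon-separated field.

```python
def index_from_header(header):
    """Return the index sequence from an Illumina-style read header."""
    # e.g. "@M1:1:FC:1:1101:100:200 1:N:0:ATTGCT" -> "ATTGCT"
    return header.strip().split(":")[-1]


def split_records(records, index_seqs):
    """Group (header, seq, plus, qual) records by exact index match."""
    bins = {}
    for record in records:
        barcode = index_from_header(record[0])
        # Hypothetical bin name for reads that match no known index
        key = barcode if barcode in index_seqs else "undetermined"
        bins.setdefault(key, []).append(record)
    return bins


records = [
    ("@M1:1:FC:1:1101:100:200 1:N:0:ATTGCT", "ACGT", "+", "IIII"),
    ("@M1:1:FC:1:1101:101:201 1:N:0:GGCATA", "TTGA", "+", "IIII"),
    ("@M1:1:FC:1:1101:102:202 1:N:0:CCCCCC", "GGGA", "+", "IIII"),
]
bins = split_records(records, {"ATTGCT", "GGCATA"})
```

In a real run each bin would correspond to an output Fastq file, and the exact membership test would be replaced by a call to the matcher so that reads within the allowed mismatch distance are also assigned.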