auto_process_ngs.fastq_utils

Utility classes and functions for operating on Fastq files:

  • BaseFastqAttrs: base class for extracting info from a Fastq file

  • IlluminaFastqAttrs: class for extracting info from Illumina Fastqs

  • FastqReadCounter: implements various methods for counting reads in FASTQ files

  • assign_barcodes_single_end: extract and assign inline barcodes

  • get_read_number: get the read number (1 or 2) from a Fastq file

  • get_read_count: count total reads across one or more Fastqs

  • pair_fastqs: automagically pair up FASTQ files

  • pair_fastqs_by_name: pair up FASTQ files based on their names

  • group_fastqs_by_name: group FASTQ files based on their names (more general version of ‘pair_fastqs_by_name’ which can handle arbitrary collections of read IDs)

  • remove_index_fastqs: remove index (I1/I2) Fastqs from a list

  • build_custom_fastq_attrs_regex: make regular expression patterns and format templates for custom ‘FastqAttrs’ classes

  • get_custom_fastq_attrs_class: create a custom ‘FastqAttrs’ class for handling non-canonical Fastq filenames.

class auto_process_ngs.fastq_utils.BaseFastqAttrs(fastq)

Base class for extracting information about a Fastq file

Instances of this class provide the following attributes:

  • fastq: the original fastq file name

  • basename: basename with path and NGS extensions stripped

  • extension: full extension e.g. ‘.fastq.gz’

  • type: file type (‘fastq’ or ‘bam’)

  • compression: compression type (‘gz’ or ‘bz2’)

  • sample_name: name of the sample

  • sample_number: integer (or None if no sample number)

  • barcode_sequence: barcode sequence (string or None)

  • lane_number: integer (or None if no lane number)

  • read_number: integer (or None if no read number)

  • set_number: integer (or None if no set number)

  • is_index_read: boolean (True if index read, False if not)

The ‘fastq’, ‘basename’, ‘extension’, ‘type’ and ‘compression’ attributes are set by this class; the remaining attributes (at minimum ‘sample_name’) should be set by the subclass.

Subclasses should also implement the ‘fastq_basename’ and ‘bam_basename’ methods.

bam_basename(fq_attrs=None)

Return basename for an associated BAM file

Should be overridden by subclasses.

fastq_basename(fq_attrs=None)

Return the basename of the Fastq file

fq_attrs()

Return attributes as a dictionary

class auto_process_ngs.fastq_utils.FastqReadCounter

Implements various methods for counting reads in FASTQ files

The methods are:

  • simple: a wrapper for the FASTQFile.nreads() function

  • fastqiterator: counts reads using FASTQFile.FastqIterator

  • zcat_wc: runs ‘zcat | wc -l’ in the shell

  • reads_per_lane: counts reads by lane using FastqIterator

static fastqiterator(fastq=None, fp=None)

Return number of reads in a FASTQ file

Uses the FASTQFile.FastqIterator class to do the counting.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Number of reads

static reads_per_lane(fastq=None, fp=None)

Return counts of reads in each lane of FASTQ file

Uses the FASTQFile.FastqIterator class to do the counting, with counts split by lane.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Dictionary where keys are lane numbers (as integers) and values are the number of reads in that lane.
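The per-lane counting described above can be illustrated with a minimal sketch. It does not use FASTQFile.FastqIterator; instead it assumes four-line FASTQ records and the newer Illumina header layout, in which the lane is the fourth colon-separated field. The function name reads_per_lane_sketch and the example reads are hypothetical.

```python
import io
from collections import Counter

def reads_per_lane_sketch(fp):
    """Count reads per lane from an open, text-mode FASTQ stream.

    Assumes 4-line records and headers of the (assumed) form:
    @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:...
    """
    counts = Counter()
    for i, line in enumerate(fp):
        if i % 4 == 0:
            # Header line: the lane is the fourth colon-separated field
            lane = int(line.split(":")[3])
            counts[lane] += 1
    return dict(counts)

# Made-up reads: one in lane 1, two in lane 2
example = io.StringIO(
    "@M001:12:FC1:1:1101:100:200 1:N:0:ATCACG\nACGT\n+\nIIII\n"
    "@M001:12:FC1:2:1101:101:201 1:N:0:ATCACG\nTTGA\n+\nIIII\n"
    "@M001:12:FC1:2:1101:102:202 1:N:0:ATCACG\nGGCA\n+\nIIII\n"
)
lane_counts = reads_per_lane_sketch(example)
```

Here lane_counts maps integer lane numbers to read counts, mirroring the documented return value.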

static simple(fastq=None, fp=None)

Return number of reads in a FASTQ file

Uses the FASTQFile.nreads function to do the counting.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Number of reads

static zcat_wc(fastq=None, fp=None)

Return number of reads in a FASTQ file

Uses a system call to run ‘zcat FASTQ | wc -l’ to do the counting (or just ‘wc -l’ if not a gzipped FASTQ).

Note that this can only operate on Fastq files; streams provided via the ‘fp’ argument are not supported and will raise an exception.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Number of reads
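As a rough pure-Python analogue of the line-counting approach, the sketch below counts lines and divides by four. It assumes every read occupies exactly four lines, which the text above does not state explicitly; the helper name count_reads_by_lines and the temporary example file are hypothetical.

```python
import gzip
import os
import tempfile

def count_reads_by_lines(fastq):
    """Count reads as (number of lines) // 4, transparently handling .gz files."""
    opener = gzip.open if fastq.endswith(".gz") else open
    with opener(fastq, "rt") as fp:
        nlines = sum(1 for _ in fp)
    # Assumption: each FASTQ record is exactly four lines long
    return nlines // 4

# Write a tiny gzipped FASTQ with two made-up reads, then count them
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.fastq.gz")
with gzip.open(path, "wt") as out:
    out.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
n_reads = count_reads_by_lines(path)
```

Like zcat_wc, this variant needs a file path rather than an already open stream.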

class auto_process_ngs.fastq_utils.IlluminaFastqAttrs(fastq)

Class for extracting information about Fastq files

Given the name of a Fastq file, extract data about the sample name, barcode sequence, lane number, read number and set number.

The name format can be a ‘full’ Fastq name as generated by CASAVA or bcl2fastq 1.8, which follows the general form:

<sample_name>_<barcode_sequence>_L<lane_number>_R<read_number>_<set_number>.fastq.gz

e.g. for

NA10831_ATCACG_L002_R1_001.fastq.gz

sample_name = ‘NA10831’
barcode_sequence = ‘ATCACG’
lane_number = 2
read_number = 1
set_number = 1

Alternatively it can be a full Fastq name as generated by bcl2fastq2, of the general form:

<sample_name>_S<sample_number>_L<lane_number>_R<read_number>_001.fastq.gz

e.g. for

ES_exp1_S4_L003_R2_001.fastq.gz

sample_name = ‘ES_exp1’
sample_number = 4
lane_number = 3
read_number = 2
set_number = 1

bcl2fastq can also produce ‘index read’ Fastq files where the R1/R2 is replaced by I1 (or I2), e.g.:

ES_exp1_S4_L003_I1_001.fastq.gz

Alternatively it can be a ‘reduced’ version where one or more of the components has been omitted (typically because they are redundant in uniquely distinguishing a Fastq file within a set of Fastqs).

The reduced formats are:

<sample_name>
<sample_name>_L<lane_number>
<sample_name>_<barcode_sequence>
<sample_name>_<barcode_sequence>_L<lane_number>

with an optional suffix ‘_R<read_number>’ for paired end sets.

e.g.

NA10831
NA10831_L002
NA10831_ATCACG
NA10831_ATCACG_L002

Finally, the name can be a non-standard name of the form:

<sample_name>.r<read_number>

or

<sample_name>.<barcode_sequence>.r<read_number>

In this case the sample_names are permitted to include dots.

Provides the following attributes:

  • fastq: the original fastq file name

  • sample_name: name of the sample (leading part of the name)

  • sample_number: integer (or None if no sample number)

  • barcode_sequence: barcode sequence (string or None)

  • lane_number: integer (or None if no lane number)

  • read_number: integer (or None if no read number)

  • set_number: integer (or None if no set number)

  • is_index_read: boolean (True if index read, False if not)
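To make the bcl2fastq2 naming scheme concrete, the following sketch parses such names with a regular expression. It is an illustration only and is not the class's actual implementation; the names BCL2FASTQ2_NAME and parse_bcl2fastq2_name are hypothetical, and only the full bcl2fastq2 form (including index reads) is handled.

```python
import os
import re

# Assumed layout (from the description above):
# <sample_name>_S<sample_number>_L<lane_number>_<R|I><read_number>_<set_number>
BCL2FASTQ2_NAME = re.compile(
    r"^(?P<sample_name>.+)"
    r"_S(?P<sample_number>\d+)"
    r"_L(?P<lane_number>\d+)"
    r"_(?P<read_type>[RI])(?P<read_number>\d+)"
    r"_(?P<set_number>\d+)$"
)

def parse_bcl2fastq2_name(fastq):
    """Return a dict of attributes, or None if the name doesn't match."""
    name = os.path.basename(fastq)
    # Strip the Fastq extension and any compression suffix
    name = re.sub(r"\.(fastq|fq)(\.gz|\.bz2)?$", "", name)
    match = BCL2FASTQ2_NAME.match(name)
    if match is None:
        return None
    attrs = match.groupdict()
    return {
        "sample_name": attrs["sample_name"],
        "sample_number": int(attrs["sample_number"]),
        "lane_number": int(attrs["lane_number"]),
        "read_number": int(attrs["read_number"]),
        "set_number": int(attrs["set_number"]),
        "is_index_read": attrs["read_type"] == "I",
    }

r2 = parse_bcl2fastq2_name("ES_exp1_S4_L003_R2_001.fastq.gz")
i1 = parse_bcl2fastq2_name("ES_exp1_S4_L003_I1_001.fastq.gz")
```

The reduced and dot-separated name forms would need additional patterns, which this sketch leaves out.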

bam_basename(fq_attrs=None)

Construct basename for BAM files

fastq_basename(fq_attrs=None)

Reconstruct basename for Fastq files

auto_process_ngs.fastq_utils.assign_barcodes_single_end(fastq_in, fastq_out, n=5)

Extract inline barcodes and assign to Fastq read headers

Strips the first n bases from each read of the input FASTQ file and assigns them as the index sequence for that read in the output file.

If the supplied output file name ends with ‘.gz’ then it will be gzipped.

Parameters:
  • fastq_in (str) – input FASTQ file (can be gzipped)

  • fastq_out (str) – output FASTQ file (will be gzipped if ending with ‘.gz’)

  • n (integer) – number of bases to extract and assign as index sequence (default: 5)

Returns:

number of reads processed.

Return type:

Integer
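The barcode extraction step can be sketched on in-memory records as follows. This is not the library's code: it assumes four-line records, assumes the index sequence is the last colon-separated field of the header, and assumes the quality string is trimmed to match the shortened read. The function name assign_barcodes_sketch and the example read are hypothetical.

```python
def assign_barcodes_sketch(lines, n=5):
    """Move the first n bases of each read into the header's index field.

    Returns (output_lines, number_of_reads_processed).
    """
    lines = [line.rstrip("\n") for line in lines]
    out = []
    nreads = 0
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        barcode = seq[:n]
        # Assumption: the index sequence is the final colon-separated field
        prefix, _, _old_index = header.rpartition(":")
        # Assumption: qualities are trimmed along with the bases
        out.extend(["%s:%s" % (prefix, barcode), seq[n:], plus, qual[n:]])
        nreads += 1
    return out, nreads

# One made-up read with an empty index field in its header
record = [
    "@M001:12:FC1:1:1101:100:200 1:N:0:",
    "ACGTAGGCTA",
    "+",
    "IIIIIHHHHH",
]
processed, count = assign_barcodes_sketch(record, n=5)
```

The returned count corresponds to the documented return value (number of reads processed).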

auto_process_ngs.fastq_utils.build_custom_fastq_attrs_regex(pattern)

Build regex pattern and string templates for Fastq filenames

Given a glob-like pattern describing a Fastq file name format, returns a tuple of (REGEX_PATTERN, FORMAT_STRING), which can be used to extract attributes from a Fastq file name and regenerate the name using those attributes.

Patterns are strings which should include the elements ‘{SAMPLE}’ and ‘{READ}’ along with constant characters and wildcard elements ‘*’.

For example:

{SAMPLE}_*_{READ}

would generate a regular expression and template for matching and regenerating file names of the form:

PJB1_1.fastq

where the sample name would be ‘PJB1’ and the read number would be ‘1’.

Parameters:
  • pattern (str) – glob-like pattern describing a Fastq file name format

Returns:

a tuple consisting of regular expression pattern and string template derived from the input pattern.

Return type:

Tuple
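One way such a pattern could be turned into a regular expression and a format template is sketched below. This is an assumption about the general technique, not the function's real implementation; the group names (sample_name, read_number, wild0, ...) and the helper name build_regex_sketch are hypothetical.

```python
import re

def build_regex_sketch(pattern):
    """Convert a glob-like pattern into (regex_pattern, format_string)."""
    regex_parts = []
    format_parts = []
    n_wild = 0
    # Split on the special elements, keeping them as tokens
    for token in re.split(r"(\{SAMPLE\}|\{READ\}|\*)", pattern):
        if token == "{SAMPLE}":
            regex_parts.append(r"(?P<sample_name>.+?)")
            format_parts.append("{sample_name}")
        elif token == "{READ}":
            regex_parts.append(r"(?P<read_number>[1-9])")
            format_parts.append("{read_number}")
        elif token == "*":
            # Capture wildcards so the name can be regenerated later
            regex_parts.append(r"(?P<wild%d>.*)" % n_wild)
            format_parts.append("{wild%d}" % n_wild)
            n_wild += 1
        elif token:
            # Constant characters: match literally
            regex_parts.append(re.escape(token))
            format_parts.append(token.replace("{", "{{").replace("}", "}}"))
    return "^" + "".join(regex_parts) + "$", "".join(format_parts)

regex, template = build_regex_sketch("{SAMPLE}_S*_L*_R{READ}_001")
match = re.match(regex, "SMP1-1_S1_L002_R2_001")
sample = match.group("sample_name")
read = match.group("read_number")
regenerated = template.format(**match.groupdict())
```

The regular expression extracts the attributes, and the template rebuilds the original name from them, which is the round trip the function's description refers to.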

auto_process_ngs.fastq_utils.get_custom_fastqattrs_class(pattern)

Create and return a custom FastqAttrs class

This is a factory function that can be used to create custom FastqAttrs-type classes (subclassed from the BaseFastqAttrs class) to extract basic sample name and read number information from non-canonical Fastq file names.

The function takes a single argument which is a basic glob-like pattern describing the expected filename.

Patterns should include the strings {SAMPLE} in the position where the sample name is expected, and {READ} where the read number is expected. The rest of the pattern should consist of some combination of the “fixed” (i.e. invariant) parts of the name and the “*” for the variable parts which are not either sample name or read number.

Some examples:

  • if the files are “S1-1_1.fastq”, “S1-1_2.fastq”, etc then the pattern “{SAMPLE}_{READ}” will enable name and read number extraction

  • if the files are “S1-1_R1.fastq”, “S1-1_R2.fastq”, etc then the pattern could be “{SAMPLE}_R{READ}”

  • for simple Illumina-style names (e.g. “SMP1-1_S1_L002_R2_001.fastq”) the pattern “{SAMPLE}_S*_L*_R{READ}_001” could be suitable

Note that patterns should not include file extensions.

Parameters:

pattern (str) – basic glob-like pattern to define sample and read number

Returns:

custom subclass of BaseFastqAttrs for parsing files with the specified name structure.

Return type:

Class

auto_process_ngs.fastq_utils.get_read_count(fastqs)

Get the total count of reads across multiple Fastqs

Parameters:

fastqs (list) – paths to one or more Fastq files

Returns:

total number of reads across all files.

Return type:

Integer

auto_process_ngs.fastq_utils.get_read_number(fastq)

Get the read number (1 or 2) from a Fastq file

Parameters:

fastq (str) – path to a Fastq file

Returns:

read number (1 or 2) extracted from the first read.

Return type:

Integer
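Extracting the read number from a read header might look like the sketch below. The two header layouts it recognises are assumptions (the newer Illumina style with a space-separated "<read>:<filtered>:<control>:<index>" block, and the older style ending in "/1" or "/2"); the function name read_number_from_header is hypothetical.

```python
import re

def read_number_from_header(header):
    """Return the read number (1 or 2) from a FASTQ header line, or None."""
    header = header.strip()
    # Assumed newer style: "@<id fields> <read>:<Y|N>:<control>:<index>"
    match = re.search(r"\s([12]):[YN]:\d+:", header)
    if match:
        return int(match.group(1))
    # Assumed older style: "@<id fields>#<index>/<read>"
    match = re.search(r"/([12])$", header)
    if match:
        return int(match.group(1))
    return None

new_style = read_number_from_header("@M001:12:FC1:1:1101:100:200 2:N:0:ATCACG")
old_style = read_number_from_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
```

Only the first read's header needs to be inspected, since all reads in a file are expected to share the same read number.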

auto_process_ngs.fastq_utils.group_fastqs_by_name(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)

Group Fastq files based on their name

Grouping is based on the read number and type for the supplied Fastq files being present in the file names; the file contents are not examined.

Unpaired Fastqs (i.e. those for which a mate cannot be found) are returned as a “pair” where the equivalent R1 or R2 mate is missing.

Parameters:
  • fastqs (list) – list of Fastqs to pair

  • fastq_attrs (BaseFastqAttrs) – optional, class to use for extracting data from the filename (default: IlluminaFastqAttrs)

Returns:

list of tuples (R1,R2) with the R1/R2 pairs, or (R1,) or (R2,) for unpaired files.

Return type:

List

auto_process_ngs.fastq_utils.pair_fastqs(fastqs)

Automagically pair up FASTQ files

Given a list of FASTQ files, generate a list of R1/R2 pairs by examining the header for the first read in each file.

Parameters:

fastqs (list) – list of paths to FASTQ files which will be paired.

Returns:

pair of lists of the form (paired,unpaired), where paired is a list of tuples consisting of FASTQ R1/R2 pairs and unpaired is a list of FASTQs which couldn’t be paired.

Return type:

Tuple

auto_process_ngs.fastq_utils.pair_fastqs_by_name(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)

Pair Fastq files based on their name

Pairing is based on the read number for the supplied Fastq files being present in the file names; the file contents are not examined.

Unpaired Fastqs (i.e. those for which a mate cannot be found) are returned as a “pair” where the equivalent R1 or R2 mate is missing.

Parameters:
  • fastqs (list) – list of Fastqs to pair

  • fastq_attrs (BaseFastqAttrs) – optional, class to use for extracting data from the filename (default: IlluminaFastqAttrs)

Returns:

list of tuples (R1,R2) with the R1/R2 pairs, or (R1,) or (R2,) for unpaired files.

Return type:

List
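Name-based pairing can be sketched without the attribute classes by masking the read number in each file name and grouping on the result. This is only an illustration of the idea: it assumes the read number appears as ‘_R1’ or ‘_R2’ followed by ‘_’ or ‘.’, and the function name pair_by_name_sketch is hypothetical.

```python
import os
import re
from collections import defaultdict

def pair_by_name_sketch(fastqs):
    """Group Fastqs into (R1, R2) tuples, or 1-tuples if a mate is missing."""
    groups = defaultdict(dict)
    for fq in sorted(fastqs):
        name = os.path.basename(fq)
        # Use the last "_R1"/"_R2" in the name as the read number
        matches = list(re.finditer(r"_R([12])(?=_|\.)", name))
        if not matches:
            # No read number found: keep the file as a singleton
            groups[name][1] = fq
            continue
        m = matches[-1]
        # Mask the read number so that mates share the same key
        key = name[:m.start()] + "_R?" + name[m.end():]
        groups[key][int(m.group(1))] = fq
    return [
        tuple(groups[key][read] for read in sorted(groups[key]))
        for key in sorted(groups)
    ]

pairs = pair_by_name_sketch([
    "ES_exp1_S4_L003_R2_001.fastq.gz",
    "ES_exp1_S4_L003_R1_001.fastq.gz",
    "ES_exp2_S5_L003_R1_001.fastq.gz",
])
```

As in the documented behaviour, only the file names are inspected; the file contents are never opened.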

auto_process_ngs.fastq_utils.remove_index_fastqs(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)

Remove index (I1/I2) Fastqs from list

Parameters:
  • fastqs (list) – list of paths to Fastq files

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting attributes from Fastq names (defaults to IlluminaFastqAttrs)

Returns:

input Fastq list with any index read Fastqs removed.

Return type:

List