auto_process_ngs.fastq_utils

Utility classes and functions for operating on Fastq files:

  • BaseFastqAttrs: base class for extracting info from Fastq file

  • IlluminaFastqAttrs: class for extracting info from Illumina Fastqs

  • FastqReadCounter: implements various methods for counting reads in FASTQ files

  • assign_barcodes_single_end: extract and assign inline barcodes

  • get_read_number: get the read number (1 or 2) from a Fastq file

  • get_read_count: count total reads across one or more Fastqs

  • pair_fastqs: automagically pair up FASTQ files

  • pair_fastqs_by_name: pair up FASTQ files based on their names

  • group_fastqs_by_name: group FASTQ files based on their names (more general version of ‘pair_fastqs_by_name’ which can handle arbitrary collections of read IDs)

  • remove_index_fastqs: remove index (I1/I2) Fastqs from a list

class auto_process_ngs.fastq_utils.BaseFastqAttrs(fastq)

Base class for extracting information about a Fastq file

Instances of this class provide the following attributes:

  • fastq: the original fastq file name

  • basename: basename with NGS extensions stripped

  • extension: full extension e.g. ‘.fastq.gz’

  • sample_name: name of the sample

  • sample_number: integer (or None if no sample number)

  • barcode_sequence: barcode sequence (string or None)

  • lane_number: integer (or None if no lane number)

  • read_number: integer (or None if no read number)

  • set_number: integer (or None if no set number)

  • is_index_read: boolean (True if index read, False if not)

Subclasses should process the supplied Fastq name and set these attributes appropriately.

class auto_process_ngs.fastq_utils.FastqReadCounter

Implements various methods for counting reads in FASTQ files

The methods are:

  • simple: a wrapper for the FASTQFile.nreads() function

  • fastqiterator: counts reads using FASTQFile.FastqIterator

  • zcat_wc: runs ‘zcat | wc -l’ in the shell

  • reads_per_lane: counts reads by lane using FastqIterator

static fastqiterator(fastq=None, fp=None)

Return number of reads in a FASTQ file

Uses the FASTQFile.FastqIterator class to do the counting.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Number of reads

static reads_per_lane(fastq=None, fp=None)

Return counts of reads in each lane of FASTQ file

Uses the FASTQFile.FastqIterator class to do the counting, with counts split by lane.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Dictionary where keys are lane numbers (as integers) and values are the number of reads in that lane.
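As an illustration of counting by lane, the sketch below reads the lane from the fourth colon-separated field of each read header. It assumes headers in the Illumina CASAVA 1.8+ layout (`@<instrument>:<run>:<flowcell>:<lane>:...`); the function name `count_reads_per_lane` is hypothetical and this is not the library's own implementation.

```python
from collections import Counter
from itertools import islice

def count_reads_per_lane(fp):
    """Count reads per lane from an open FASTQ text stream (hypothetical helper)."""
    counts = Counter()
    # Each FASTQ record is four lines; the first line of each is the header
    for header in islice(fp, 0, None, 4):
        # Assumed header layout: @instrument:run:flowcell:lane:tile:x:y ...
        lane = int(header.split(":")[3])
        counts[lane] += 1
    return dict(counts)
```

The result has the same shape as described above: integer lane numbers as keys and read counts as values.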

static simple(fastq=None, fp=None)

Return number of reads in a FASTQ file

Uses the FASTQFile.nreads function to do the counting.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Number of reads

static zcat_wc(fastq=None, fp=None)

Return number of reads in a FASTQ file

Uses a system call to run ‘zcat FASTQ | wc -l’ to do the counting (or just ‘wc -l’ if not a gzipped FASTQ).

Note that this can only operate on fastq files; supplying a stream via the ‘fp’ argument will raise an exception.

Parameters:
  • fastq – fastq(.gz) file

  • fp – open file descriptor for fastq file

Returns:

Number of reads
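The ‘zcat | wc -l’ approach relies on each FASTQ record occupying exactly four lines, so the read count is the line count divided by four. A minimal pure-Python sketch of the same idea follows; the helper name `count_reads_by_lines` is hypothetical and not part of the module.

```python
import gzip

def count_reads_by_lines(fastq):
    """Return (number of lines) // 4 for a plain or gzipped FASTQ file."""
    # Use gzip for '.gz' files, the built-in open otherwise
    opener = gzip.open if fastq.endswith(".gz") else open
    with opener(fastq, "rt") as fp:
        nlines = sum(1 for _ in fp)
    # A well-formed FASTQ has four lines per read
    if nlines % 4 != 0:
        raise ValueError(f"{fastq}: {nlines} lines is not a multiple of 4")
    return nlines // 4
```

Like zcat_wc, this sketch needs a file path rather than an already-open stream.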

class auto_process_ngs.fastq_utils.IlluminaFastqAttrs(fastq)

Class for extracting information about Fastq files

Given the name of a Fastq file, extract data about the sample name, barcode sequence, lane number, read number and set number.

The name format can be a ‘full’ Fastq name as generated by CASAVA or bcl2fastq 1.8, which follows the general form:

<sample_name>_<barcode_sequence>_L<lane_number>_R<read_number>_<set_number>.fastq.gz

e.g. for

NA10831_ATCACG_L002_R1_001.fastq.gz

  • sample_name = ‘NA10831’

  • barcode_sequence = ‘ATCACG’

  • lane_number = 2

  • read_number = 1

  • set_number = 1

Alternatively it can be a full Fastq name as generated by bcl2fastq2, of the general form:

<sample_name>_S<sample_number>_L<lane_number>_R<read_number>_001.fastq.gz

e.g. for

ES_exp1_S4_L003_R2_001.fastq.gz

  • sample_name = ‘ES_exp1’

  • sample_number = 4

  • lane_number = 3

  • read_number = 2

  • set_number = 1

bcl2fastq can also produce ‘index read’ Fastq files where the R1/R2 is replaced by I1, e.g.:

ES_exp1_S4_L003_I1_001.fastq.gz

Alternatively it can be a ‘reduced’ version where one or more of the components has been omitted (typically because they are redundant in uniquely distinguishing a Fastq file within a set of Fastqs).

The reduced formats are:

  • <sample_name>

  • <sample_name>_L<lane_number>

  • <sample_name>_<barcode_sequence>

  • <sample_name>_<barcode_sequence>_L<lane_number>

with an optional suffix ‘_R<read_number>’ for paired end sets.

e.g.

  • NA10831

  • NA10831_L002

  • NA10831_ATCACG

  • NA10831_ATCACG_L002

Finally, the name can be a non-standard name of the form:

<sample_name>.r<read_number>

or

<sample_name>.<barcode_sequence>.r<read_number>

In this case the sample_names are permitted to include dots.

Provides the following attributes:

  • fastq: the original fastq file name

  • sample_name: name of the sample (leading part of the name)

  • sample_number: integer (or None if no sample number)

  • barcode_sequence: barcode sequence (string or None)

  • lane_number: integer (or None if no lane number)

  • read_number: integer (or None if no read number)

  • set_number: integer (or None if no set number)

  • is_index_read: boolean (True if index read, False if not)
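To make the bcl2fastq2 naming scheme concrete, the following sketch parses only that one format with a regular expression. The names `BCL2FASTQ2_NAME` and `parse_bcl2fastq2_name` are hypothetical; the class itself also handles the CASAVA, reduced and dotted formats, which this illustration does not attempt.

```python
import os
import re

# <sample_name>_S<sample_number>_L<lane_number>_[R|I]<read_number>_<set_number>
BCL2FASTQ2_NAME = re.compile(
    r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)"
    r"_L(?P<lane_number>\d+)_(?P<read_type>[RI])(?P<read_number>\d+)"
    r"_(?P<set_number>\d+)$")

def parse_bcl2fastq2_name(fastq):
    """Return a dict of name components, or None if the name doesn't match."""
    name = os.path.basename(fastq)
    # Strip a known Fastq extension before matching
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
            break
    m = BCL2FASTQ2_NAME.match(name)
    if m is None:
        return None
    return {
        "sample_name": m.group("sample_name"),
        "sample_number": int(m.group("sample_number")),
        "lane_number": int(m.group("lane_number")),
        "read_number": int(m.group("read_number")),
        "set_number": int(m.group("set_number")),
        # 'I' in place of 'R' marks an index read
        "is_index_read": m.group("read_type") == "I",
    }
```

For ‘ES_exp1_S4_L003_R2_001.fastq.gz’ this yields the sample name ‘ES_exp1’, sample number 4, lane 3, read 2 and set 1, matching the example given earlier.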

auto_process_ngs.fastq_utils.assign_barcodes_single_end(fastq_in, fastq_out, n=5)

Extract inline barcodes and assign to Fastq read headers

Strips the first n bases from each read of the input FASTQ file and assigns them as the index sequence for that read in the output file.

If the supplied output file name ends with ‘.gz’ then it will be gzipped.

Parameters:
  • fastq_in (str) – input FASTQ file (can be gzipped)

  • fastq_out (str) – output FASTQ file (will be gzipped if ending with ‘.gz’)

  • n (integer) – number of bases to extract and assign as index sequence (default: 5)

Returns:

number of reads processed.

Return type:

Integer
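The barcode transfer described above can be sketched as follows. This is an illustrative approximation, not the library's code: it assumes Illumina-style headers whose last colon-separated field holds the index sequence, and the names `_open_fastq` and `move_inline_barcodes` are hypothetical.

```python
import gzip

def _open_fastq(path, mode):
    # Gzip if the name ends with '.gz', plain text otherwise
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)

def move_inline_barcodes(fastq_in, fastq_out, n=5):
    """Move the first n bases of each read into the header's index field."""
    nreads = 0
    with _open_fastq(fastq_in, "rt") as fin, _open_fastq(fastq_out, "wt") as fout:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline().rstrip("\n")
            qual = fin.readline().rstrip("\n")
            # Assumption: the index sequence is the last ':'-separated field
            prefix, _, _ = header.rpartition(":")
            # Trim the barcode from the sequence and its quality scores
            fout.write(f"{prefix}:{seq[:n]}\n{seq[n:]}\n{plus}\n{qual[n:]}\n")
            nreads += 1
    return nreads
```

Note that the quality string is trimmed by the same n characters so that sequence and quality stay the same length.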

auto_process_ngs.fastq_utils.get_read_count(fastqs)

Get the total count of reads across multiple Fastqs

Parameters:

fastqs (list) – list of paths to one or more Fastq files

Returns:

total number of reads across all files.

Return type:

Integer

auto_process_ngs.fastq_utils.get_read_number(fastq)

Get the read number (1 or 2) from a Fastq file

Parameters:

fastq (str) – path to a Fastq file

Returns:

read number (1 or 2) extracted from the first read.

Return type:

Integer
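As a rough illustration of reading the read number from the first record, the sketch below assumes a CASAVA 1.8+ style header, where the part after the space begins with the read number (for example ‘2:N:0:TCCTGA’). The helper name `read_number_from_first_header` is hypothetical, and older header styles are not handled.

```python
import gzip

def read_number_from_first_header(fastq):
    """Return the read number found in the first header line of a FASTQ."""
    opener = gzip.open if fastq.endswith(".gz") else open
    with opener(fastq, "rt") as fp:
        header = fp.readline().strip()
    # Assumed layout: "@<read id> <read>:<filtered>:<control>:<index>"
    description = header.split(" ")[1]
    return int(description.split(":")[0])
```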

auto_process_ngs.fastq_utils.group_fastqs_by_name(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)

Group Fastq files based on their name

Grouping is based on the read number and type for the supplied Fastq files being present in the file names; the file contents are not examined.

Unpaired Fastqs (i.e. those for which a mate cannot be found) are returned as a “pair” where the equivalent R1 or R2 mate is missing.

Parameters:
  • fastqs (list) – list of Fastqs to pair

  • fastq_attrs (BaseFastqAttrs) – optional, class to use for extracting data from the filename (default: IlluminaFastqAttrs)

Returns:

list of tuples (R1,R2) with the R1/R2 pairs, or (R1,) or (R2,) for unpaired files.

Return type:

List

auto_process_ngs.fastq_utils.pair_fastqs(fastqs)

Automagically pair up FASTQ files

Given a list of FASTQ files, generate a list of R1/R2 pairs by examining the header for the first read in each file.

Parameters:

fastqs (list) – list of paths to FASTQ files which will be paired.

Returns:

pair of lists of the form (paired,unpaired), where paired is a list of tuples consisting of FASTQ R1/R2 pairs and unpaired is a list of FASTQs which couldn’t be paired.

Return type:

Tuple
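Header-based pairing can be sketched as matching files whose first reads share an identifier but carry different read numbers. To stay self-contained, the example takes a mapping of file path to first header line instead of opening files; the function name `pair_by_first_header` and that input shape are assumptions for illustration, and CASAVA 1.8+ style headers are assumed.

```python
def pair_by_first_header(first_headers):
    """Pair files by the ID and read number in their first header line.

    first_headers: dict mapping FASTQ path -> header line of its first read
    """
    by_read_id = {}
    for path, header in first_headers.items():
        # Split "@<read id> <read>:<filtered>:..." at the first space
        read_id, _, description = header.strip().partition(" ")
        read_number = int(description.split(":")[0])
        by_read_id.setdefault(read_id, {})[read_number] = path
    paired, unpaired = [], []
    for reads in by_read_id.values():
        if 1 in reads and 2 in reads:
            paired.append((reads[1], reads[2]))
        else:
            unpaired.extend(reads.values())
    return sorted(paired), sorted(unpaired)
```

The return value mirrors the (paired,unpaired) form described above. Files that share both an identifier and a read number would overwrite each other in this simplified version.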

auto_process_ngs.fastq_utils.pair_fastqs_by_name(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)

Pair Fastq files based on their name

Pairing is based on the read number for the supplied Fastq files being present in the file names; the file contents are not examined.

Unpaired Fastqs (i.e. those for which a mate cannot be found) are returned as a “pair” where the equivalent R1 or R2 mate is missing.

Parameters:
  • fastqs (list) – list of Fastqs to pair

  • fastq_attrs (BaseFastqAttrs) – optional, class to use for extracting data from the filename (default: IlluminaFastqAttrs)

Returns:

list of tuples (R1,R2) with the R1/R2 pairs, or (R1,) or (R2,) for unpaired files.

Return type:

List
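Name-based pairing amounts to grouping files whose names are identical once the read number is masked out. The sketch below does this with a regular expression on the ‘_R1’/‘_R2’ token instead of a fastq_attrs class; `READ_TOKEN` and `pair_by_name` are hypothetical names, and the result is only an approximation of what the library returns.

```python
import os
import re
from collections import defaultdict

# "_R1" or "_R2" followed by "_", "." or the end of the name
READ_TOKEN = re.compile(r"_R([12])(?=[_.]|$)")

def pair_by_name(fastqs):
    """Group Fastq paths into (R1,R2), (R1,) or (R2,) tuples by file name."""
    groups = defaultdict(dict)
    for fq in fastqs:
        dirname, name = os.path.split(fq)
        matches = list(READ_TOKEN.finditer(name))
        if not matches:
            # No read number in the name: keep the file as its own group
            groups[(dirname, name)][1] = fq
            continue
        # Use the last token in case the sample name itself contains one
        m = matches[-1]
        masked = name[:m.start()] + "_R*" + name[m.end():]
        groups[(dirname, masked)][int(m.group(1))] = fq
    # Order each tuple by read number and the groups by masked name
    return [tuple(reads[r] for r in sorted(reads))
            for _, reads in sorted(groups.items())]
```

Because only names are inspected, files never need to be opened, which is the trade-off against header-based pairing.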

auto_process_ngs.fastq_utils.remove_index_fastqs(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)

Remove index (I1/I2) Fastqs from list

Parameters:
  • fastqs (list) – list of paths to Fastq files

  • fastq_attrs (BaseFastqAttrs) – class to use for extracting attributes from Fastq names (defaults to IlluminaFastqAttrs)

Returns:

input Fastq list with any index read Fastqs removed.

Return type:

List
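A simplified version of the filtering can be written by looking for an ‘_I1’ or ‘_I2’ token in each file name. This sketch uses a regular expression in place of the fastq_attrs class and its is_index_read attribute; `INDEX_TOKEN` and `drop_index_fastqs` are hypothetical names, and a sample name that itself ends in such a token would be misclassified.

```python
import os
import re

# "_I1" or "_I2" followed by "_", "." or the end of the name
INDEX_TOKEN = re.compile(r"_I[12](?=[_.]|$)")

def drop_index_fastqs(fastqs):
    """Return the input list without Fastqs that look like index reads."""
    return [fq for fq in fastqs
            if not INDEX_TOKEN.search(os.path.basename(fq))]
```

The order of the remaining files is preserved.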