auto_process_ngs.fastq_utils
Utility classes and functions for operating on Fastq files:
BaseFastqAttrs: base class for extracting info from Fastq file
IlluminaFastqAttrs: class for extracting info from Illumina Fastqs
FastqReadCounter: implements various methods for counting reads in FASTQ files
assign_barcodes_single_end: extract and assign inline barcodes
get_read_number: get the read number (1 or 2) from a Fastq file
get_read_count: count total reads across one or more Fastqs
pair_fastqs: automagically pair up FASTQ files
pair_fastqs_by_name: pair up FASTQ files based on their names
group_fastqs_by_name: group FASTQ files based on their names (more general version of ‘pair_fastqs_by_name’ which can handle arbitrary collections of read IDs)
remove_index_fastqs: remove index (I1/I2) Fastqs from a list
- class auto_process_ngs.fastq_utils.BaseFastqAttrs(fastq)
Base class for extracting information about a Fastq file
Instances of this class provide the following attributes:
fastq: the original fastq file name
basename: basename with NGS extensions stripped
extension: full extension e.g. ‘.fastq.gz’
sample_name: name of the sample
sample_number: integer (or None if no sample number)
barcode_sequence: barcode sequence (string or None)
lane_number: integer (or None if no lane number)
read_number: integer (or None if no read number)
set_number: integer (or None if no set number)
is_index_read: boolean (True if index read, False if not)
Subclasses should process the supplied Fastq name and set these attributes appropriately.
- class auto_process_ngs.fastq_utils.FastqReadCounter
Implements various methods for counting reads in FASTQ files
The methods are:
simple: a wrapper for the FASTQFile.nreads() function
fastqiterator: counts reads using FASTQFile.FastqIterator
zcat_wc: runs ‘zcat | wc -l’ in the shell
reads_per_lane: counts reads by lane using FastqIterator
- static fastqiterator(fastq=None, fp=None)
Return number of reads in a FASTQ file
Uses the FASTQFile.FastqIterator class to do the counting.
- Parameters:
fastq – fastq(.gz) file
fp – open file descriptor for fastq file
- Returns:
Number of reads
- static reads_per_lane(fastq=None, fp=None)
Return counts of reads in each lane of FASTQ file
Uses the FASTQFile.FastqIterator class to do the counting, with counts split by lane.
- Parameters:
fastq – fastq(.gz) file
fp – open file descriptor for fastq file
- Returns:
- Dictionary where keys are lane numbers (as integers)
and values are number of reads in that lane.
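The per-lane counting described above can be sketched with the standard library alone. This is an illustration, not the library's implementation: the function name `count_reads_per_lane` is hypothetical, and the sketch assumes four-line FASTQ records with Illumina-style headers in which the lane is the fourth colon-separated field.

```python
import gzip
from collections import Counter


def count_reads_per_lane(fastq):
    """Return {lane_number: read_count} for a FASTQ file (hypothetical helper).

    Assumes four-line records and headers of the form
    '@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...'.
    """
    opener = gzip.open if fastq.endswith(".gz") else open
    counts = Counter()
    with opener(fastq, "rt") as fp:
        for i, line in enumerate(fp):
            # Every fourth line (0, 4, 8, ...) is a read header
            if i % 4 == 0:
                lane = int(line.split(":")[3])
                counts[lane] += 1
    return dict(counts)
```

As in the documented return value, the keys are integer lane numbers and the values are read counts.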
- static simple(fastq=None, fp=None)
Return number of reads in a FASTQ file
Uses the FASTQFile.nreads function to do the counting.
- Parameters:
fastq – fastq(.gz) file
fp – open file descriptor for fastq file
- Returns:
Number of reads
- static zcat_wc(fastq=None, fp=None)
Return number of reads in a FASTQ file
Uses a system call to run ‘zcat FASTQ | wc -l’ to do the counting (or just ‘wc -l’ if not a gzipped FASTQ).
Note that this can only operate on Fastq files named via the ‘fastq’ argument; supplying a stream via the ‘fp’ argument raises an exception.
- Parameters:
fastq – fastq(.gz) file
fp – open file descriptor for fastq file
- Returns:
Number of reads
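The line-counting idea behind ‘zcat FASTQ | wc -l’ can be sketched in pure Python. This is a minimal illustration rather than the library's code: the name `count_reads_by_lines` is hypothetical, and the sketch assumes each read occupies exactly four lines, so the read count is the line count divided by four.

```python
import gzip


def count_reads_by_lines(fastq):
    """Count reads as (number of lines) / 4 (hypothetical helper)."""
    # Transparently handle gzipped and uncompressed files
    opener = gzip.open if fastq.endswith(".gz") else open
    with opener(fastq, "rt") as fp:
        nlines = sum(1 for _ in fp)
    if nlines % 4:
        raise ValueError(
            f"{fastq}: line count {nlines} is not a multiple of 4")
    return nlines // 4
```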
- class auto_process_ngs.fastq_utils.IlluminaFastqAttrs(fastq)
Class for extracting information about Fastq files
Given the name of a Fastq file, extract data about the sample name, barcode sequence, lane number, read number and set number.
The name format can be a ‘full’ Fastq name as generated by CASAVA or bcl2fastq 1.8, which follows the general form:
<sample_name>_<barcode_sequence>_L<lane_number>_R<read_number>_<set_number>.fastq.gz
e.g. for
NA10831_ATCACG_L002_R1_001.fastq.gz
sample_name = ‘NA10831’
barcode_sequence = ‘ATCACG’
lane_number = 2
read_number = 1
set_number = 1
Alternatively it can be a full Fastq name as generated by bcl2fastq2, of the general form:
<sample_name>_S<sample_number>_L<lane_number>_R<read_number>_001.fastq.gz
e.g. for
ES_exp1_S4_L003_R2_001.fastq.gz
sample_name = ‘ES_exp1’
sample_number = 4
lane_number = 3
read_number = 2
set_number = 1
bcl2fastq can also produce ‘index read’ Fastq files where the R1/R2 is replaced by I1, e.g.:
ES_exp1_S4_L003_I1_001.fastq.gz
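To make the bcl2fastq2 naming scheme concrete, the following sketch parses such names with a regular expression. It is an illustration only and does not reproduce how IlluminaFastqAttrs works internally; the pattern `BCL2FASTQ2_NAME` and the function `parse_bcl2fastq2_name` are hypothetical names, and the reduced and non-standard formats are not handled.

```python
import os
import re

# Hypothetical pattern for
# <sample_name>_S<sample_number>_L<lane_number>_[RI]<read_number>_<set_number>
BCL2FASTQ2_NAME = re.compile(
    r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)"
    r"_L(?P<lane_number>\d+)"
    r"_(?P<read_type>[RI])(?P<read_number>\d)"
    r"_(?P<set_number>\d+)\.fastq(\.gz)?$")


def parse_bcl2fastq2_name(fastq):
    """Return a dict of name components, or None if the name doesn't match."""
    m = BCL2FASTQ2_NAME.match(os.path.basename(fastq))
    if m is None:
        return None
    return {
        "sample_name": m.group("sample_name"),
        "sample_number": int(m.group("sample_number")),
        "lane_number": int(m.group("lane_number")),
        "read_number": int(m.group("read_number")),
        "set_number": int(m.group("set_number")),
        # 'I' in place of 'R' marks an index read
        "is_index_read": m.group("read_type") == "I",
    }
```

Applied to the two example names above, the sketch yields the listed values, with is_index_read set to True only for the I1 file.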
Alternatively it can be a ‘reduced’ version where one or more of the components has been omitted (typically because they are redundant in uniquely distinguishing a Fastq file within a set of Fastqs).
The reduced formats are:
<sample_name>
<sample_name>_L<lane_number>
<sample_name>_<barcode_sequence>
<sample_name>_<barcode_sequence>_L<lane_number>
with an optional suffix ‘_R<read_number>’ for paired end sets.
e.g.
NA10831
NA10831_L002
NA10831_ATCACG
NA10831_ATCACG_L002
Finally, the name can be a non-standard name of the form:
<sample_name>.r<read_number>
or
<sample_name>.<barcode_sequence>.r<read_number>
In this case the sample_names are permitted to include dots.
Provides the following attributes:
fastq: the original fastq file name
sample_name: name of the sample (leading part of the name)
sample_number: integer (or None if no sample number)
barcode_sequence: barcode sequence (string or None)
lane_number: integer (or None if no lane number)
read_number: integer (or None if no read number)
set_number: integer (or None if no set number)
is_index_read: boolean (True if index read, False if not)
- auto_process_ngs.fastq_utils.assign_barcodes_single_end(fastq_in, fastq_out, n=5)
Extract inline barcodes and assign to Fastq read headers
Strips the first n bases from each read of the input FASTQ file and assigns it to the index sequence for that read in the output file.
If the supplied output file name ends with ‘.gz’ then it will be gzipped.
- Parameters:
fastq_in (str) – input FASTQ file (can be gzipped)
fastq_out (str) – output FASTQ file (will be gzipped if ending with ‘.gz’)
n (integer) – number of bases to extract and assign as index sequence (default: 5)
- Returns:
number of reads processed.
- Return type:
Integer
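The barcode extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the name `move_inline_barcodes` is hypothetical, the sketch assumes four-line records whose header ends in an Illumina-style ‘:<index_sequence>’ field, and it assumes the quality string should be trimmed by the same n characters so that it stays aligned with the sequence.

```python
import gzip


def _open(path, mode):
    # Use gzip for names ending in '.gz', plain text otherwise
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def move_inline_barcodes(fastq_in, fastq_out, n=5):
    """Move the first n bases of each read into the header index field.

    Hypothetical helper; returns the number of reads processed.
    """
    nreads = 0
    with _open(fastq_in, "rt") as fin, _open(fastq_out, "wt") as fout:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline().rstrip("\n")
            qual = fin.readline().rstrip("\n")
            # Replace the last colon-separated header field (assumed to be
            # the index sequence) with the inline barcode
            prefix, _, _ = header.rpartition(":")
            fout.write(f"{prefix}:{seq[:n]}\n{seq[n:]}\n{plus}\n{qual[n:]}\n")
            nreads += 1
    return nreads
```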
- auto_process_ngs.fastq_utils.get_read_count(fastqs)
Get the total count of reads across multiple Fastqs
- Parameters:
fastqs (list) – paths to one or more Fastq files
- Returns:
total number of reads across all files.
- Return type:
Integer
- auto_process_ngs.fastq_utils.get_read_number(fastq)
Get the read number (1 or 2) from a Fastq file
- Parameters:
fastq (str) – path to a Fastq file
- Returns:
read number (1 or 2) extracted from the first read.
- Return type:
Integer
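As a rough illustration of reading the read number from the first read, the sketch below inspects the header line directly. The function names are hypothetical and the logic is an assumption about typical Illumina headers, not the library's code: it handles headers with a second field such as ‘1:N:0:ATCACG’ and older headers ending in ‘/1’ or ‘/2’.

```python
import gzip


def read_number_from_header(header):
    """Extract the read number from a FASTQ header line (hypothetical)."""
    parts = header.strip().split()
    if not parts:
        raise ValueError("empty header line")
    if len(parts) > 1:
        # Newer style: '@<id> <read>:<filtered>:<control>:<index>'
        return int(parts[1].split(":")[0])
    if parts[0][-2:] in ("/1", "/2"):
        # Older style: '@<id>/<read>'
        return int(parts[0][-1])
    raise ValueError(f"can't determine read number from {header!r}")


def first_read_number(fastq):
    """Return the read number taken from the first read in a FASTQ file."""
    opener = gzip.open if fastq.endswith(".gz") else open
    with opener(fastq, "rt") as fp:
        return read_number_from_header(fp.readline())
```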
- auto_process_ngs.fastq_utils.group_fastqs_by_name(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)
Group Fastq files based on their name
Grouping relies on the read number and type being present in the names of the supplied Fastq files; the file contents are not examined.
Unpaired Fastqs (i.e. those for which a mate cannot be found) are returned as a “pair” where the equivalent R1 or R2 mate is missing.
- Parameters:
fastqs (list) – list of Fastqs to pair
fastq_attrs (BaseFastqAttrs) – optional, class to use for extracting data from the filename (default: IlluminaFastqAttrs)
- Returns:
- list of tuples (R1,R2) with the R1/R2 pairs,
or (R1,) or (R2,) for unpaired files.
- Return type:
List
- auto_process_ngs.fastq_utils.pair_fastqs(fastqs)
Automagically pair up FASTQ files
Given a list of FASTQ files, generate a list of R1/R2 pairs by examining the header for the first read in each file.
- Parameters:
fastqs (list) – list of paths to FASTQ files which will be paired.
- Returns:
- pair of lists of the form (paired,unpaired),
where paired is a list of tuples consisting of FASTQ R1/R2 pairs and unpaired is a list of FASTQs which couldn’t be paired.
- Return type:
Tuple
- auto_process_ngs.fastq_utils.pair_fastqs_by_name(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)
Pair Fastq files based on their name
Pairing relies on the read number being present in the names of the supplied Fastq files; the file contents are not examined.
Unpaired Fastqs (i.e. those for which a mate cannot be found) are returned as a “pair” where the equivalent R1 or R2 mate is missing.
- Parameters:
fastqs (list) – list of Fastqs to pair
fastq_attrs (BaseFastqAttrs) – optional, class to use for extracting data from the filename (default: IlluminaFastqAttrs)
- Returns:
- list of tuples (R1,R2) with the R1/R2 pairs,
or (R1,) or (R2,) for unpaired files.
- Return type:
List
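Name-based pairing of this kind can be sketched with a regular expression. This is only an illustration: `pair_by_name` and `READ_TOKEN` are hypothetical names, the real function delegates name parsing to the fastq_attrs class, and the sketch assumes the read number appears as an ‘_R1’ or ‘_R2’ token followed by ‘_’ or ‘.’.

```python
import os
import re
from collections import defaultdict

# Hypothetical token marking the read number within a file name
READ_TOKEN = re.compile(r"_R([12])(?=_|\.)")


def pair_by_name(fastqs):
    """Return a list of (R1, R2), (R1,) or (R2,) tuples (hypothetical)."""
    groups = defaultdict(dict)
    for fq in fastqs:
        name = os.path.basename(fq)
        matches = list(READ_TOKEN.finditer(name))
        if matches:
            # Use the last token, in case the sample name contains one too
            m = matches[-1]
            read = int(m.group(1))
            key = name[:m.start()] + "_R*" + name[m.end():]
        else:
            # No read token: treat the file as an unpaired R1
            read, key = 1, name
        groups[(os.path.dirname(fq), key)][read] = fq
    return [tuple(group[r] for r in (1, 2) if r in group)
            for _, group in sorted(groups.items())]
```

Files that differ only in their read token share a key and end up in the same tuple; a file without a mate is returned as a one-element tuple.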
- auto_process_ngs.fastq_utils.remove_index_fastqs(fastqs, fastq_attrs=<class 'auto_process_ngs.fastq_utils.IlluminaFastqAttrs'>)
Remove index (I1/I2) Fastqs from list
- Parameters:
fastqs (list) – list of paths to Fastq files
fastq_attrs (BaseFastqAttrs) – class to use for extracting attributes from Fastq names (defaults to IlluminaFastqAttrs)
- Returns:
- input Fastq list with any index read
Fastqs removed.
- Return type:
List
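A name-based filter for index reads can be sketched as below. The names `drop_index_fastqs` and `INDEX_TOKEN` are hypothetical, and the sketch assumes index reads are marked by an ‘_I1’ or ‘_I2’ token followed by ‘_’ or ‘.’; the real function relies on the is_index_read attribute supplied by the fastq_attrs class instead.

```python
import os
import re

# Hypothetical token marking an index read within a file name
INDEX_TOKEN = re.compile(r"_I[12](?=_|\.)")


def drop_index_fastqs(fastqs):
    """Return the input list without index-read Fastqs (hypothetical)."""
    return [fq for fq in fastqs
            if not INDEX_TOKEN.search(os.path.basename(fq))]
```

A sample name that itself contains such a token would be filtered out by this simple pattern, which is one reason to prefer a proper name parser.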