auto_process_ngs.barcodes.analysis
Classes and functions for analysing barcodes (i.e. index sequences) from FASTQ read headers:
BarcodeCounter: utility class for counting barcode sequences
BarcodeGroup: class for sorting groups of related barcodes
SampleSheetBarcodes: class for handling barcode information from a sample sheet
Reporter: class for generating reports of barcode statistics
report_barcodes: populate Reporter with analysis of barcode stats
detect_barcodes_warnings: check whether reports include warnings
make_title: turn a string into a Markdown/rst title
percent: return values as percentage
- class auto_process_ngs.barcodes.analysis.BarcodeCounter(*counts_files)
Utility class to manage counts of barcode sequences
The BarcodeCounter class manages the counting of barcode index sequences and facilitates subsequent analysis of their frequency.
Usage
To set up a new (empty) BarcodeCounter:
>>> bc = BarcodeCounter()
To count the sequences from a fastq file (keeping the lane association):
>>> for read in FastqIterator("example.fq"):
...     bc.count_barcode(read.seqid.index_sequence,
...                      lane=read.seqid.flowcell_lane)
To get a list of the barcodes sorted by order of frequency (highest to lowest) across all lanes:
>>> bc.barcodes()
To get the count for a barcode in lane 1:
>>> bc.counts('TAGGCATGTAGCCTCT',1)
Reading and writing counts
Counts can be output to a file:
>>> bc.write("counts.out")
and subsequently loaded into a new BarcodeCounter:
>>> bc2 = BarcodeCounter("counts.out")
Multiple counts files can be combined:
>>> bc_all = BarcodeCounter("counts1.out","counts2.out")
Grouping barcodes
To put barcodes into groups use the ‘group’ method, e.g. to group all barcodes from lane 2:
>>> groups = bc.group(2)
which produces a list of BarcodeGroup instances.
- analyse(lane=None, sample_sheet=None, cutoff=None, mismatches=0, minimum_read_fraction=1e-06)
Analyse barcode frequencies
Returns a dictionary with the following keys:
barcodes: list of barcodes (or reference barcodes, if mismatches > 0)
cutoff: the specified cutoff fraction
mismatches: the specified number of mismatches to allow
total_reads: the total number of reads for the specified lane (or all reads, if no lane was specified)
coverage: the number of reads after cutoffs have been applied
counts: dictionary with barcodes from the ‘barcodes’ list as keys; each key points to a dictionary with keys:
* reads: number of reads associated with this barcode (or group, if mismatches > 0)
* sample: name of the associated sample (if a sample sheet was supplied, otherwise ‘None’)
* sequences: number of sequences in the group (always 1 if mismatches == 0)
- Parameters:
lane (integer) – lane to restrict analysis to (None analyses all lanes)
sample_sheet (str) – sample sheet file to compare barcodes against (None skips comparison)
cutoff (float) – if mismatches == 0 then barcodes must have at least this fraction of reads to be included; (if mismatches > 0 then this condition is applied to groups instead)
mismatches (integer) – maximum number of mismatched bases allowed when matching barcodes (default is 0 i.e. exact matches only)
minimum_read_fraction – speed-up parameter, excludes barcodes with less than this fraction of associated reads (speeds up the grouping calculation at the cost of some precision)
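The structure of the dictionary returned by analyse can be illustrated with a short sketch. The barcode sequences, read counts and sample names here are made up for illustration, the keys are assumed to be plain strings, and summarise is a hypothetical helper rather than part of the module:

```python
# Hypothetical data shaped like the dictionary described above;
# sequences, counts and sample names are invented for illustration.
analysis = {
    "barcodes": ["TAGGCATG", "CGTACTAG"],
    "cutoff": 0.001,
    "mismatches": 0,
    "total_reads": 1000,
    "coverage": 950,
    "counts": {
        "TAGGCATG": {"reads": 600, "sample": "SampleA", "sequences": 1},
        "CGTACTAG": {"reads": 350, "sample": None, "sequences": 1},
    },
}


def summarise(analysis):
    """Return one (barcode, sample, percent of total reads) tuple per barcode."""
    total = analysis["total_reads"]
    rows = []
    for barcode in analysis["barcodes"]:
        entry = analysis["counts"][barcode]
        rows.append((barcode, entry["sample"], 100.0 * entry["reads"] / total))
    return rows


for barcode, sample, pct in summarise(analysis):
    print(f"{barcode}\t{sample}\t{pct:.1f}%")
```

Iterating over the ‘barcodes’ list rather than the keys of ‘counts’ keeps the rows in the order supplied by the analysis.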
- barcode_lengths(lane=None)
Return lengths of barcode sequences
Returns a list of the barcode sequence lengths.
- Parameters:
lane (int) – if specified then restricts the list to barcodes that appear in the named lane (default is to get lengths from all barcodes in all lanes)
- barcodes(lane=None)
List barcodes sorted into order of frequency
Optionally restricts the list, and the frequency ordering, to a specified lane
- Parameters:
lane (int) – if not None then only list barcodes that appear in the specified lane (and ordering only uses the frequencies in that lane)
- Returns:
list of barcodes.
- Return type:
- count_barcode(barcode, lane=None, incr=1)
Increment count of a barcode sequence
- Parameters:
barcode (str) – barcode sequence to count
lane (int) – lane that the barcode appears in (None if unknown)
incr (int) – increment the count for the barcode in the lane by this amount (defaults to 1)
- counts(barcode, lane=None)
Return the number of counts for the barcode
If ‘lane’ is None then return counts across all lanes.
- counts_all(barcode)
Number of counts for the barcode across all lanes
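The relationship between count_barcode, counts and counts_all can be sketched with a per-lane tally. This is a standalone illustration that assumes one counter per lane; it is not taken from the BarcodeCounter source, and lane_counts is a hypothetical name:

```python
from collections import Counter, defaultdict

# One Counter per lane; the key None holds barcodes whose lane is unknown
lane_counts = defaultdict(Counter)


def count_barcode(barcode, lane=None, incr=1):
    """Increment the count for a barcode in the given lane."""
    lane_counts[lane][barcode] += incr


def counts_all(barcode):
    """Sum the counts for a barcode across every lane."""
    return sum(counter[barcode] for counter in lane_counts.values())


def counts(barcode, lane=None):
    """Counts for a barcode in one lane, or across all lanes if lane is None."""
    if lane is None:
        return counts_all(barcode)
    return lane_counts[lane][barcode]


count_barcode("TAGGCATGTAGCCTCT", lane=1)
count_barcode("TAGGCATGTAGCCTCT", lane=1)
count_barcode("TAGGCATGTAGCCTCT", lane=2, incr=5)
print(counts("TAGGCATGTAGCCTCT", 1), counts("TAGGCATGTAGCCTCT"))
```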
- filter_barcodes(cutoff=None, lane=None)
Return subset of index sequences filtered by specified criteria
- Parameters:
cutoff (float) – barcodes must account for at least this fraction of all reads to be included. Total reads are all reads if lane is ‘None’, or else total for the specified lane
lane (integer) – barcodes must appear in this lane
- Returns:
list of barcodes.
- Return type:
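The cutoff criterion can be sketched on a plain dictionary of counts for a single lane (or for all lanes combined). The helper filter_by_cutoff is hypothetical, and returning the barcodes ordered by frequency is an assumption of this sketch rather than documented behaviour of filter_barcodes:

```python
def filter_by_cutoff(counts, cutoff=None):
    """Keep barcodes holding at least `cutoff` of the total reads.

    counts: dict mapping barcode sequence to number of reads.
    """
    total = sum(counts.values())
    # Highest count first; ties broken alphabetically for a stable order
    ranked = sorted(counts, key=lambda b: (-counts[b], b))
    if cutoff is None or total == 0:
        return ranked
    return [b for b in ranked if counts[b] / total >= cutoff]


example = {"AAAA": 900, "CCCC": 99, "GGGG": 1}
print(filter_by_cutoff(example, cutoff=0.01))
```

With a cutoff of 0.01, ‘GGGG’ is dropped because it accounts for only 0.1% of the 1000 reads.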
- group(lane, mismatches=2, cutoff=None, seed_barcodes=None, minimum_read_fraction=1e-06)
Put barcodes into groups of similar sequences
Returns a list of BarcodeGroup instances
- Parameters:
lane – lane number to restrict the pool of barcodes to (set to None to use barcodes from all lanes)
mismatches – number of mismatches to allow when creating groups
cutoff – minimum number of reads as a fraction of all reads that a group must contain to be included (set to None to disable cut-off)
minimum_read_fraction – speed-up parameter, excludes barcodes with less than this fraction of associated reads. Speeds up the grouping calculation at the cost of some precision
seed_barcodes (list) – optional, set of barcode sequences (typically, expected index sequences from a sample sheet) which will be used to build groups around even if they have low associated counts
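One plausible way to form such groups is a greedy pass that seeds each group with the most frequent remaining barcode. The sketch below is an assumption for illustration only and may differ from what the group method does internally: greedy_groups and hamming are hypothetical helpers, only single-index barcodes are handled, and seed_barcodes and minimum_read_fraction are omitted.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def greedy_groups(counts, mismatches=2, cutoff=None):
    """Group similar barcodes; returns (reference, members, reads) tuples."""
    total = sum(counts.values())
    # Most frequent barcodes are considered first as group references
    remaining = sorted(counts, key=lambda b: (-counts[b], b))
    groups = []
    while remaining:
        ref = remaining.pop(0)
        members = [ref]
        unassigned = []
        for seq in remaining:
            if len(seq) == len(ref) and hamming(seq, ref) <= mismatches:
                members.append(seq)
            else:
                unassigned.append(seq)
        remaining = unassigned
        reads = sum(counts[s] for s in members)
        # The cutoff applies to the group total, not to single barcodes
        if cutoff is None or reads / total >= cutoff:
            groups.append((ref, members, reads))
    return groups


example = {"AAAA": 100, "AAAT": 10, "CCCC": 50, "GGGG": 1}
for ref, members, reads in greedy_groups(example, mismatches=1, cutoff=0.05):
    print(ref, members, reads)
```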
- property lanes
List of lane numbers that barcodes are counted in
- Returns:
list of integer lane numbers in ascending order.
- Return type:
- nreads(lane=None)
Number of reads counted
If lane is None then this is the number of reads across all lanes (i.e. all reads from all files).
- Parameters:
lane (int) – only report number of reads for the specified lane
- Returns:
number of reads.
- Return type:
Integer
- read(filen)
Read count data from a file
The ‘counts’ file is a four-column tab-delimited file:
Column 1: lane
Column 2: rank (ignored when reading in)
Column 3: barcode sequence
Column 4: counts
Older ‘counts’ files are three-column tab-delimited files with the following columns:
Column 1: rank (ignored when reading in)
Column 2: barcode sequence
Column 3: counts
In either case if the lane is missing or cannot be interpreted as an integer then it’s set to be ‘None’.
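The two layouts described above could be parsed along the following lines. This is a minimal standalone sketch, not the read method itself: parse_counts is a hypothetical helper, and skipping blank lines and lines starting with ‘#’ is an assumption about how header lines might appear in the file.

```python
import io


def parse_counts(stream):
    """Return (lane, barcode, count) tuples from counts-file lines."""
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment/header lines are skipped
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 4:
            lane_field, _rank, barcode, count = fields
        elif len(fields) == 3:
            # Older three-column layout has no lane column
            lane_field = None
            _rank, barcode, count = fields
        else:
            raise ValueError(f"unexpected number of columns: {line!r}")
        try:
            lane = int(lane_field)
        except (TypeError, ValueError):
            # Missing or non-integer lane is recorded as None
            lane = None
        rows.append((lane, barcode, int(count)))
    return rows


text = "1\t1\tAAAA\t10\n\t2\tCCCC\t5\n1\tGGGG\t3\n"
print(parse_counts(io.StringIO(text)))
```

The rank column is discarded in both layouts, since it can be recomputed from the counts.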
- write(filen)
Write barcode data to a file
- class auto_process_ngs.barcodes.analysis.BarcodeGroup(barcode, counts=0)
Class for storing groups of related barcodes
A group stores a representative ‘reference’ sequence and zero or more related sequences, along with the total counts for all the sequences combined.
Create a group:
>>> grp = BarcodeGroup('GCTACGCTCTAAGCCT',2894178)
Add a barcode to the group if it’s related:
>>> if grp.match('GCTCCGCTCTAAGCCT'):
...     grp.add('GCTCCGCTCTAAGCCT',94178)
List the barcodes in the group:
>>> grp.sequences
Get the total number of counts:
>>> grp.counts
Retrieve the reference barcode:
>>> grp.reference
- add(seq, counts)
Add a sequence to the group
- Parameters:
seq (str) – sequence to add to the group
counts (int) – count for that sequence (will be added to the total for the group)
- property counts
Return the total number of counts for all sequences
- match(seq, mismatches=2)
Check if a sequence matches the reference
The supplied sequence is checked against the stored reference, and is a match if the number of mismatching positions is less than or equal to the number of allowed mismatches.
Note that:
for dual index sequences and references (i.e. sequences which contain either a ‘+’ or ‘-’ character to separate the indices within the sequence), each index is checked separately and the mismatch limit is applied per index (i.e. not across the sequence as a whole);
positions with an ‘N’ in either sequence automatically count as a mismatch in that position;
sequences which differ in length automatically fail to match.
- Parameters:
seq (str) – sequence to check against the reference sequence
mismatches (int) – maximum number of mismatches that are allowed for the input sequence to be considered to match the reference (default is 2).
- Returns:
True if sequences match within the specified tolerance, False otherwise (or if sequence lengths differ)
- Return type:
Boolean
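The matching rules listed above (per-index mismatch limit, ‘N’ always counting as a mismatch, and length differences failing outright) can be sketched as a standalone function. This is an illustration written from the description, not the BarcodeGroup.match source; matches_reference and count_mismatches are hypothetical names.

```python
import re


def count_mismatches(seq, ref):
    """Count differing positions; an 'N' in either sequence is a mismatch."""
    return sum(1 for s, r in zip(seq, ref) if s != r or s == "N" or r == "N")


def matches_reference(seq, ref, mismatches=2):
    """Check seq against ref, applying the mismatch limit to each index."""
    if len(seq) != len(ref):
        return False
    # Dual index sequences are split on '+' or '-' and checked index by index
    seq_parts = re.split(r"[+-]", seq)
    ref_parts = re.split(r"[+-]", ref)
    if len(seq_parts) != len(ref_parts):
        return False
    for s, r in zip(seq_parts, ref_parts):
        if len(s) != len(r) or count_mismatches(s, r) > mismatches:
            return False
    return True


print(matches_reference("GCTCCGCTCTAAGCCT", "GCTACGCTCTAAGCCT"))
print(matches_reference("AATT+CCGG", "AAAA+CCCC", mismatches=2))
```

In the second call the sequences differ at four positions overall, but only two within each index, so the per-index limit of 2 is still satisfied.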
- property reference
Return the reference sequence for the group
- property sequences
List the barcode sequences in the group
- class auto_process_ngs.barcodes.analysis.Reporter
Class for generating reports of barcode statistics
Arbitrary blocks of text can be added with optional keyword ‘tags’; the report can then be written to a text file, to a stream, or as an XLS file.
Usage:
Make a new Reporter:
>>> r = Reporter()
Add a title:
>>> r.add("This is the title",title=True)
Add a heading:
>>> r.add("A heading",heading=True)
Add some text:
>>> r.add("Lorem ipsum")
Write to file:
>>> r.write("report.txt")
Write as XLS:
>>> r.write_xls("report.xls")
Write as HTML:
>>> r.write_html("report.html",title="Barcodes")
- add(content, **kws)
Add content to the report
Supplied content is appended to the existing content.
Arbitrary keyword-value pairs can also be associated with the content.
- property has_warnings
Report whether warnings were found in analysis
- write(fp=None, filen=None, title=None)
Write the report to a file or stream
- write_html(html_file, title=None, no_styles=False)
Write the report to an HTML file
- write_xls(xls_file, title=None)
Write the report to an XLS file
- class auto_process_ngs.barcodes.analysis.SampleSheetBarcodes(sample_sheet_file)
Class for handling index sequences from a sample sheet
Given a SampleSheet.csv file this class can extract the index sequences (aka barcodes) corresponding to sample names, and provides methods to look up one from the other.
(Note that for dual index sequences the indices are joined with a ‘+’ character, so for example ‘AGCCCTT’ and ‘GTTACAT’ become ‘AGCCCTT+GTTACAT’.)
Create an initial lookup object:
>>> s = SampleSheetBarcodes('SampleSheet.csv')
Get a list of barcodes in lane 1:
>>> s.barcodes(1)
Look up the sample name in lane 2 matching a barcode:
>>> s.lookup_sample('ATTGTG',2)
Look up the sample name matching a barcode in all lanes (e.g. if the sample sheet doesn’t include explicit lane information):
>>> s.lookup_sample('ATTGTG')
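The two-way lookup between normalised barcodes and sample names can be pictured with plain dictionaries. The sketch below does not parse a real sample sheet and is not how SampleSheetBarcodes is implemented; the rows, sample names and the helper normalise_barcode are hypothetical.

```python
def normalise_barcode(index1, index2=None):
    """Join dual indexes with '+'; single indexes are returned unchanged."""
    return f"{index1}+{index2}" if index2 else index1


# Hypothetical sample sheet rows: (lane, sample name, index 1, index 2)
rows = [
    (1, "SampleA", "AGCCCTT", "GTTACAT"),
    (2, "SampleB", "ATTGTG", None),
]

sample_by_barcode = {}
barcode_by_sample = {}
for lane, sample, index1, index2 in rows:
    barcode = normalise_barcode(index1, index2)
    sample_by_barcode[(lane, barcode)] = sample
    barcode_by_sample[(lane, sample)] = barcode

print(sample_by_barcode[(2, "ATTGTG")])
print(barcode_by_sample[(1, "SampleA")])
```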
- barcodes(lane=None)
Return a list of index sequences
If a lane is specified then a list of normalised barcode sequences for that lane is returned; if no lane is specified (or lane is ‘None’) then all barcode sequences are returned.
- Parameters:
lane (int) – lane to restrict barcodes to, or None to get all barcode sequences
- lookup_barcode(sample, lane=None)
Return normalised barcode sequence matching sample name
- Parameters:
sample (str) – sample name to get normalised barcode sequence for
lane (int) – optional, lane to look for matching barcode in
- lookup_sample(barcode, lane=None)
Return sample name matching barcode sequence
- Parameters:
barcode (str) – normalised barcode sequence to get sample name for
lane (int) – optional, lane to look for matching sample in
- samples(lane=None)
Return a list of sample names
If a lane is specified then a list of sample names for that lane is returned; if no lane is specified (or lane is ‘None’) then all sample names are returned.
- Parameters:
lane (int) – lane to restrict sample names to, or None to get all sample names
- auto_process_ngs.barcodes.analysis.detect_barcodes_warnings(report_file)
Look for warning text in barcode.report file
- Parameters:
report_file (str) – path to barcode report file
- Returns:
True if warnings were found, False if not.
- Return type:
Boolean
- auto_process_ngs.barcodes.analysis.make_title(text, underline='=')
Turn a string into a Markdown/rst title
- Parameters:
text (str) – text to make into title
underline (str) – underline character (defaults to ‘=’)
- Returns:
title text.
- Return type:
String
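An underlined title of this kind can be produced as in the sketch below. The helper underline_title is hypothetical; whether make_title itself adds trailing newlines is not stated here, so this version adds none.

```python
def underline_title(text, underline="="):
    """Return text followed by a line of underline characters of equal length."""
    return f"{text}\n{underline * len(text)}"


print(underline_title("Barcode analysis"))
print(underline_title("Lane 1", underline="-"))
```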
- auto_process_ngs.barcodes.analysis.percent(num, denom)
Return values as percentage
- Parameters:
num (float) – number to express as percentage
denom (float) – denominator
- Returns:
value expressed as a percentage.
- Return type:
Float
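The calculation amounts to dividing and scaling by 100, as in this sketch. The helper as_percent is hypothetical, and how percent itself treats a zero denominator is not documented here; this version simply lets Python raise ZeroDivisionError.

```python
def as_percent(num, denom):
    """Express num as a percentage of denom."""
    return float(num) / float(denom) * 100.0


# e.g. 250 reads out of 1000 is 25% of the total
print(as_percent(250, 1000))
```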
- auto_process_ngs.barcodes.analysis.report_barcodes(counts, lane=None, sample_sheet=None, cutoff=None, mismatches=0, minimum_read_fraction=1e-06, reporter=None)
Report barcode statistics
- Parameters:
counts (BarcodeCounter) – BarcodeCounter instance with barcode data to report on
lane (integer) – optional, restrict report to the specified lane (‘None’ reports across all lanes)
sample_sheet (str) – optional, sample sheet file to match actual barcodes/groups against
mismatches (integer) – optional, maximum number of mismatches to allow for grouping; zero means there will be no grouping (default)
cutoff (float) – optional, decimal fraction representing minimum percentage of total reads that must be associated with a barcode in order to be included in analyses (e.g. 0.001 = 0.1%). Default is to include all barcodes
minimum_read_fraction – speed-up parameter, excludes barcodes with less than this fraction of associated reads (speeds up the grouping calculation at the cost of some precision)
reporter (Reporter) – Reporter instance to write results to for reporting (optional, default is to write to stdout)
- Returns:
Reporter instance
- Return type: