auto_process_ngs.barcodes.analysis

Classes and functions for analysing barcodes (i.e. index sequences) from FASTQ read headers:

  • BarcodeCounter: utility class for counting barcode sequences

  • BarcodeGroup: class for storing groups of related barcodes

  • SampleSheetBarcodes: class for handling barcode information from a sample sheet

  • Reporter: class for generating reports of barcode statistics

  • report_barcodes: populate Reporter with analysis of barcode stats

  • detect_barcodes_warnings: check whether reports include warnings

  • make_title: turn a string into a Markdown/rst title

  • percent: return values as percentage

class auto_process_ngs.barcodes.analysis.BarcodeCounter(*counts_files)

Utility class to manage counts of barcode sequences

The BarcodeCounter class manages the counting of barcode index sequences and facilitates subsequent analysis of their frequency.

Usage

To set up a new (empty) BarcodeCounter:

>>> bc = BarcodeCounter()

To count the sequences from a fastq file (keeping the lane association):

>>> for read in FastqIterator("example.fq"):
...   bc.count_barcode(read.seqid.index_sequence,
...                    lane=read.seqid.flowcell_lane)

To get a list of the barcodes sorted by order of frequency (highest to lowest) across all lanes:

>>> bc.barcodes()

To get the count for a barcode in lane 1:

>>> bc.counts('TAGGCATGTAGCCTCT',1)

Reading and writing counts

Counts can be output to a file:

>>> bc.write("counts.out")

and subsequently loaded into a new BarcodeCounter:

>>> bc2 = BarcodeCounter("counts.out")

Multiple counts files can be combined:

>>> bc_all = BarcodeCounter("counts1.out","counts2.out")

Grouping barcodes

To put barcodes into groups use the ‘group’ method, e.g. to group all barcodes from lane 2:

>>> groups = bc.group(2)

which produces a list of BarcodeGroup instances.

analyse(lane=None, sample_sheet=None, cutoff=None, mismatches=0, minimum_read_fraction=1e-06)

Analyse barcode frequencies

Returns a dictionary with the following keys:

  • barcodes: list of barcodes (or reference barcodes, if mismatches > 0)

  • cutoff: the specified cutoff fraction

  • mismatches: the specified number of mismatches to allow

  • total_reads: the total number of reads for the specified lane (or all reads, if no lane was specified)

  • coverage: the number of reads after cutoffs have been applied

  • counts: dictionary with barcodes from the ‘barcodes’ list as keys; each key points to a dictionary with keys:

    • reads: number of reads associated with this barcode (or group, if mismatches > 0)

    • sample: name of the associated sample (if a sample sheet was supplied, otherwise ‘None’)

    • sequences: number of sequences in the group (always 1 if mismatches == 0)

Parameters:
  • lane (integer) – lane to restrict analysis to (None analyses all lanes)

  • sample_sheet (str) – sample sheet file to compare barcodes against (None skips comparison)

  • cutoff (float) – if mismatches == 0 then barcodes must have at least this fraction of reads to be included (if mismatches > 0 then this condition is applied to groups instead)

  • mismatches (integer) – maximum number of mismatched bases allowed when matching barcodes (default is 0 i.e. exact matches only)

  • minimum_read_fraction – speed-up parameter, excludes barcodes with less than this fraction of associated reads (speeds up the grouping calculation at the cost of some precision)
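The structure of the returned dictionary can be easier to follow with a concrete instance. The sketch below builds one by hand, following the keys listed above, and walks it to produce a per-barcode summary. The barcodes, read numbers and sample name are invented for illustration, and `summarise` is a hypothetical helper, not part of the module.

```python
# Hand-written example of the documented structure; the numbers and
# sample names are invented for illustration only
analysis = {
    "barcodes": ["TAGGCATG", "CGTACTAG"],
    "cutoff": 0.001,
    "mismatches": 0,
    "total_reads": 1000,
    "coverage": 950,
    "counts": {
        "TAGGCATG": {"reads": 600, "sample": "S1", "sequences": 1},
        "CGTACTAG": {"reads": 350, "sample": None, "sequences": 1},
    },
}


def summarise(analysis):
    """Return one tab-delimited summary line per barcode (hypothetical)."""
    lines = []
    for barcode in analysis["barcodes"]:
        entry = analysis["counts"][barcode]
        # Share of all reads assigned to this barcode (or group)
        pct = 100.0 * entry["reads"] / analysis["total_reads"]
        sample = entry["sample"] or "(no sample)"
        lines.append("%s\t%d\t%.1f%%\t%s"
                     % (barcode, entry["reads"], pct, sample))
    return lines


for line in summarise(analysis):
    print(line)
```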

barcode_lengths(lane=None)

Return lengths of barcode sequences

Returns a list of the barcode sequence lengths.

Parameters:

lane (int) – if specified then restricts the list to barcodes that appear in the named lane (default is to get lengths from all barcodes in all lanes)

barcodes(lane=None)

List barcodes sorted into order of frequency

Optionally restricts the list, and the frequency ordering, to a specified lane.

Parameters:

lane (int) – if not None then only list barcodes that appear in the specified lane (and ordering only uses the frequencies in that lane)

Returns:

list of barcodes.

Return type:

List

count_barcode(barcode, lane=None, incr=1)

Increment count of a barcode sequence

Parameters:
  • barcode (str) – barcode sequence to count

  • lane (int) – lane that the barcode appears in (None if unknown)

  • incr (int) – increment the count for the barcode in the lane by this amount (defaults to 1)
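As a rough illustration of per-lane counting of this kind, the sketch below keeps counts in a nested dictionary keyed by lane and then by barcode. `SimpleCounter` is a hypothetical stand-in written for this example; it is not the BarcodeCounter implementation and only mirrors the documented `count_barcode` and `counts` signatures.

```python
from collections import defaultdict


class SimpleCounter:
    """Hypothetical stand-in illustrating per-lane barcode counting."""

    def __init__(self):
        # Maps lane (int or None) -> barcode -> count
        self._counts = defaultdict(lambda: defaultdict(int))

    def count_barcode(self, barcode, lane=None, incr=1):
        # Add 'incr' to the count for this barcode in this lane
        self._counts[lane][barcode] += incr

    def counts(self, barcode, lane=None):
        # With no lane, sum the counts over every lane
        if lane is None:
            return sum(c.get(barcode, 0) for c in self._counts.values())
        return self._counts.get(lane, {}).get(barcode, 0)


sc = SimpleCounter()
sc.count_barcode("TAGGCATG", lane=1)
sc.count_barcode("TAGGCATG", lane=1, incr=4)
sc.count_barcode("TAGGCATG", lane=2)
print(sc.counts("TAGGCATG", 1))  # count in lane 1
print(sc.counts("TAGGCATG"))     # count across all lanes
```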

counts(barcode, lane=None)

Return the number of counts for the barcode

If ‘lane’ is None then return counts across all lanes.

counts_all(barcode)

Number of counts for the barcode across all lanes

filter_barcodes(cutoff=None, lane=None)

Return subset of index sequences filtered by specified criteria

Parameters:
  • cutoff (float) – barcodes must account for at least this fraction of all reads to be included. Total reads are all reads if lane is ‘None’, or else total for the specified lane

  • lane (integer) – barcodes must appear in this lane

Returns:

list of barcodes.

Return type:

List
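The cutoff criterion can be sketched as a simple fraction test. The example below assumes the counts for one lane (or for all lanes) are already available as a plain dictionary; `filter_by_fraction` is a hypothetical helper, and returning the result most-frequent-first is a choice made for this sketch rather than documented behaviour.

```python
def filter_by_fraction(counts, cutoff=None):
    """Return barcodes whose share of all reads is at least 'cutoff'.

    'counts' is a plain dict mapping barcode -> read count (an assumed
    input shape for this sketch).
    """
    total = sum(counts.values())
    # Most frequent first
    ranked = sorted(counts, key=lambda b: counts[b], reverse=True)
    if cutoff is None or total == 0:
        # No cutoff (or nothing counted): keep everything
        return ranked
    return [b for b in ranked if counts[b] / total >= cutoff]


example_counts = {"AAAA": 900, "CCCC": 90, "GGGG": 10}
print(filter_by_fraction(example_counts, cutoff=0.05))
```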

group(lane, mismatches=2, cutoff=None, seed_barcodes=None, minimum_read_fraction=1e-06)

Put barcodes into groups of similar sequences

Returns a list of BarcodeGroup instances

Parameters:
  • lane – lane number to restrict the pool of barcodes to (set to None to use barcodes from all lanes)

  • mismatches – number of mismatches to allow when creating groups

  • cutoff – minimum number of reads as a fraction of all reads that a group must contain to be included (set to None to disable cut-off)

  • minimum_read_fraction – speed-up parameter, excludes barcodes with less than this fraction of associated reads. Speeds up the grouping calculation at the cost of some precision

  • seed_barcodes (list) – optional, set of barcode sequences (typically, expected index sequences from a sample sheet) which will be used to build groups around even if they have low associated counts
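To give a feel for what grouping by mismatches means, the sketch below uses a simple greedy scheme: barcodes are visited from most to least frequent, and each one either joins the first existing group whose reference it is close enough to, or starts a new group. This is an assumption made for illustration and is not necessarily the algorithm the library uses; `hamming` and `greedy_groups` are hypothetical names, and the cutoff, seed barcode and minimum read fraction options are not modelled.

```python
def hamming(a, b):
    # Number of differing positions; None if the lengths differ
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)


def greedy_groups(counts, mismatches=2):
    """Group barcodes around the most frequent sequences.

    'counts' maps barcode -> read count. Returns a list of dicts with
    'reference', 'sequences' and 'counts' keys (an assumed shape).
    """
    groups = []
    for barcode in sorted(counts, key=lambda b: counts[b], reverse=True):
        for grp in groups:
            d = hamming(barcode, grp["reference"])
            if d is not None and d <= mismatches:
                # Close enough to this reference: fold it into the group
                grp["sequences"].append(barcode)
                grp["counts"] += counts[barcode]
                break
        else:
            # No existing group is close enough: start a new one
            groups.append({"reference": barcode,
                           "sequences": [barcode],
                           "counts": counts[barcode]})
    return groups


example = {"AAAAAA": 1000, "AAAAAT": 50, "CCCCCC": 800, "CCCCGG": 20}
for grp in greedy_groups(example, mismatches=1):
    print(grp["reference"], grp["counts"], grp["sequences"])
```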

property lanes

List of lane numbers that barcodes are counted in

Returns:

list of integer lane numbers in ascending order.

Return type:

List

nreads(lane=None)

Number of reads counted

If lane is None then this is the number of reads across all lanes (i.e. all reads from all files).

Parameters:

lane (int) – only report number of reads for the specified lane

Returns:

number of reads.

Return type:

Integer

read(filen)

Read count data from a file

The ‘counts’ file is a four-column tab-delimited file:

  • Column 1: lane

  • Column 2: rank (ignored when reading in)

  • Column 3: barcode sequence

  • Column 4: counts

Older ‘counts’ files are three column tab-delimited with the following columns:

  • Column 1: rank (ignored when reading in)

  • Column 2: barcode sequence

  • Column 3: counts

In either case if the lane is missing or cannot be interpreted as an integer then it’s set to be ‘None’.
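A minimal sketch of parsing a single line in either layout is shown below. `parse_counts_line` is a hypothetical helper written for this example, not the method's implementation; it assumes the line is a data line and does not handle any header or comment lines a real counts file might contain.

```python
def parse_counts_line(line):
    """Split one tab-delimited counts line into (lane, barcode, count)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        # Newer format: lane, rank, barcode, counts
        lane, _rank, barcode, count = fields
    elif len(fields) == 3:
        # Older format: rank, barcode, counts (no lane column)
        lane = None
        _rank, barcode, count = fields
    else:
        raise ValueError("expected 3 or 4 columns, got %d" % len(fields))
    try:
        lane = int(lane)
    except (TypeError, ValueError):
        # Missing or non-integer lane is treated as unknown
        lane = None
    return lane, barcode, int(count)


print(parse_counts_line("1\t1\tTAGGCATG\t285302\n"))
print(parse_counts_line("1\tTAGGCATG\t285302\n"))
```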

write(filen)

Write barcode data to a file

class auto_process_ngs.barcodes.analysis.BarcodeGroup(barcode, counts=0)

Class for storing groups of related barcodes

A group stores a representative ‘reference’ sequence and zero or more related sequences, along with the total counts for all the sequences combined.

Create a group:

>>> grp = BarcodeGroup('GCTACGCTCTAAGCCT',2894178)

Add a barcode to the group if it’s related:

>>> if grp.match('GCTCCGCTCTAAGCCT'):
...   grp.add('GCTCCGCTCTAAGCCT',94178)

List the barcodes in the group:

>>> grp.sequences

Get the total number of counts:

>>> grp.counts

Retrieve the reference barcode:

>>> grp.reference

add(seq, counts)

Add a sequence to the group

Parameters:
  • seq (str) – sequence to add to the group

  • counts (int) – count for that sequence (will be added to the total for the group)

property counts

Return the total number of counts for all sequences

match(seq, mismatches=2)

Check if a sequence matches the reference

The supplied sequence is checked against the stored reference, and is a match if the number of mismatching positions is less than or equal to the number of allowed mismatches.

Note that:

  • for dual index sequences and references (i.e. sequences which contain either a ‘+’ or ‘-’ character to separate the indices within the sequence), each index is checked separately and the mismatch limit is applied per index (i.e. not across the sequence as a whole);

  • positions with an ‘N’ in either sequence automatically count as a mismatch in that position;

  • sequences which differ in length automatically fail to match.

Parameters:
  • seq (str) – sequence to check against the reference sequence

  • mismatches (int) – maximum number of mismatches that are allowed for the input sequence to be considered to match the reference (default is 2).

Returns:

True if sequences match within the specified tolerance, False otherwise (or if sequence lengths differ)

Return type:

Boolean
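The matching rules described above can be sketched in a few lines. The functions below are hypothetical and written only to illustrate those rules (per-index limits for dual indexes, ‘N’ always counting as a mismatch, and unequal lengths never matching); they are not the library's implementation.

```python
import re


def count_mismatches(a, b):
    # A position is a mismatch if the bases differ or either base is 'N'
    return sum(1 for x, y in zip(a, b) if x != y or x == "N" or y == "N")


def sequences_match(seq, reference, mismatches=2):
    """Return True if 'seq' matches 'reference' within the tolerance."""
    if len(seq) != len(reference):
        return False
    # Split dual indexes on '+' or '-' so the limit applies per index
    seq_parts = re.split(r"[+-]", seq)
    ref_parts = re.split(r"[+-]", reference)
    if len(seq_parts) != len(ref_parts):
        return False
    for s, r in zip(seq_parts, ref_parts):
        if len(s) != len(r) or count_mismatches(s, r) > mismatches:
            return False
    return True


print(sequences_match("GCTCCGCTCTAAGCCT", "GCTACGCTCTAAGCCT"))
print(sequences_match("AAAA+CCCC", "AATT+CCGG", mismatches=2))
```

In the second call each index has two mismatches, so the pair matches with a per-index limit of 2 even though the sequences differ at four positions overall.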

property reference

Return the reference sequence for the group

property sequences

List the barcode sequences in the group

class auto_process_ngs.barcodes.analysis.Reporter

Class for generating reports of barcode statistics

Add arbitrary blocks of text with optional keyword ‘tags’, which can then be written to a text file or stream, or as an XLS or HTML file.

Usage:

Make a new Reporter:

>>> r = Reporter()

Add a title:

>>> r.add("This is the title",title=True)

Add a heading:

>>> r.add("A heading",heading=True)

Add some text:

>>> r.add("Lorem ipsum")

Write to file:

>>> r.write("report.txt")

Write as XLS:

>>> r.write_xls("report.xls")

Write as HTML:

>>> r.write_html("report.html",title="Barcodes")

add(content, **kws)

Add content to the report

Supplied content is appended to the existing content.

Arbitrary keyword-value pairs can also be associated with the content.

property has_warnings

Report whether warnings were found in analysis

write(fp=None, filen=None, title=None)

Write the report to a file or stream

write_html(html_file, title=None, no_styles=False)

Write the report to an HTML file

write_xls(xls_file, title=None)

Write the report to an XLS file

class auto_process_ngs.barcodes.analysis.SampleSheetBarcodes(sample_sheet_file)

Class for handling index sequences from a sample sheet

Given a SampleSheet.csv file this class can extract the index sequences (aka barcodes) corresponding to sample names, and provides methods to look up one from the other.

(Note that for dual index sequences the indices are joined with a ‘+’ character, so for example ‘AGCCCTT’ and ‘GTTACAT’ becomes ‘AGCCCTT+GTTACAT’.)

Create an initial lookup object:

>>> s = SampleSheetBarcodes('SampleSheet.csv')

Get a list of barcodes in lane 2:

>>> s.barcodes(2)

Look up the sample name in lane 2 matching a barcode:

>>> s.lookup_sample('ATTGTG',2)

Look up the sample name matching a barcode in all lanes (e.g. if the sample sheet doesn’t include explicit lane information):

>>> s.lookup_sample('ATTGTG')

barcodes(lane=None)

Return a list of index sequences

If a lane is specified then a list of normalised barcode sequences for that lane is returned; if no lane is specified (or lane is ‘None’) then all barcode sequences are returned.

Parameters:

lane (int) – lane to restrict barcodes to, or None to get all barcode sequences

lookup_barcode(sample, lane=None)

Return normalised barcode sequence matching sample name

Parameters:
  • sample (str) – sample name to get normalised barcode sequence for

  • lane (int) – optional, lane to look for matching barcode in

lookup_sample(barcode, lane=None)

Return sample name matching barcode sequence

Parameters:
  • barcode (str) – normalised barcode sequence to get sample name for

  • lane (int) – optional, lane to look for matching sample in

samples(lane=None)

Return a list of sample names

If a lane is specified then a list of sample names for that lane is returned; if no lane is specified (or lane is ‘None’) then all sample names are returned.

Parameters:

lane (int) – lane to restrict sample names to, or None to get all sample names

auto_process_ngs.barcodes.analysis.detect_barcodes_warnings(report_file)

Look for warning text in barcode.report file

Parameters:

report_file (str) – path to barcode report file

Returns:

True if warnings were found, False if not.

Return type:

Boolean

auto_process_ngs.barcodes.analysis.make_title(text, underline='=')

Turn a string into a Markdown/rst title

Parameters:
  • text (str) – text to make into title

  • underline (str) – underline character (defaults to ‘=’)

Returns:

title text.

Return type:

String
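A title of this kind is typically the text followed by a line of underline characters of the same length. The sketch below shows that idea; `title_with_underline` is a hypothetical helper, and details such as any trailing newline in the real function's output are not confirmed by this documentation.

```python
def title_with_underline(text, underline="="):
    """Return 'text' followed by a matching line of underline characters."""
    return "%s\n%s" % (text, underline * len(text))


print(title_with_underline("Barcodes"))
print(title_with_underline("Lane 1", underline="-"))
```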

auto_process_ngs.barcodes.analysis.percent(num, denom)

Return values as percentage

Parameters:
  • num (float) – number to express as percentage

  • denom (float) – denominator

Returns:

value expressed as a percentage.

Return type:

Float
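The calculation amounts to a ratio scaled by 100. The sketch below is a hypothetical equivalent written for illustration; how the real function treats a zero denominator is not stated here, so this version simply lets Python raise `ZeroDivisionError`.

```python
def percent_of(num, denom):
    """Express num/denom as a percentage (hypothetical helper)."""
    return float(num) / float(denom) * 100.0


# A barcode with 250 reads out of 1000 accounts for a quarter of reads
print(percent_of(250, 1000))
```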

auto_process_ngs.barcodes.analysis.report_barcodes(counts, lane=None, sample_sheet=None, cutoff=None, mismatches=0, minimum_read_fraction=1e-06, reporter=None)

Report barcode statistics

Parameters:
  • counts (BarcodeCounter) – BarcodeCounter instance with barcode data to report on

  • lane (integer) – optional, restrict report to the specified lane (‘None’ reports across all lanes)

  • sample_sheet (str) – optional, sample sheet file to match actual barcodes/groups against

  • mismatches (integer) – optional, maximum number of mismatches to allow for grouping; zero means there will be no grouping (default)

  • cutoff (float) – optional, decimal fraction representing minimum percentage of total reads that must be associated with a barcode in order to be included in analyses (e.g. 0.001 = 0.1%). Default is to include all barcodes

  • minimum_read_fraction – speed-up parameter, excludes barcodes with less than this fraction of associated reads (speeds up the grouping calculation at the cost of some precision)

  • reporter (Reporter) – Reporter instance to write results to for reporting (optional, default is to write to stdout)

Returns:

Reporter instance

Return type:

Reporter