auto_process_ngs.icell8.utils

icell8.utils.py

Utility classes and functions for processing the outputs from the ICELL8 single-cell platform.

Classes:

  • ICell8WellList: class representing ICELL8 well list file

  • ICell8Read1: class representing an ICELL8 R1 read

  • ICell8ReadPair: class representing an ICELL8 R1/R2 read-pair

  • ICell8FastqIterator: class for iterating over ICELL8 R1/R2 FASTQ-pair

  • ICell8Stats: class for gathering stats from ICELL8 FASTQ pairs

Functions:

  • get_batch_size: get optimal size for batches of reads

  • batch_fastqs: split reads into batches

  • normalize_sample_name: replace special characters in well list sample names

  • get_bases_mask_icell8: generate bases mask for ICELL8 run

  • get_bases_mask_icell8_atac: generate bases mask for ICELL8 ATAC-seq run

class auto_process_ngs.icell8.utils.ICell8FastqIterator(fqr1, fqr2)

Class for iterating over an ICELL8 R1/R2 FASTQ-pair

The iterator yields ICell8ReadPair instances one at a time, for example:

>>> for pair in ICell8FastqIterator(fq1,fq2):
...   print("-- R1: %s" % pair.r1)
...   print("   R2: %s" % pair.r2)

class auto_process_ngs.icell8.utils.ICell8Read1(fastq_read)

Class representing an ICELL8 R1 read

property barcode

Inline barcode sequence extracted from the R1 read

property barcode_quality

Inline barcode sequence quality extracted from the R1 read

property min_barcode_quality

Minimum inline barcode quality score

The score is encoded as a character e.g. ‘/’ or ‘A’.

property min_umi_quality

Minimum UMI sequence quality score

The score is encoded as a character e.g. ‘/’ or ‘A’.

property read

R1 read

property umi

UMI sequence extracted from the R1 read

property umi_quality

UMI sequence quality extracted from the R1 read
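
The properties above amount to slicing fixed-width fields from the start of the R1 sequence and quality strings. A minimal sketch of that idea follows; the function name split_r1 is hypothetical, and the field lengths (an 11-base barcode followed by a 10-base UMI) are assumptions made for illustration rather than values taken from this module.

```python
# Assumed layout: barcode first, then UMI. Both lengths are illustrative
# assumptions, not constants read from auto_process_ngs.
BARCODE_LENGTH = 11
UMI_LENGTH = 10

def split_r1(sequence, quality):
    """Slice barcode and UMI fields from R1 sequence and quality strings."""
    umi_end = BARCODE_LENGTH + UMI_LENGTH
    barcode_quality = quality[:BARCODE_LENGTH]
    umi_quality = quality[BARCODE_LENGTH:umi_end]
    return {
        "barcode": sequence[:BARCODE_LENGTH],
        "umi": sequence[BARCODE_LENGTH:umi_end],
        "barcode_quality": barcode_quality,
        "umi_quality": umi_quality,
        # ASCII-encoded quality characters sort in score order, so min()
        # on the string gives the character encoding the lowest score
        "min_barcode_quality": min(barcode_quality),
        "min_umi_quality": min(umi_quality),
    }

# Made-up read: 11 barcode bases followed by 10 UMI bases
fields = split_r1("AACCGGTTAAC" + "GGGGGTTTTT",
                  "IIIIIIII/II" + "AAAAAIIIII")
```

Returning the minimum as a character, rather than a number, matches the way the min_barcode_quality and min_umi_quality properties are described above.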

class auto_process_ngs.icell8.utils.ICell8ReadPair(r1, r2)

Class representing an ICELL8 R1/R2 read-pair

property r1

R1 read from the pair

property r2

R2 read from the pair

class auto_process_ngs.icell8.utils.ICell8Stats(*fastqs, **kws)

Class for gathering statistics on ICELL8 FASTQ R1 files

Given a set of paths to FASTQ R1 files (from ICELL8 Fastq file pairs), collects statistics on the number of reads, barcodes and distinct UMIs.

NB in the list of distinct UMIs each UMI appears only once, even though the same UMI may appear multiple times across the FASTQ files.

barcodes()

Return list of barcodes from the FASTQs

distinct_umis(barcode=None)

Return all distinct UMIs, or those for a single barcode

Invoked without arguments, returns a list of distinct UMIs found across the files. If a barcode is specified then returns a list of UMIs associated with that barcode.

Parameters:

barcode (str) – optional, specify barcode for which the list of distinct UMIs will be returned.

Returns:

list of distinct UMI sequences.

Return type:

List

nreads(barcode=None)

Return total number of reads, or the number for a single barcode

Invoked without arguments, returns the total number of reads analysed. If a barcode is specified then returns the number of reads with that barcode.

Parameters:

barcode (str) – optional, specify barcode for which the read count will be returned.

Returns:

number of reads.

Return type:

Integer
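
The difference between read counts and distinct UMIs can be illustrated without the class itself. The sketch below works on made-up (barcode, UMI) pairs held in memory; the function names mirror the methods above, but this is not the ICell8Stats implementation, and returning the UMIs in sorted order is an assumption.

```python
from collections import Counter, defaultdict

# Made-up (barcode, UMI) pairs standing in for reads from R1 files
reads = [
    ("AACCTTCCTTA", "ACGTACGTAC"),
    ("AACCTTCCTTA", "ACGTACGTAC"),  # same UMI seen a second time
    ("AACCTTCCTTA", "TTGCATTGCA"),
    ("AACGAACGCTC", "ACGTACGTAC"),
]

counts = Counter()        # barcode -> number of reads
umis = defaultdict(set)   # barcode -> set of UMIs seen with it
for barcode, umi in reads:
    counts[barcode] += 1
    umis[barcode].add(umi)

def nreads(barcode=None):
    """Total number of reads, or the number for one barcode."""
    if barcode is None:
        return sum(counts.values())
    return counts[barcode]

def distinct_umis(barcode=None):
    """Distinct UMIs overall, or those seen with one barcode."""
    if barcode is None:
        return sorted(set().union(*umis.values()))
    return sorted(umis.get(barcode, set()))
```

Here four reads yield only two distinct UMIs overall, because a UMI repeated within or across barcodes is listed once.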

class auto_process_ngs.icell8.utils.ICell8StatsCollector(verbose=False)

Class to collect ICELL8 barcode and UMI counts

This class essentially wraps a single function which gets ICELL8 barcodes and distinct UMI counts from a Fastq file. It is used by the ICell8Stats class to collect counts for each file supplied.

Example usage:

>>> collector = ICell8StatsCollector()
>>> fq,counts,umis = collector(fastq)

By default the collection process is (relatively) quiet; more verbose output can be requested by setting the verbose argument to True on instantiation.

The collector has been implemented as a callable class so that it can be used with both the built-in map function and Pool.map from the Python multiprocessing module. (Specifically, this implementation works around multiprocessing being unable to pickle an instance method; otherwise we could use e.g.

>>> pool.map(collector.collect_fastq_stats,fastqs)

See the question at https://stackoverflow.com/q/1816958/579925 and specifically the answer at https://stackoverflow.com/a/6975654/579925 for more elaboration.)
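
The callable-class pattern described above can be sketched in isolation. LineCounter and count_in_parallel are hypothetical names invented for this illustration, and the per-item job (counting lines in a string) is a stand-in for collecting statistics from a Fastq file.

```python
from multiprocessing import Pool

class LineCounter:
    """Hypothetical callable class: instances can be handed to map()."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def __call__(self, text):
        # Calling the instance delegates to the worker method
        return self.count_lines(text)

    def count_lines(self, text):
        if self.verbose:
            print("Counting lines")
        return (text, len(text.splitlines()))

def count_in_parallel(texts, nprocs=2):
    # The instance (not a bound method) is what gets sent to the workers
    with Pool(nprocs) as pool:
        return pool.map(LineCounter(), texts)

# The same instance also works with the built-in map function
counter = LineCounter()
results = list(map(counter, ["a\nb", "c"]))
```

When run as a script, count_in_parallel should be called from under an `if __name__ == "__main__":` guard, as is usual for multiprocessing code.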

collect_fastq_stats(fastq)

Get barcode and distinct UMI counts for Fastq file

This method can be called directly, but is also invoked implicitly if its parent instance is called.

Parameters:

fastq (str) – path to Fastq file

Returns:

tuple consisting of (fastq,counts,umis)

where ‘fastq’ is the path to the input Fastq file, ‘counts’ is a dictionary with barcodes as keys and read counts as values, and ‘umis’ is a dictionary with barcodes as keys and sets of UMIs as values.

Return type:

Tuple
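
Because each call returns its own (fastq,counts,umis) tuple, results from several files have to be combined before totals can be reported. The following is one possible way to do that merge, not the code used by ICell8Stats; merge_results is a hypothetical helper and the input data is made up.

```python
from collections import Counter

def merge_results(results):
    """Combine per-file (fastq, counts, umis) tuples into overall totals."""
    total_counts = Counter()
    all_umis = {}
    for _fastq, counts, umis in results:
        # Read counts for the same barcode are summed across files
        total_counts.update(counts)
        # UMI sets are unioned, so a UMI seen in two files is kept once
        for barcode, umi_set in umis.items():
            all_umis.setdefault(barcode, set()).update(umi_set)
    return total_counts, all_umis

# Made-up results for two Fastq files
results = [
    ("a_R1.fastq", {"AACCTTCCTTA": 2}, {"AACCTTCCTTA": {"ACGTACGTAC"}}),
    ("b_R1.fastq", {"AACCTTCCTTA": 1, "AACGAACGCTC": 4},
     {"AACCTTCCTTA": {"ACGTACGTAC", "TTGCATTGCA"},
      "AACGAACGCTC": {"GGGGGCCCCC"}}),
]
total_counts, all_umis = merge_results(results)
```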

class auto_process_ngs.icell8.utils.ICell8WellList(well_list_file)

Class representing an ICELL8 well list file

The file is tab-delimited and consists of an uncommented header line which lists the fields (‘Row’, ‘Col’, ‘Candidate’, …), followed by lines of data.

The key columns are ‘Sample’ (gives the cell type) and ‘Barcode’ (the inline barcode sequence).

barcodes()

Return a list of barcodes

sample(barcode)

Return sample (=cell type) corresponding to barcode

samples()

Return a list of samples
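
A file with the layout described above can be read with the standard csv module. The sketch below is independent of ICell8WellList: read_well_list is a hypothetical helper, and the well list content is made up and reduced to a handful of columns (real files may carry others).

```python
import csv
import io

# Made-up well list content: a header line followed by tab-delimited data
WELL_LIST = (
    "Row\tCol\tCandidate\tBarcode\tSample\n"
    "0\t4\tTrue\tAACCTTCCTTA\tESC2\n"
    "0\t6\tTrue\tAACGAACGCTC\tESC2\n"
    "1\t2\tTrue\tAACGGTAACCA\td1.2\n"
)

def read_well_list(fp):
    """Return a barcode -> sample mapping from a tab-delimited well list."""
    reader = csv.DictReader(fp, delimiter="\t")
    return {row["Barcode"]: row["Sample"] for row in reader}

samples_by_barcode = read_well_list(io.StringIO(WELL_LIST))
barcodes = list(samples_by_barcode)
samples = sorted(set(samples_by_barcode.values()))
```

With a real file, an open file handle would be passed in place of the io.StringIO object.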

auto_process_ngs.icell8.utils.batch_fastqs(fastqs, batch_size, basename='batched', out_dir=None)

Splits reads from one or more Fastqs into batches

Concatenates input Fastq files and then splits reads into smaller Fastqs using the external ‘batch’ utility.

Parameters:
  • fastqs (list) – list of paths to one or more Fastq files to take reads from

  • batch_size (int) – number of reads to allocate to each batch

  • basename (str) – optional basename to use for the output Fastq files (default: ‘batched’)

  • out_dir (str) – optional path to a directory where the batched Fastqs will be written
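
To make the batching behaviour concrete, here is a pure-Python sketch of the same idea. It does not call the external utility mentioned above; batch_fastq_records is a hypothetical function, the output naming scheme is an assumption, and the inputs are assumed to be uncompressed Fastqs with four lines per read.

```python
import itertools
import os

def batch_fastq_records(fastqs, batch_size, basename="batched", out_dir="."):
    """Write reads from `fastqs` into files of at most `batch_size` reads."""
    def records():
        # Stream 4-line records from each input file in turn, which has
        # the same effect as concatenating the inputs first
        for path in fastqs:
            with open(path) as fp:
                while True:
                    record = list(itertools.islice(fp, 4))
                    if not record:
                        break
                    yield record

    outputs = []
    stream = records()
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            break
        # Output file naming is an assumption made for this sketch
        out_path = os.path.join(
            out_dir, "%s.B%03d.fastq" % (basename, len(outputs)))
        with open(out_path, "w") as out:
            for record in batch:
                out.writelines(record)
        outputs.append(out_path)
    return outputs
```

Only the final batch can hold fewer than batch_size reads, since reads are taken from the combined stream in order.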

auto_process_ngs.icell8.utils.get_bases_mask_icell8(bases_mask, sample_sheet=None)

Reset the supplied bases mask string so that only the bases containing the inline barcode and UMIs are kept, and any remaining bases are ignored.

If a sample sheet is also supplied then an additional update will be made to ensure that the bases mask respects the barcode lengths given there.

Parameters:
  • bases_mask (str) – initial bases mask string to update

  • sample_sheet (str) – path to optional sample sheet

Returns:

updated bases mask string

Return type:

String
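
The first step described above can be sketched as a string operation on a bcl2fastq-style bases mask. This is an illustration only: keep_leading_bases is a hypothetical helper, it assumes the inline barcode and UMI sit at the start of the first read, the number of bases to keep is passed in by the caller, and the sample sheet adjustment is not covered.

```python
import re

def keep_leading_bases(bases_mask, nbases):
    """Keep only the first `nbases` of the first read; ignore the rest."""
    components = bases_mask.split(",")
    match = re.fullmatch(r"[yY](\d+)", components[0])
    if match is None:
        raise ValueError("unexpected first component: %r" % components[0])
    length = int(match.group(1))
    if length > nbases:
        # 'y' marks bases to use, 'n' marks bases to ignore
        components[0] = "y%dn%d" % (nbases, length - nbases)
    return ",".join(components)

# 21 is an assumed combined barcode + UMI length, used for illustration
updated = keep_leading_bases("y25,I8,I8,y101", 21)
```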

auto_process_ngs.icell8.utils.get_bases_mask_icell8_atac(runinfo_xml)

Acquire a bases mask for ICELL8 scATAC-seq

Generates an initial bases mask based on the run contents, and then updates this so that only the first 8 bases of each of the index reads are used.

Parameters:

runinfo_xml (str) – path to the RunInfo.xml for the sequencing run

Returns:

ICELL8 scATAC-seq bases mask string

Return type:

String
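
The update step can be illustrated on a bases mask string directly. The sketch below does not read RunInfo.xml; it starts from a mask that has already been generated, and trim_index_reads is a hypothetical helper written for this example.

```python
import re

def trim_index_reads(bases_mask, nbases=8):
    """Limit every index read in the mask to its first `nbases` bases."""
    components = []
    for component in bases_mask.split(","):
        match = re.fullmatch(r"I(\d+)", component)
        if match and int(match.group(1)) > nbases:
            # Use the first `nbases` index bases and ignore the remainder
            extra = int(match.group(1)) - nbases
            component = "I%dn%d" % (nbases, extra)
        components.append(component)
    return ",".join(components)

atac_mask = trim_index_reads("y50,I10,I10,y50")
```

Index reads that are already 8 bases or shorter are left unchanged, as are the non-index reads.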

auto_process_ngs.icell8.utils.get_batch_size(fastqs, min_batches=1, max_batch_size=100000000, incr_function=None)

Determine number of reads per batch

Given a maximum batch size (i.e. number of reads per batch), determine the number of batches and actual batch size.

Parameters:
  • fastqs (list) – list of paths to one or more Fastq files to take reads from

  • min_batches (int) – initial minimum number of batches to try

  • max_batch_size (int) – the maximum batch size

  • incr_function (Function) – optional function to use to generate new number of batches to try

Returns:

tuple of (batch_size,nbatches).

Return type:

Tuple
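
One way to read the description above is as a search over the number of batches: start from min_batches and increase it until the resulting batch size fits under the maximum. The sketch below follows that reading and is an assumption about the approach rather than the function's actual code; choose_batch_size is a hypothetical name, it takes the total read count directly instead of counting reads in Fastq files, and rounding the batch size up is also an assumption.

```python
def choose_batch_size(nreads, min_batches=1, max_batch_size=100000000,
                      incr_function=None):
    """Return (batch_size, nbatches) for a given total number of reads."""
    if incr_function is None:
        # Default step (an assumption): try one more batch each time
        incr_function = lambda n: n + 1
    nbatches = min_batches
    while True:
        # Round up so that nbatches * batch_size covers every read
        batch_size = (nreads + nbatches - 1) // nbatches
        if batch_size <= max_batch_size:
            return (batch_size, nbatches)
        nbatches = incr_function(nbatches)

stepwise = choose_batch_size(250, max_batch_size=100)
doubling = choose_batch_size(250, max_batch_size=100,
                             incr_function=lambda n: n * 2)
```

A custom incr_function, such as the doubling step used here, changes which batch counts are tried and so can change the result.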

auto_process_ngs.icell8.utils.normalize_sample_name(s)

Clean up sample name from well list file

Replaces ‘illegal’ characters in the supplied name with underscore characters and returns the normalized name.

Parameters:

s (str) – sample name from well list file

Returns:

normalized sample name

Return type:

String
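
The replacement can be expressed as a single regular expression substitution. Which characters count as ‘illegal’ is not stated here, so the sketch assumes that anything other than letters, digits, underscores and hyphens is replaced; clean_sample_name is a hypothetical stand-in for the real function.

```python
import re

def clean_sample_name(name):
    """Replace characters outside an assumed 'legal' set with underscores."""
    # Assumption: letters, digits, '_' and '-' are the only legal characters
    return re.sub(r"[^A-Za-z0-9_-]", "_", name)

cleaned = clean_sample_name("d1.2 (rep 1)")
```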

auto_process_ngs.icell8.utils.pass_quality_filter(s, cutoff)

Check if sequence passes quality filter cutoff

Parameters:
  • s (str) – sequence quality scores (PHRED+33)

  • cutoff (int) – minimum quality value; all quality scores must be equal to or greater than this value for the filter to pass

Returns:

True if the quality scores pass the filter cutoff, False if not.

Return type:

Boolean
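
The check can be sketched by decoding each PHRED+33 character to its numeric score and comparing it with the cutoff. This is an illustration that assumes the cutoff is supplied as a numeric score; passes_quality is a hypothetical name, not the function documented above.

```python
def passes_quality(quality, cutoff):
    """True if every PHRED+33 score in `quality` is at least `cutoff`."""
    # PHRED+33 stores a score q as the character with code q + 33
    return all(ord(char) - 33 >= cutoff for char in quality)

high = passes_quality("IIII", 30)       # 'I' encodes a score of 40
low = passes_quality("II/I", 30)        # '/' encodes a score of 14
boundary = passes_quality("II/I", 14)   # equal to the cutoff still passes
```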