auto_process_ngs.icell8.utils
icell8.utils.py
Utility classes and functions for processing the outputs from the ICELL8 single-cell platform.
Classes:
ICell8WellList: class representing ICELL8 well list file
ICell8Read1: class representing an ICELL8 R1 read
ICell8ReadPair: class representing an ICELL8 R1/R2 read-pair
ICell8FastqIterator: class for iterating over ICELL8 R1/R2 FASTQ-pair
ICell8Stats: class for gathering stats from ICELL8 FASTQ pairs
ICell8StatsCollector: class to collect ICELL8 barcode and UMI counts
Functions:
get_batch_size: get optimal size for batches of reads
batch_fastqs: split reads into batches
normalize_sample_name: replace special characters in well list sample names
get_bases_mask_icell8: generate bases mask for ICELL8 run
get_bases_mask_icell8_atac: generate bases mask for ICELL8 ATAC-seq run
pass_quality_filter: check if sequence passes quality filter cutoff
- class auto_process_ngs.icell8.utils.ICell8FastqIterator(fqr1, fqr2)
Class for iterating over an ICELL8 R1/R2 FASTQ-pair
The iterator yields ICell8ReadPair instances one at a time, for example:
>>> for pair in ICell8FastqIterator(fq1,fq2):
...     print("-- R1: %s" % pair.r1)
...     print(" R2: %s" % pair.r2)
- class auto_process_ngs.icell8.utils.ICell8Read1(fastq_read)
Class representing an ICELL8 R1 read
- property barcode
Inline barcode sequence extracted from the R1 read
- property barcode_quality
Inline barcode sequence quality extracted from the R1 read
- property min_barcode_quality
Minimum inline barcode quality score
The score is encoded as a character e.g. ‘/’ or ‘A’.
- property min_umi_quality
Minimum UMI sequence quality score
The score is encoded as a character e.g. ‘/’ or ‘A’.
- property read
R1 read
- property umi
UMI sequence extracted from the R1 read
- property umi_quality
UMI sequence quality extracted from the R1 read
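The minimum-quality properties above return a character rather than a number. As a minimal sketch of how such a character relates to a numeric score, assuming PHRED+33 encoding (the encoding named for pass_quality_filter later on this page), the lowest-quality character in a string can be found with min() and decoded with ord(). The helper names min_quality_char and phred_score are hypothetical and not part of this module:

```python
def min_quality_char(quality):
    """Return the lowest-quality character in a PHRED+33 quality string."""
    # Characters compare by ASCII code, so min() picks the lowest score
    return min(quality)


def phred_score(char):
    """Convert a single PHRED+33 quality character to its integer score."""
    return ord(char) - 33


# Made-up quality string for illustration
worst = min_quality_char("AAEE/EAA")
print(worst, phred_score(worst))  # prints: / 14
```

Under this encoding ‘/’ corresponds to a score of 14 and ‘A’ to a score of 32.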
- class auto_process_ngs.icell8.utils.ICell8ReadPair(r1, r2)
Class representing an ICELL8 R1/R2 read-pair
- property r1
R1 read from the pair
- property r2
R2 read from the pair
- class auto_process_ngs.icell8.utils.ICell8Stats(*fastqs, **kws)
Class for gathering statistics on ICELL8 FASTQ R1 files
Given a set of paths to FASTQ R1 files (from ICELL8 Fastq file pairs), collects statistics on the number of reads, barcodes and distinct UMIs.
NB each UMI appears only once in the list of distinct UMIs, even if it occurs multiple times across the FASTQ files.
- barcodes()
Return list of barcodes from the FASTQs
- distinct_umis(barcode=None)
Return all distinct UMIs, or those for a specific barcode
Invoked without arguments, returns a list of distinct UMIs found across the files. If a barcode is specified then returns a list of UMIs associated with that barcode.
- Parameters:
barcode (str) – optional, specify barcode for which the list of distinct UMIs will be returned.
- Returns:
list of distinct UMI sequences.
- Return type:
List
- nreads(barcode=None)
Return total number of reads, or the number for a specific barcode
Invoked without arguments, returns the total number of reads analysed. If a barcode is specified then returns the number of reads with that barcode.
- Parameters:
barcode (str) – optional, specify barcode for which the read count will be returned.
- Returns:
number of reads.
- Return type:
Integer
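The bookkeeping behind these statistics can be illustrated with a short, self-contained sketch. It is not the class's actual implementation: it assumes the barcode and UMI have already been extracted from each R1 read, and the helper names collect_counts and merge, along with the read data, are made up for illustration.

```python
from collections import defaultdict


def collect_counts(reads):
    """Count reads and gather the set of UMIs for each barcode.

    'reads' is an iterable of (barcode, umi) tuples.
    """
    counts = defaultdict(int)
    umis = defaultdict(set)
    for barcode, umi in reads:
        counts[barcode] += 1
        umis[barcode].add(umi)  # a set keeps each UMI only once
    return dict(counts), dict(umis)


def merge(results):
    """Merge (counts, umis) results gathered from several files."""
    total_counts = defaultdict(int)
    total_umis = defaultdict(set)
    for counts, umis in results:
        for barcode, n in counts.items():
            total_counts[barcode] += n
        for barcode, umi_set in umis.items():
            total_umis[barcode].update(umi_set)
    return dict(total_counts), dict(total_umis)


# Made-up (barcode, umi) pairs standing in for two FASTQ R1 files
file1 = [("AACC", "TTTT"), ("AACC", "TTTT"), ("GGTT", "ACGT")]
file2 = [("AACC", "TTTT"), ("AACC", "CCCC")]

counts, umis = merge([collect_counts(file1), collect_counts(file2)])
nreads = sum(counts.values())                         # total reads
distinct_umis = sorted(set().union(*umis.values()))   # across all barcodes
```

Here the UMI "TTTT" occurs three times for barcode "AACC" but is counted once among the distinct UMIs.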
- class auto_process_ngs.icell8.utils.ICell8StatsCollector(verbose=False)
Class to collect ICELL8 barcode and UMI counts
This class essentially wraps a single function which gets ICELL8 barcodes and distinct UMI counts from a Fastq file. It is used by the ICell8Stats class to collect counts for each file supplied.
Example usage:
>>> collector = ICell8StatsCollector()
>>> fq,counts,barcodes = collector(fastq)
By default the collection process is (relatively) quiet; more verbose output can be requested by setting the verbose argument to True on instantiation.
The collector has been implemented as a callable class so that it can be used with both the built-in map function and Pool.map from the Python multiprocessing module. (Specifically, this implementation works around issues with multiprocessing being unable to pickle an instance method - otherwise we could use e.g.
>>> pool.map(collector.collect_fastq_stats,fastqs)
See the question at https://stackoverflow.com/q/1816958/579925 and specifically the answer at https://stackoverflow.com/a/6975654/579925 for more elaboration.)
- collect_fastq_stats(fastq)
Get barcode and distinct UMI counts for Fastq file
This method can be called directly, but is also invoked implicitly if its parent instance is called.
- Parameters:
fastq (str) – path to Fastq file
- Returns:
tuple consisting of (fastq,counts,umis), where ‘fastq’ is the path to the input Fastq file, ‘counts’ is a dictionary with barcodes as keys and read counts as values, and ‘umis’ is a dictionary with barcodes as keys and sets of UMIs as values.
- Return type:
Tuple
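The callable-class pattern described for the collector can be sketched with a toy example. LineCounter and count_lines are hypothetical stand-ins and not part of this module; the point is only that an instance whose __call__ delegates to a worker method can be handed to map in place of a function. The sketch uses the built-in map; passing the same instance to Pool.map should work along the same lines, provided the class is defined at module level so that it can be pickled.

```python
class LineCounter:
    """Hypothetical callable class that counts the lines in a text block."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def count_lines(self, text):
        """Do the actual work; this can also be called directly."""
        n = len(text.splitlines())
        if self.verbose:
            print("counted %d lines" % n)
        return (text, n)

    def __call__(self, text):
        # Calling the instance delegates to the worker method
        return self.count_lines(text)


counter = LineCounter()
# The instance itself is passed to map, not a bound method
results = list(map(counter, ["a\nb\n", "c\n"]))
```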
- class auto_process_ngs.icell8.utils.ICell8WellList(well_list_file)
Class representing an ICELL8 well list file
The file is tab-delimited and consists of an uncommented header line which lists the fields (‘Row’, ‘Col’, ‘Candidate’, …), followed by lines of data.
The key columns are ‘Sample’ (gives the cell type) and ‘Barcode’ (the inline barcode sequence).
- barcodes()
Return a list of barcodes
- sample(barcode)
Return sample (=cell type) corresponding to barcode
- samples()
Return a list of samples
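As an illustration of the file layout described above, the following sketch reads tab-delimited content with a header line and builds a barcode-to-sample lookup from the ‘Barcode’ and ‘Sample’ columns. It is a sketch under assumptions rather than the class's implementation: the well list content is made up (real files carry more columns) and read_well_list is a hypothetical helper.

```python
import csv
import io

# Made-up well list content for illustration only
WELL_LIST = (
    "Row\tCol\tCandidate\tSample\tBarcode\n"
    "0\t4\tTrue\tESC2\tAACCTTCCTTA\n"
    "0\t6\tTrue\td1.2\tAACGAACGCTC\n"
)


def read_well_list(fp):
    """Return a dictionary mapping each barcode to its sample name."""
    # DictReader takes the field names from the header line
    reader = csv.DictReader(fp, delimiter="\t")
    return {row["Barcode"]: row["Sample"] for row in reader}


samples_by_barcode = read_well_list(io.StringIO(WELL_LIST))
barcodes = list(samples_by_barcode)
samples = sorted(set(samples_by_barcode.values()))
```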
- auto_process_ngs.icell8.utils.batch_fastqs(fastqs, batch_size, basename='batched', out_dir=None)
Splits reads from one or more Fastqs into batches
Concatenates input Fastq files and then splits reads into smaller Fastqs using the external ‘batch’ utility.
- Parameters:
fastqs (list) – list of paths to one or more Fastq files to take reads from
batch_size (int) – number of reads to allocate to each batch
basename (str) – optional basename to use for the output Fastq files (default: ‘batched’)
out_dir (str) – optional path to a directory where the batched Fastqs will be written
- auto_process_ngs.icell8.utils.get_bases_mask_icell8(bases_mask, sample_sheet=None)
Reset the supplied bases mask string so that only the bases containing the inline barcode and UMIs are kept, and any remaining bases are ignored.
If a sample sheet is also supplied then an additional update will be made to ensure that the bases mask respects the barcode lengths given there.
- Parameters:
bases_mask (str) – initial bases mask string to update
sample_sheet (str) – path to optional sample sheet
- Returns:
updated bases mask string
- Return type:
String
- auto_process_ngs.icell8.utils.get_bases_mask_icell8_atac(runinfo_xml)
Acquire a bases mask for ICELL8 scATAC-seq
Generates an initial bases mask based on the run contents, and then updates this so that only the first 8 bases of each of the index reads are used.
- Parameters:
runinfo_xml (str) – path to the RunInfo.xml for the sequencing run
- Returns:
ICELL8 scATAC-seq bases mask string
- Return type:
String
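The index-read truncation described above can be sketched as a string operation on a comma-separated bases mask. This is an illustration only: truncate_index_reads is a hypothetical helper, it starts from a mask string rather than a RunInfo.xml file, and it assumes index components of the simple form "I" followed by a length (for example "I10").

```python
import re


def truncate_index_reads(bases_mask, length=8):
    """Keep only the first 'length' bases of each index read in the mask."""
    parts = []
    for part in bases_mask.split(","):
        # Only handles simple index components such as 'I10'
        match = re.fullmatch(r"[Ii](\d+)", part)
        if match and int(match.group(1)) > length:
            ignored = int(match.group(1)) - length
            # Keep 'length' index bases and ignore the remainder
            part = "I%dn%d" % (length, ignored)
        parts.append(part)
    return ",".join(parts)


print(truncate_index_reads("y51,I10,I10,y51"))  # prints: y51,I8n2,I8n2,y51
```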
- auto_process_ngs.icell8.utils.get_batch_size(fastqs, min_batches=1, max_batch_size=100000000, incr_function=None)
Determine number of reads per batch
Given a maximum batch size (i.e. number of reads per batch), determine the number of batches and actual batch size.
- Parameters:
fastqs (list) – list of paths to one or more Fastq files to take reads from
min_batches (int) – initial minimum number of batches to try
max_batch_size (int) – the maximum batch size
incr_function (Function) – optional function to use to generate new number of batches to try
- Returns:
tuple of (batch_size,nbatches).
- Return type:
Tuple
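One way to picture the search for a batch size is the sketch below. It is not the function's actual implementation: batch_size_for is a hypothetical helper that takes the total read count directly instead of a list of Fastq files, and doubling the number of batches when no incr_function is given is an assumption.

```python
def batch_size_for(nreads, min_batches=1, max_batch_size=100000000,
                   incr_function=None):
    """Return (batch_size, nbatches) for a given total number of reads."""
    if incr_function is None:
        # Assumed default: double the number of batches on each attempt
        incr_function = lambda n: n * 2
    nbatches = min_batches
    while True:
        # Ceiling division so that every read is assigned to a batch
        batch_size = -(-nreads // nbatches)
        if batch_size <= max_batch_size:
            return (batch_size, nbatches)
        # incr_function must return a larger number or the loop never ends
        nbatches = incr_function(nbatches)


print(batch_size_for(250, max_batch_size=100))  # prints: (63, 4)
print(batch_size_for(250, max_batch_size=100,
                     incr_function=lambda n: n + 1))  # prints: (84, 3)
```

A custom incr_function that steps by one finds a smaller number of batches at the cost of more iterations.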
- auto_process_ngs.icell8.utils.normalize_sample_name(s)
Clean up sample name from well list file
Replaces ‘illegal’ characters in the supplied name with underscore characters and returns the normalized name.
- Parameters:
s (str) – sample name from well list file
- Returns:
normalized sample name
- Return type:
String
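The kind of substitution involved can be sketched with a regular expression. Which characters count as ‘illegal’ is not stated here, so the sketch assumes that anything other than letters, digits, underscores and hyphens is replaced; normalize_name is a hypothetical helper, not the function documented above.

```python
import re


def normalize_name(s):
    """Replace assumed-illegal characters in a name with underscores."""
    # Assumption: only letters, digits, '_' and '-' are considered legal
    return re.sub(r"[^A-Za-z0-9_-]", "_", str(s))


print(normalize_name("d1.2"))  # prints: d1_2
```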
- auto_process_ngs.icell8.utils.pass_quality_filter(s, cutoff)
Check if sequence passes quality filter cutoff
- Parameters:
s (str) – sequence quality scores (PHRED+33)
cutoff (int) – minimum quality value; all quality scores must be equal to or greater than this value for the filter to pass
- Returns:
True if the quality scores pass the filter cutoff, False if not.
- Return type:
Boolean
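A check of this kind can be sketched in a single expression. The helper passes_cutoff below is a hypothetical stand-in written from the parameter descriptions above (PHRED+33 characters, every score at or above the cutoff) and may differ from the actual implementation.

```python
def passes_cutoff(quality, cutoff):
    """Return True if every PHRED+33 score in 'quality' is >= cutoff."""
    # ord(c) - 33 decodes a PHRED+33 character to its integer score
    return all(ord(c) - 33 >= cutoff for c in quality)


print(passes_cutoff("AAA/", 14))  # prints: True
print(passes_cutoff("AAA/", 15))  # prints: False
```

The character ‘/’ decodes to 14, so the made-up quality string passes a cutoff of 14 but fails a cutoff of 15.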