auto_process_ngs.bcl2fastq.utils
bcl2fastq/utils.py
Utility functions for bcl to fastq conversion operations:
get_sequencer_platform: get sequencing instrument platform
get_run_config: report configuration of reads in the run
available_bcl2fastq_versions: list available bcl2fastq converters
bcl_to_fastq_info: retrieve information on the bcl2fastq software
bclconvert_info: retrieve information on the BCL Convert software
make_custom_sample_sheet: create a corrected copy of a sample sheet file
get_required_samplesheet_format: fetch format required by bcl2fastq version
get_bases_mask: get a bases mask string
bases_mask_is_valid: check if bases mask string is valid
get_nmismatches: determine number of mismatches from bases mask
convert_bases_mask_to_override_cycles: convert bases mask for BCL convert
check_barcode_collisions: look for too-similar pairs of barcode sequences
- auto_process_ngs.bcl2fastq.utils.available_bcl2fastq_versions(reqs=None, paths=None)
List available bcl2fastq converters
By default searches the PATH for likely bcl2fastq converters and returns a list of executables with the full path.
The ‘reqs’ argument allows a specific version or range of versions to be requested; in this case the returned list will only contain those packages which satisfy the requested versions.
A range of version specifications can be requested by separating multiple specifiers with a comma - for example ‘>1.8.3,<2.16’.
The full set of operators is:
==, >, >=, <=, <
If no versions are requested then the packages will be returned in PATH order; otherwise they will be returned in version order (highest to lowest).
- Parameters:
reqs (str) – optional version requirement expression (for example ‘>=1.8.4’). If supplied then only executables fulfilling the requirement will be returned. If no operator is supplied then ‘==’ is implied.
paths (list) – optional set of directory paths to search when looking for bcl2fastq software. If not supplied then the set of paths specified in the PATH environment variable will be searched.
- Returns:
full paths to bcl2fastq converter executables.
- Return type:
List
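The version matching described above can be sketched as follows. This is a minimal illustration only, not the module's implementation: the names `parse_version` and `version_satisfies` are hypothetical, and the sketch assumes that versions are purely numeric dotted strings.

```python
import operator

# Comparison operators listed in the documentation above
OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(version):
    """Turn e.g. '2.17.1.14' into a tuple of integers for comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def version_satisfies(version, reqs):
    """Check a version against a comma-separated requirement expression."""
    for spec in reqs.split(","):
        spec = spec.strip()
        # Try two-character operators before single-character ones
        for symbol in ("==", ">=", "<=", ">", "<"):
            if spec.startswith(symbol):
                required = spec[len(symbol):]
                break
        else:
            # No operator supplied, so '==' is implied
            symbol, required = "==", spec
        compare = OPERATORS[symbol]
        if not compare(parse_version(version), parse_version(required)):
            return False
    return True


# Invented version strings, ordered highest to lowest
versions = ["1.8.4", "2.20.0.422", "2.17.1.14"]
ordered = sorted(versions, key=parse_version, reverse=True)
```

Comparing integer tuples rather than raw strings means that '2.20' is correctly treated as newer than '2.16'.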
- auto_process_ngs.bcl2fastq.utils.bases_mask_is_valid(bases_mask)
Check if a bases mask is valid
- Parameters:
bases_mask – bases mask string to check
- Returns:
True if the supplied bases mask is valid, False if not.
- Return type:
Boolean
- auto_process_ngs.bcl2fastq.utils.bcl_to_fastq_info(path=None)
Retrieve information on the bcl2fastq software
If called without any arguments this will locate the first bcl-to-fastq conversion package executable (either ‘configureBclToFastq.pl’ or ‘bcl2fastq’) that is available on the user’s PATH (as returned by ‘available_bcl2fastq_versions’), and attempt to guess the package name (either bcl2fastq or CASAVA) and the version that it belongs to.
Alternatively if the path to an executable is supplied then the package name and version will be determined from that instead.
If no package is identified then the script path is still returned, but without any version info.
- Returns:
tuple consisting of (PATH,PACKAGE,VERSION) where PATH is the full path for the bcl2fastq program or configureBclToFastq.pl script, and PACKAGE and VERSION are guesses for the package/version that it belongs to. If any value can’t be determined then it will be returned as an empty string.
- Return type:
Tuple
- auto_process_ngs.bcl2fastq.utils.bclconvert_info(path=None)
Retrieve information on the bcl-convert software
If called without any arguments this will locate the first bcl-convert executable that is available on the user’s PATH.
Alternatively if the path to an executable is supplied then the package name and version will be determined from that instead.
If no package is identified then the script path is still returned, but without any version info.
- Returns:
tuple consisting of (PATH,PACKAGE,VERSION) where PATH is the full path for the bcl-convert program, and PACKAGE and VERSION are the package/version that it belongs to (PACKAGE will be ‘BCL Convert’ if a matching executable is located). If any value can’t be determined then it will be returned as an empty string.
- Return type:
Tuple
- auto_process_ngs.bcl2fastq.utils.check_barcode_collisions(sample_sheet_file, nmismatches, use_index='all')
Check sample sheet for barcode collisions
Check barcode index sequences within each lane (or across all samples, if no lane information is present) and find any pairs which differ in fewer bases than a threshold, calculated as 2 times the number of mismatches plus 1 (as stated in the output from bcl2fastq v2).
Pairs of barcodes which are too similar (i.e. which collide) are reported as a list of tuples, e.g.
[('ATTCCT','ATTCCG'),…]
- Parameters:
sample_sheet_file (str) – path to a SampleSheet.csv file to analyse for barcode collisions
nmismatches (int) – maximum number of mismatches to allow
use_index (str) – flag indicating how to treat index sequences: ‘all’ (the default) combines indexes into a single sequence before checking for collisions, ‘1’ only checks index 1 (i7), and ‘2’ only checks index 2 (i5)
- Returns:
list of pairs of colliding barcodes (with each pair wrapped in a tuple), or an empty list if no collisions were detected.
- Return type:
List
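The collision rule described above (two barcodes collide when they differ in fewer than 2 × nmismatches + 1 positions) can be sketched as below. This is an illustration only: `hamming_distance` and `find_collisions` are hypothetical names, the barcodes are passed in directly rather than read from a sample sheet, and all barcodes are assumed to have the same length.

```python
from itertools import combinations


def hamming_distance(seq1, seq2):
    """Count positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


def find_collisions(barcodes, nmismatches):
    """Return pairs of barcodes that are too similar to tell apart."""
    # Pairs must differ in at least this many bases to be distinguishable
    threshold = 2 * nmismatches + 1
    return [
        (a, b)
        for a, b in combinations(barcodes, 2)
        if hamming_distance(a, b) < threshold
    ]


# Invented barcodes: the first two differ at a single position
barcodes = ["ATTCCT", "ATTCCG", "GGAATC"]
collisions = find_collisions(barcodes, nmismatches=1)
```

With one allowed mismatch the threshold is 3, so the first two barcodes collide; with zero mismatches the threshold drops to 1 and no pair is reported.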
- auto_process_ngs.bcl2fastq.utils.convert_bases_mask_to_override_cycles(bases_mask)
Converts bcl2fastq-format bases mask to BCL Convert format
Given a bases mask string (e.g. ‘y76,I8,I8,y76’), returns the equivalent BCL Convert format for use with ‘OverrideCycles’ in a sample sheet (e.g. ‘Y76;I8;I8;Y76’).
- Parameters:
bases_mask (str) – bcl2fastq bases mask string
- Returns:
the original bases mask converted to BCL Convert format
- Return type:
String
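For the example given above, the conversion amounts to upper-casing each read and joining the reads with semicolons. The sketch below reproduces only that documented example; `to_override_cycles` is a hypothetical name, and the sketch assumes that every character in the mask carries an explicit cycle count (no wildcards or bare characters).

```python
def to_override_cycles(bases_mask):
    """Convert e.g. 'y76,I8,I8,y76' to 'Y76;I8;I8;Y76'."""
    # Reads are comma-separated in a bases mask and
    # semicolon-separated in an OverrideCycles setting
    reads = bases_mask.split(",")
    return ";".join(read.upper() for read in reads)


override_cycles = to_override_cycles("y76,I8,I8,y76")
```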
- auto_process_ngs.bcl2fastq.utils.get_bases_mask(run_info_xml, sample_sheet_file=None)
Get bases mask string
Generates initial bases mask based on data in RunInfo.xml (which says how many reads there are, how many cycles in each read, and which are index reads), and optionally updates this using the barcode information in the sample sheet file.
- Parameters:
run_info_xml – name and path of RunInfo.xml file from the sequencing run
sample_sheet_file – (optional) path to sample sheet file
- Returns:
Bases mask string e.g. ‘y101,I6’.
- auto_process_ngs.bcl2fastq.utils.get_nmismatches(bases_mask, multi_index=False)
Determine number of mismatches from bases mask
Automatically determines the maximum number of mismatches that should be allowed for a bcl to fastq conversion run, based on the tag length i.e. the length of the index barcode sequences.
Tag lengths of 6 or more use 1 mismatch, otherwise use zero mismatches.
The number of mismatches should be supplied to the bclToFastq conversion process.
Raises an exception if the supplied bases mask is not valid.
- Parameters:
bases_mask – bases mask string of the form e.g. ‘y101,I6,y101’
multi_index – boolean flag, if False (default) then use the total length of all indices and return a single integer number of allowed mismatches; if True then return a list with the number of mismatches for each index (so a dual index will be a pair of allowed mismatches)
- Returns:
Integer number of mismatches, or a list of integers if multi_index is True. (If the bases mask doesn’t contain any index reads then returns zero for single-index mode, or an empty list for multi-index mode.)
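The rule above (index lengths of 6 or more allow 1 mismatch, shorter ones allow none) can be sketched as follows. The names `index_lengths` and `allowed_mismatches` are hypothetical, and the sketch assumes that index reads start with 'I' and that every 'I' carries an explicit cycle count; it does not validate the mask.

```python
import re


def index_lengths(bases_mask):
    """Return the number of index cycles in each index read of the mask."""
    lengths = []
    for read in bases_mask.split(","):
        if not read.startswith("I"):
            continue  # Not an index read (e.g. 'y101')
        # Sum the counts attached to each 'I' in this read
        lengths.append(sum(int(n) for n in re.findall(r"I(\d+)", read)))
    return lengths


def allowed_mismatches(bases_mask, multi_index=False):
    """Apply the '6 or more bases allows 1 mismatch' rule."""
    lengths = index_lengths(bases_mask)
    if multi_index:
        # One value per index read
        return [1 if n >= 6 else 0 for n in lengths]
    # Single value based on the total length of all indices
    return 1 if sum(lengths) >= 6 else 0


single = allowed_mismatches("y101,I6,y101")
dual = allowed_mismatches("y76,I8,I8,y76", multi_index=True)
```

A mask with no index reads gives zero in single-index mode and an empty list in multi-index mode, matching the behaviour described above.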
- auto_process_ngs.bcl2fastq.utils.get_required_samplesheet_format(bcl2fastq_version)
Returns sample sheet format required by bcl2fastq
Given a bcl2fastq version, returns the format of the sample sheet that is required for that version.
- Parameters:
bcl2fastq_version (str) – version of bcl2fastq
- Returns:
Sample sheet format (e.g. ‘CASAVA’, ‘IEM’ etc).
- Return type:
String
- auto_process_ngs.bcl2fastq.utils.get_run_config(run_info_xml)
Get string describing run configuration
Generates a run configuration string based on data in RunInfo.xml (which says how many reads there are, how many cycles in each read, and which are index reads).
An example run configuration string might look like ‘R1:59bp,I1:10bp,I2:10bp,R2:59bp’.
- Parameters:
run_info_xml – name and path of RunInfo.xml file from the sequencing run
- Returns:
run configuration string.
- Return type:
String
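As an illustration of how such a string can be derived, the sketch below reads the `Read` elements of a RunInfo.xml document (each of which has `Number`, `NumCycles` and `IsIndexedRead` attributes) and labels them in order. The function name `run_config_from_xml` and the XML fragment are invented for the example; this is not the module's own implementation.

```python
import xml.etree.ElementTree as ET

# Minimal invented RunInfo.xml content for illustration
EXAMPLE_RUN_INFO = """
<RunInfo>
  <Run Id="example_run" Number="1">
    <Reads>
      <Read Number="1" NumCycles="59" IsIndexedRead="N" />
      <Read Number="2" NumCycles="10" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="10" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="59" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>
"""


def run_config_from_xml(xml_text):
    """Build a string such as 'R1:59bp,I1:10bp,I2:10bp,R2:59bp'."""
    root = ET.fromstring(xml_text)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    n_data = n_index = 0
    parts = []
    for read in reads:
        cycles = read.get("NumCycles")
        if read.get("IsIndexedRead") == "Y":
            # Index reads are numbered separately from data reads
            n_index += 1
            parts.append(f"I{n_index}:{cycles}bp")
        else:
            n_data += 1
            parts.append(f"R{n_data}:{cycles}bp")
    return ",".join(parts)


run_config = run_config_from_xml(EXAMPLE_RUN_INFO)
```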
- auto_process_ngs.bcl2fastq.utils.get_sequencer_platform(dirn, instrument=None, settings=None)
Return the platform for the sequencing instrument
Attempts to identify the platform (e.g. ‘hiseq’, ‘miseq’ etc) for a sequencing run.
If ‘settings’ is supplied then the platform is looked up based on the instrument names and platforms listed in the ‘sequencers’ section of the configuration. If ‘instrument’ is also supplied then this is used; otherwise the instrument name is extracted from the supplied directory name.
If no match can be found then there is a final attempt to determine the platform from the hard-coded names in the ‘bcftbx.platforms’ module.
- Parameters:
dirn (str) – path to the data or analysis directory
instrument (str) – (optional) the instrument name
settings (Settings) – (optional) a Settings instance with the configuration loaded
- Returns:
either the platform or None, if the platform cannot be determined.
- Return type:
String
- auto_process_ngs.bcl2fastq.utils.make_custom_sample_sheet(input_sample_sheet, output_sample_sheet=None, lanes=None, adapter=None, adapter_read2=None, fmt=None)
Creates a corrected copy of a sample sheet file
Creates and returns a SampleSheet object with a copy of the input sample sheet, with any illegal or duplicated names fixed. Optionally it can also write the updated sample sheet data to a new file, switch the format, and include only a subset of lanes from the original file.
- Parameters:
input_sample_sheet (str) – name and path of the original sample sheet file
output_sample_sheet (str) – (optional) name and path to write updated sample sheet to, or None
lanes (list) – (optional) list of lane numbers to keep in the output sample sheet; if None then all lanes will be kept (the default), otherwise lanes will be dropped if they don’t appear in the supplied list
adapter (str) – (optional) if set then write to the Adapter setting
adapter_read2 (str) – (optional) if set then write to the AdapterRead2 setting
fmt (str) – (optional) format for the output sample sheet, either ‘CASAVA’ or ‘IEM’; if this is None then the format of the original file will be used
- Returns:
SampleSheet object with the data for the corrected sample sheet.