auto_process_ngs.bcl2fastq.utils

bcl2fastq/utils.py

Utility functions for bcl to fastq conversion operations:

  • get_sequencer_platform: get sequencing instrument platform

  • get_run_config: report configuration of reads in the run

  • available_bcl2fastq_versions: list available bcl2fastq converters

  • bcl_to_fastq_info: retrieve information on the bcl2fastq software

  • bclconvert_info: retrieve information on the BCL Convert software

  • make_custom_sample_sheet: create a corrected copy of a sample sheet file

  • get_required_samplesheet_format: fetch format required by bcl2fastq version

  • get_bases_mask: get a bases mask string

  • bases_mask_is_valid: check if bases mask string is valid

  • get_nmismatches: determine number of mismatches from bases mask

  • convert_bases_mask_to_override_cycles: convert bases mask for BCL Convert

  • check_barcode_collisions: look for too-similar pairs of barcode sequences

auto_process_ngs.bcl2fastq.utils.available_bcl2fastq_versions(reqs=None, paths=None)

List available bcl2fastq converters

By default searches the PATH for likely bcl2fastq converters and returns a list of executables with the full path.

The ‘reqs’ argument allows a specific version or range of versions to be requested; in this case the returned list will only contain those packages which satisfy the requested versions.

A range of version specifications can be requested by separating multiple specifiers with a comma - for example ‘>1.8.3,<2.16’.

The full set of operators is:

  • ==, >, >=, <=, <

If no versions are requested then the packages will be returned in PATH order; otherwise they will be returned in version order (highest to lowest).

Parameters:
  • reqs (str) – optional version requirement expression (for example ‘>=1.8.4’). If supplied then only executables fulfilling the requirement will be returned. If no operator is supplied then ‘==’ is implied.

  • paths (list) – optional set of directory paths to search when looking for bcl2fastq software. If not supplied then the set of paths specified in the PATH environment variable will be searched.

Returns:

full paths to bcl2fastq converter executables.

Return type:

List
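The version filtering described above can be illustrated with a small standalone sketch. This is not the library's implementation: the helper names (parse_version, parse_requirement, satisfies, filter_versions) are hypothetical, and it assumes purely numeric dotted version strings compared as tuples of integers.

```python
import operator

# Hypothetical helpers for illustration; not part of auto_process_ngs
OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

def parse_version(version):
    """Convert a dotted version string such as '2.17.1.14' to a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))

def parse_requirement(spec):
    """Split one specifier such as '>=1.8.4' into (comparison, version tuple)."""
    spec = spec.strip()
    # Check two-character operators first so '>=' is not read as '>'
    for symbol in ("==", ">=", "<=", ">", "<"):
        if spec.startswith(symbol):
            return OPERATORS[symbol], parse_version(spec[len(symbol):])
    # No operator supplied: '==' is implied
    return operator.eq, parse_version(spec)

def satisfies(version, reqs):
    """Return True if 'version' meets every comma-separated specifier in 'reqs'."""
    candidate = parse_version(version)
    return all(
        compare(candidate, required)
        for compare, required in map(parse_requirement, reqs.split(","))
    )

def filter_versions(versions, reqs=None):
    """Keep input order if no requirement is given, else sort highest to lowest."""
    if reqs is None:
        return list(versions)
    matching = [v for v in versions if satisfies(v, reqs)]
    return sorted(matching, key=parse_version, reverse=True)

print(filter_versions(["1.8.4", "2.17.1.14", "1.8.3"], ">1.8.3,<2.16"))
```

Note that in this sketch ‘==2.17’ would not match ‘2.17.1.14’, because the tuples are compared exactly; the real function may treat such cases differently.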

auto_process_ngs.bcl2fastq.utils.bases_mask_is_valid(bases_mask)

Check if a bases mask is valid

Parameters:

bases_mask – bases mask string to check

Returns:

True if the supplied bases mask is valid, False if not.

Return type:

Boolean

auto_process_ngs.bcl2fastq.utils.bcl_to_fastq_info(path=None)

Retrieve information on the bcl2fastq software

If called without any arguments this will locate the first bcl-to-fastq conversion package executable (either ‘configureBclToFastq.pl’ or ‘bcl2fastq’) that is available on the user’s PATH (as returned by ‘available_bcl2fastq_versions’) and attempt to guess the package name (either bcl2fastq or CASAVA) and the version that it belongs to.

Alternatively if the path to an executable is supplied then the package name and version will be determined from that instead.

If no package is identified then the script path is still returned, but without any version info.

Returns:

tuple consisting of (PATH,PACKAGE,VERSION) where PATH is the full path for the bcl2fastq program or configureBclToFastq.pl script, and PACKAGE and VERSION are guesses for the package/version that it belongs to. If any value can’t be determined then it will be returned as an empty string.

Return type:

Tuple
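As a rough illustration of the search step only, the sketch below looks for the two executable names mentioned above in a list of directories, in order. The function name find_converters is hypothetical, and the package and version detection done by the real function is not reproduced here.

```python
import os

# Executable names taken from the description above
CONVERTER_NAMES = ("bcl2fastq", "configureBclToFastq.pl")

def find_converters(paths=None):
    """Return full paths of converter executables found in 'paths'.

    Hypothetical helper: if 'paths' is None, the directories listed in
    the PATH environment variable are searched in order.
    """
    if paths is None:
        paths = os.environ.get("PATH", "").split(os.pathsep)
    found = []
    for dirn in paths:
        for name in CONVERTER_NAMES:
            exe = os.path.join(dirn, name)
            # Only accept regular files that are executable
            if os.path.isfile(exe) and os.access(exe, os.X_OK):
                found.append(os.path.abspath(exe))
    return found

converters = find_converters()
# Mirrors the "first available executable" behaviour described above
first = converters[0] if converters else None
print(first)
```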

auto_process_ngs.bcl2fastq.utils.bclconvert_info(path=None)

Retrieve information on the bcl-convert software

If called without any arguments this will locate the first bcl-convert executable that is available on the user’s PATH.

Alternatively if the path to an executable is supplied then the package name and version will be determined from that instead.

If no package is identified then the script path is still returned, but without any version info.

Returns:

tuple consisting of (PATH,PACKAGE,VERSION) where PATH is the full path for the bcl-convert program, and PACKAGE and VERSION are the package/version that it belongs to (PACKAGE will be ‘BCL Convert’ if a matching executable is located). If any value can’t be determined then it will be returned as an empty string.

Return type:

Tuple

auto_process_ngs.bcl2fastq.utils.check_barcode_collisions(sample_sheet_file, nmismatches, use_index='all')

Check sample sheet for barcode collisions

Check barcode index sequences within each lane (or across all samples, if no lane information is present) and find any pairs which differ in too few bases. Two barcodes collide if the number of bases in which they differ is:

less than 2 times the number of mismatches plus 1

(as is stated in the output from bcl2fastq v2).

Pairs of barcodes which are too similar (i.e. which collide) are reported as a list of tuples, e.g.

[('ATTCCT','ATTCCG'),…]

Parameters:
  • sample_sheet_file (str) – path to a SampleSheet.csv file to analyse for barcode collisions

  • nmismatches (int) – maximum number of mismatches to allow

  • use_index (str) – flag indicating how to treat index sequences: ‘all’ (the default) combines indexes into a single sequence before checking for collisions, ‘1’ only checks index 1 (i7), and ‘2’ only checks index 2 (i5)

Returns:

list of pairs of colliding barcodes (with each pair wrapped in a tuple), or an empty list if no collisions were detected.

Return type:

List
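The collision rule can be sketched as a pairwise Hamming distance check against the threshold 2 × nmismatches + 1. This is a simplified illustration, not the library code: find_collisions and hamming_distance are hypothetical names, the barcodes are passed in directly rather than read from a sample sheet, and all barcodes are assumed to have the same length and to belong to one lane.

```python
from itertools import combinations

def hamming_distance(seq1, seq2):
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(seq1, seq2) if a != b)

def find_collisions(barcodes, nmismatches):
    """Return pairs of barcodes differing in fewer than 2*nmismatches + 1 bases."""
    threshold = 2 * nmismatches + 1
    return [
        (a, b)
        for a, b in combinations(barcodes, 2)
        if hamming_distance(a, b) < threshold
    ]

barcodes = ["ATTCCT", "ATTCCG", "GGGAAA"]
print(find_collisions(barcodes, nmismatches=1))  # [('ATTCCT', 'ATTCCG')]
print(find_collisions(barcodes, nmismatches=0))  # []
```

With one allowed mismatch the threshold is 3, so the first two barcodes (which differ in a single base) collide; with zero mismatches the threshold is 1 and only identical barcodes would collide.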

auto_process_ngs.bcl2fastq.utils.convert_bases_mask_to_override_cycles(bases_mask)

Converts bcl2fastq-format bases mask to BCL Convert format

Given a bases mask string (e.g. ‘y76,I8,I8,y76’), returns the equivalent BCL Convert format for use with ‘OverrideCycles’ in a sample sheet (e.g. ‘Y76;I8;I8;Y76’).

Parameters:

bases_mask (str) – bcl2fastq bases mask string

Returns:

the original bases mask converted to BCL Convert format

Return type:

String
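For the documented example, the conversion amounts to upper-casing each read and joining the reads with semicolons instead of commas. The sketch below covers only that case; to_override_cycles is a hypothetical name, and the real function may handle additional mask forms (for example masks without explicit cycle counts) that are not attempted here.

```python
def to_override_cycles(bases_mask):
    """Convert a mask like 'y76,I8,I8,y76' to 'Y76;I8;I8;Y76'.

    Simplified illustration: upper-case each read and switch the
    separator from ',' to ';'.
    """
    return ";".join(read.upper() for read in bases_mask.split(","))

print(to_override_cycles("y76,I8,I8,y76"))  # Y76;I8;I8;Y76
```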

auto_process_ngs.bcl2fastq.utils.get_bases_mask(run_info_xml, sample_sheet_file=None)

Get bases mask string

Generates initial bases mask based on data in RunInfo.xml (which says how many reads there are, how many cycles in each read, and which are index reads), and optionally updates this using the barcode information in the sample sheet file.

Parameters:
  • run_info_xml – name and path of RunInfo.xml file from the sequencing run

  • sample_sheet_file – (optional) path to sample sheet file

Returns:

Bases mask string e.g. ‘y101,I6’.
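A minimal sketch of the first step (building an initial mask from the reads listed in RunInfo.xml) is shown below. It assumes the usual Illumina layout, with Read elements carrying NumCycles and IsIndexedRead attributes; the function name initial_bases_mask and the inline XML are illustrative only, and the optional adjustment from the sample sheet is not covered.

```python
import xml.etree.ElementTree as ET

# Cut-down RunInfo.xml content, for illustration only
EXAMPLE_RUN_INFO = """<RunInfo>
  <Run Id="EXAMPLE_RUN" Number="1">
    <Reads>
      <Read Number="1" NumCycles="101" IsIndexedRead="N" />
      <Read Number="2" NumCycles="6" IsIndexedRead="Y" />
    </Reads>
  </Run>
</RunInfo>
"""

def initial_bases_mask(run_info_text):
    """Build a bases mask with 'y' for data reads and 'I' for index reads."""
    root = ET.fromstring(run_info_text)
    parts = []
    for read in root.iter("Read"):
        code = "I" if read.get("IsIndexedRead") == "Y" else "y"
        parts.append(f"{code}{read.get('NumCycles')}")
    return ",".join(parts)

print(initial_bases_mask(EXAMPLE_RUN_INFO))  # y101,I6
```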

auto_process_ngs.bcl2fastq.utils.get_nmismatches(bases_mask, multi_index=False)

Determine number of mismatches from bases mask

Automatically determines the maximum number of mismatches that should be allowed for a bcl to fastq conversion run, based on the tag length i.e. the length of the index barcode sequences.

Tag lengths of 6 or more use 1 mismatch, otherwise use zero mismatches.

The number of mismatches should be supplied to the bclToFastq conversion process.

Raises an exception if the supplied bases mask is not valid.

Parameters:
  • bases_mask – bases mask string of the form e.g. ‘y101,I6,y101’

  • multi_index – boolean flag, if False (default) then use the total length of all indices and return a single integer number of allowed mismatches; if True then return a list with the number of mismatches for each index (so a dual index will be a pair of allowed mismatches)

Returns:

Integer value of number of mismatches, or a list of integers (one per index read) if ‘multi_index’ is True. (If the bases mask doesn’t contain any index reads then returns zero for single-index mode, or an empty list for multi-index mode.)
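The rule stated above (tag lengths of 6 or more use 1 mismatch, otherwise zero) can be sketched as follows. This is an illustration under stated assumptions rather than the library code: nmismatches_for and index_lengths are hypothetical names, index reads are assumed to start with ‘I’ and carry explicit cycle counts, and the validity check that raises an exception is omitted.

```python
import re

def index_lengths(bases_mask):
    """Return the number of index bases in each index read of the mask."""
    lengths = []
    for read in bases_mask.split(","):
        if read.upper().startswith("I"):
            # Sum every 'I<number>' group, so 'I6n2' counts as 6 bases
            counts = re.findall(r"I(\d+)", read, flags=re.IGNORECASE)
            lengths.append(sum(int(n) for n in counts))
    return lengths

def nmismatches_for(bases_mask, multi_index=False):
    """Allow 1 mismatch for tag lengths of 6 or more, otherwise 0."""
    lengths = index_lengths(bases_mask)
    if multi_index:
        # One value per index read
        return [1 if n >= 6 else 0 for n in lengths]
    # Single value based on the total length of all indices
    return 1 if sum(lengths) >= 6 else 0

print(nmismatches_for("y101,I6,y101"))                     # 1
print(nmismatches_for("y76,I8,I8,y76", multi_index=True))  # [1, 1]
```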

auto_process_ngs.bcl2fastq.utils.get_required_samplesheet_format(bcl2fastq_version)

Returns sample sheet format required by bcl2fastq

Given a bcl2fastq version, returns the format of the sample sheet that is required for that version.

Parameters:

bcl2fastq_version (str) – version of bcl2fastq

Returns:

Sample sheet format (e.g. ‘CASAVA’, ‘IEM’ etc).

Return type:

String

auto_process_ngs.bcl2fastq.utils.get_run_config(run_info_xml)

Get string describing run configuration

Generates a run configuration string based on data in RunInfo.xml (which says how many reads there are, how many cycles in each read, and which are index reads).

An example run configuration string might look like ‘R1:59bp,I1:10bp,I2:10bp,R2:59bp’.

Parameters:

run_info_xml – name and path of RunInfo.xml file from the sequencing run

Returns:

run configuration string.

Return type:

String
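The example string above can be reproduced with a short sketch that numbers data reads (R1, R2, …) and index reads (I1, I2, …) separately in the order they appear. It assumes Read elements with NumCycles and IsIndexedRead attributes; the function name run_config_string and the inline XML are illustrative only.

```python
import xml.etree.ElementTree as ET

# Cut-down RunInfo.xml content, for illustration only
EXAMPLE_RUN_INFO = """<RunInfo>
  <Run Id="EXAMPLE_RUN" Number="1">
    <Reads>
      <Read Number="1" NumCycles="59" IsIndexedRead="N" />
      <Read Number="2" NumCycles="10" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="10" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="59" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>
"""

def run_config_string(run_info_text):
    """Describe each read as 'R<n>:<cycles>bp' or 'I<n>:<cycles>bp'."""
    root = ET.fromstring(run_info_text)
    n_data = 0
    n_index = 0
    parts = []
    for read in root.iter("Read"):
        if read.get("IsIndexedRead") == "Y":
            n_index += 1
            label = f"I{n_index}"
        else:
            n_data += 1
            label = f"R{n_data}"
        parts.append(f"{label}:{read.get('NumCycles')}bp")
    return ",".join(parts)

print(run_config_string(EXAMPLE_RUN_INFO))  # R1:59bp,I1:10bp,I2:10bp,R2:59bp
```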

auto_process_ngs.bcl2fastq.utils.get_sequencer_platform(dirn, instrument=None, settings=None)

Return the platform for the sequencing instrument

Attempts to identify the platform (e.g. ‘hiseq’, ‘miseq’ etc) for a sequencing run.

If ‘settings’ is supplied then the platform is looked up based on the instrument names and platforms listed in the ‘sequencers’ section of the configuration. If ‘instrument’ is also supplied then this is used; otherwise the instrument name is extracted from the supplied directory name.

If no match can be found then there is a final attempt to determine the platform from the hard-coded names in the ‘bcftbx.platforms’ module.

Parameters:
  • dirn (str) – path to the data or analysis directory

  • instrument (str) – (optional) the instrument name

  • settings (Settings) – (optional) a Settings instance with the configuration loaded

Returns:

either the platform or None, if the platform cannot be determined.

Return type:

String
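The lookup order described above can be sketched roughly as below. Several things here are assumptions made for illustration: the directory name is taken to follow the common Illumina pattern of date, instrument, run number and flow cell separated by underscores; a plain dictionary stands in for the ‘sequencers’ section of the configuration; and the names lookup_platform and instrument_from_dirname are hypothetical. The final fallback to ‘bcftbx.platforms’ is not reproduced.

```python
import os

def instrument_from_dirname(dirn):
    """Take the second underscore-separated field of the directory name."""
    name = os.path.basename(os.path.normpath(dirn))
    fields = name.split("_")
    return fields[1] if len(fields) >= 2 else None

def lookup_platform(dirn, sequencers, instrument=None):
    """Return the platform for the instrument, or None if it is not listed."""
    if instrument is None:
        instrument = instrument_from_dirname(dirn)
    return sequencers.get(instrument)

# Hypothetical mapping of instrument names to platforms
sequencers = {"M00879": "miseq"}
print(lookup_platform("/data/170901_M00879_0087_000000000-AGEW9", sequencers))
```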

auto_process_ngs.bcl2fastq.utils.make_custom_sample_sheet(input_sample_sheet, output_sample_sheet=None, lanes=None, adapter=None, adapter_read2=None, fmt=None)

Creates a corrected copy of a sample sheet file

Creates and returns a SampleSheet object with a copy of the input sample sheet, with any illegal or duplicated names fixed. Optionally it can also write the updated sample sheet data to a new file, switch the format, and include only a subset of lanes from the original file.

Parameters:
  • input_sample_sheet (str) – name and path of the original sample sheet file

  • output_sample_sheet (str) – (optional) name and path to write updated sample sheet to, or None

  • lanes (list) – (optional) list of lane numbers to keep in the output sample sheet; if None then all lanes will be kept (the default), otherwise lanes will be dropped if they don’t appear in the supplied list

  • adapter (str) – (optional) if set then write to the Adapter setting

  • adapter_read2 (str) – (optional) if set then write to the AdapterRead2 setting

  • fmt (str) – (optional) format for the output sample sheet, either ‘CASAVA’ or ‘IEM’; if this is None then the format of the original file will be used

Returns:

SampleSheet object with the data for the corrected sample sheet.