auto_process_ngs.samplesheet_utils

samplesheet_utils.py

Utilities for handling SampleSheet files:

  • SampleSheetLinter: core class which provides methods for checking sample sheet contents for potential problems

  • predict_outputs: generate expected outputs in human-readable form

  • check_and_warn: check sample sheet for problems and issue warnings

Helper functions:

  • has_invalid_characters: check if a file contains invalid characters

  • get_close_names: return closely matching names from a list

  • set_samplesheet_column: update values in sample sheet columns

class auto_process_ngs.samplesheet_utils.SampleSheetLinter(sample_sheet=None, sample_sheet_file=None, fp=None)

Class for checking sample sheets for problems

Provides the following methods for checking different aspects of a sample sheet:

  • close_project_names: check if sample sheet projects look similar

  • samples_with_multiple_barcodes: check for samples with multiple barcodes

  • samples_in_multiple_projects: check for samples assigned to multiple projects

  • has_invalid_lines: check for invalid sample sheet lines

  • has_invalid_characters: check if sample sheet contains invalid characters

Example usage:

Initialise linter:

>>> linter = SampleSheetLinter(sample_sheet_file="SampleSheet.txt")

Get closely-matching names:

>>> linter.close_project_names()
…

close_project_names()

Return closely-matching project names as a dictionary

Returns:

keys are project names which have at least one close match; the values for each key are lists with the project names which are close matches.

Return type:

Dictionary

has_invalid_barcodes()

Return list of lines with invalid barcodes

Returns:

list of lines which contain invalid barcode sequences in the sample sheet.

Return type:

List

has_invalid_characters()

Check if text file contains any ‘invalid’ characters

In this context a character is ‘invalid’ if:

  • it is non-ASCII (decimal code > 127), or

  • it is a non-printing ASCII character (code < 32)

Returns:

True if file contains at least one invalid character, False if all characters are valid.

Return type:

Boolean

has_invalid_lines()

Return list of samplesheet lines which are invalid

Returns:

list of lines which are invalid (i.e. missing required data) in the sample sheet.

Return type:

List

samples_in_multiple_projects()

Return samples which are in multiple projects, as a dictionary

Returns:

dictionary with sample IDs which appear in multiple projects as keys; the associated values are lists with the project names.

Return type:

Dictionary

samples_with_multiple_barcodes()

Return samples which have multiple associated barcodes, as a dictionary

Returns:

keys are sample IDs which have more than one associated barcode; the values for each key are lists of the associated barcodes.

Return type:

Dictionary
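The shape of the returned dictionary can be illustrated with a minimal, self-contained sketch. It does not use the SampleSheetLinter class: the function name `find_samples_with_multiple_barcodes` and the plain list of `(sample_id, barcode)` pairs are assumptions made for illustration only, not part of this module.

```python
from collections import defaultdict


def find_samples_with_multiple_barcodes(rows):
    """Hypothetical helper: group barcodes by sample ID.

    'rows' is an iterable of (sample_id, barcode) pairs; this input
    format is an assumption for the sketch, not the module's own.
    """
    barcodes = defaultdict(list)
    for sample_id, barcode in rows:
        # Record each distinct barcode once per sample
        if barcode not in barcodes[sample_id]:
            barcodes[sample_id].append(barcode)
    # Keep only the samples with more than one distinct barcode
    return {s: b for s, b in barcodes.items() if len(b) > 1}


rows = [("S1", "ATGC"), ("S1", "TTGA"), ("S2", "CCGA"), ("S2", "CCGA")]
result = find_samples_with_multiple_barcodes(rows)
```

Here only "S1" is reported, since "S2" appears twice with the same barcode. The same grouping pattern would apply to samples_in_multiple_projects, with project names in place of barcodes.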

walk()

Traverse the list of projects and samples

Generator that yields tuples consisting of (SampleSheetProject,SampleSheetSample) pairs

Yields:

Tuple – SampleSheetProject, SampleSheetSample pair

auto_process_ngs.samplesheet_utils.barcode_is_10xgenomics(s)

Check if sample sheet barcode is 10xGenomics sample set ID

10xGenomics sample set IDs are of the form e.g. ‘SI-P03-C9’ or ‘SI-GA-B3’.

Parameters:

s (str) – barcode sequence to validate

Returns:

True if barcode is 10xGenomics sample set ID, False if not.

Return type:

Boolean

auto_process_ngs.samplesheet_utils.barcode_is_valid(s)

Check if a sample sheet barcode sequence is valid

Valid barcodes must consist only of the uppercase letters A, T, G or C, in any order.

10xGenomics sample set IDs of the form e.g. ‘SI-P03-C9’ or ‘SI-GA-B3’ are also considered to be valid.

Parameters:

s (str) – barcode sequence to validate

Returns:

True if barcode is valid, False if not.

Return type:

Boolean
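The two rules above can be sketched with regular expressions. This is an illustration under stated assumptions rather than the module's implementation: the name `sketch_barcode_is_valid` is hypothetical, and the pattern used for 10xGenomics sample set IDs is a guess generalised from the two example IDs given above.

```python
import re

# One or more uppercase A, T, G or C characters
ATGC_ONLY = re.compile(r"[ATGC]+")

# Assumed pattern for 10xGenomics sample set IDs, inferred from the
# examples 'SI-P03-C9' and 'SI-GA-B3' (not taken from the module source)
TENX_SAMPLE_SET = re.compile(r"SI-[A-Z0-9]+-[A-Z0-9]+")


def sketch_barcode_is_valid(s):
    """Hypothetical stand-in for barcode_is_valid."""
    # fullmatch requires the whole string to match the pattern
    return bool(ATGC_ONLY.fullmatch(s) or TENX_SAMPLE_SET.fullmatch(s))
```

For example, `sketch_barcode_is_valid("ATGCATGC")` and `sketch_barcode_is_valid("SI-GA-B3")` are both true, while lowercase input such as `"atgc"` or a sequence containing `N` is rejected.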

auto_process_ngs.samplesheet_utils.check_and_warn(sample_sheet=None, sample_sheet_file=None)

Check for sample sheet problems and issue warnings

The following checks are performed:

  • closely matching project names

  • samples with more than one barcode assigned

  • samples associated with more than one project

  • invalid lines

  • invalid characters

  • invalid barcodes

Parameters:
  • sample_sheet (SampleSheet) – if supplied then must be a populated SampleSheet instance (if None then data will be loaded from file specified by sample_sheet_file)

  • sample_sheet_file (str) – if sample_sheet is None then read data from the file specified by this argument

Returns:

True if problems were identified, False otherwise.

Return type:

Boolean

auto_process_ngs.samplesheet_utils.get_close_names(names)

Given a list of names, find pairs which are similar

Returns:

keys are names which have at least one close match; the values for each key are lists with the close matches.

Return type:

Dictionary
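One way to produce a dictionary of this shape is with the standard library's `difflib.get_close_matches`. Whether get_close_names itself uses difflib, and what similarity cutoff it applies, is not stated here, so the sketch below (with the hypothetical name `sketch_get_close_names` and difflib's default cutoff) should be read as an assumption.

```python
import difflib


def sketch_get_close_names(names):
    """Hypothetical stand-in for get_close_names."""
    close = {}
    for name in names:
        # Compare each name against all the other names in the list
        others = [n for n in names if n != name]
        matches = difflib.get_close_matches(name, others)
        if matches:
            close[name] = matches
    return close


names = ["PeterBriggs", "PeteBriggs", "AnneCleaves"]
close = sketch_get_close_names(names)
```

In this example the two similar names are reported as close matches of each other, and "AnneCleaves" does not appear as a key because nothing in the list resembles it.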

auto_process_ngs.samplesheet_utils.has_invalid_characters(filen=None, text=None)

Check if text file contains any ‘invalid’ characters

In this context a character is ‘invalid’ if:

  • it is non-ASCII (decimal code > 127), or

  • it is a non-printing ASCII character (code < 32)

Returns:

True if file contains at least one invalid character, False if all characters are valid.

Return type:

Boolean
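The definition of an ‘invalid’ character can be expressed directly with `ord`. The sketch below works on a string only and uses the hypothetical name `sketch_has_invalid_characters`. It also assumes that newlines and tabs are skipped, since a multi-line, tab-delimited file would otherwise always fail the "code < 32" test; that exemption is an assumption and is not stated in the description above.

```python
def sketch_has_invalid_characters(text):
    """Hypothetical stand-in for has_invalid_characters(text=...)."""
    for c in text:
        # Assumption: line endings and tabs are not treated as invalid
        if c in "\n\t":
            continue
        # Non-ASCII (code > 127) or non-printing ASCII (code < 32)
        if ord(c) > 127 or ord(c) < 32:
            return True
    return False
```

With these rules, plain text such as `"Sample_1,ATGC\n"` passes, while a string containing an accented character or a control character such as `"\x07"` is flagged.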

auto_process_ngs.samplesheet_utils.predict_outputs(sample_sheet=None, sample_sheet_file=None)

Generate expected sample sheet output in human-readable form

Parameters:
  • sample_sheet (SampleSheet) – if supplied then must be a populated SampleSheet instance (if None then data will be loaded from file specified by sample_sheet_file)

  • sample_sheet_file (str) – if sample_sheet is None then read data from the file specified by this argument

Returns:

text describing the expected projects, sample names, indices, barcodes and lanes.

Return type:

String

auto_process_ngs.samplesheet_utils.set_samplesheet_column(sample_sheet, column, new_value, lanes=None, where=None)

Set the values in a column (optionally to a subset of lines)

Sets the values in a sample sheet column to a new value, which can be either a string, or the value of another column in the same line (by specifying ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ or ‘SAMPLE_ID’ as the value).

By default the value will be updated for all lines in the sample sheet; however, a subset of lines can be selected by specifying a list of lane numbers (only for sample sheets which have a ‘Lane’ column), and/or by requiring that the value in another column matches a supplied glob-style pattern.

Parameters:
  • sample_sheet (str) – either the path to a Sample Sheet file or a ‘SampleSheet’ instance

  • column (str) – name of the column to update, must be either the actual name of a column or one of the special values ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ or ‘SAMPLE_ID’

  • new_value (str) – the new value to set the column to, or one of the special values ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ or ‘SAMPLE_ID’ (to use the value from another column)

  • lanes (list) – if specified then selects the subset of lines where the ‘Lane’ number matches one in the list

  • where (tuple) – if specified then should be a tuple of the form ‘(col,pattern)’ where ‘col’ is a column name (or special value for a column) and ‘pattern’ is a glob-style pattern, which will be used to select a subset of lines

Returns:

instance of the ‘SampleSheet’ class with the updated data.

Return type:

SampleSheet
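The line selection described above (by lane and/or by glob-style pattern) can be sketched on plain dictionaries. This is not the module's implementation: the names `select_lines` and `sketch_set_column` are hypothetical, the rows are ordinary dicts rather than a SampleSheet instance, the column names in the example data are illustrative, the use of `fnmatch` for the glob matching is an assumption, and the special values ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ and ‘SAMPLE_ID’ are not handled.

```python
import fnmatch


def select_lines(lines, lanes=None, where=None):
    """Hypothetical helper: return the subset of lines to update."""
    selected = []
    for line in lines:
        # Skip lines whose lane number is not in the requested list
        if lanes is not None and int(line["Lane"]) not in lanes:
            continue
        # Skip lines where the named column does not match the pattern
        if where is not None:
            col, pattern = where
            if not fnmatch.fnmatchcase(str(line[col]), pattern):
                continue
        selected.append(line)
    return selected


def sketch_set_column(lines, column, new_value, lanes=None, where=None):
    """Hypothetical stand-in: set 'column' to a literal string value."""
    for line in select_lines(lines, lanes=lanes, where=where):
        line[column] = new_value
    return lines


# Illustrative data only
lines = [
    {"Lane": 1, "Sample_ID": "AB1", "Sample_Project": "AB"},
    {"Lane": 1, "Sample_ID": "CD1", "Sample_Project": "CD"},
    {"Lane": 2, "Sample_ID": "AB2", "Sample_Project": "AB"},
]
sketch_set_column(lines, "Sample_Project", "Alice",
                  lanes=[1], where=("Sample_ID", "AB*"))
```

When both `lanes` and `where` are given, the sketch requires a line to satisfy both conditions, so only the first line in the example is updated.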