auto_process_ngs.samplesheet_utils
samplesheet_utils.py
Utilities for handling SampleSheet files:
SampleSheetLinter: core class which provides methods for checking sample sheet contents for potential problems
predict_outputs: generate expected outputs in human-readable form
check_and_warn: check sample sheet for problems and issue warnings
Helper functions:
has_invalid_characters: check if a file contains invalid characters
get_close_names: return closely matching names from a list
set_samplesheet_column: update values in sample sheet columns
- class auto_process_ngs.samplesheet_utils.SampleSheetLinter(sample_sheet=None, sample_sheet_file=None, fp=None)
Class for checking sample sheets for problems
Provides the following methods for checking different aspects of a sample sheet:
close_project_names: check if sample sheet projects look similar
samples_with_multiple_barcodes: check for samples with multiple barcodes
samples_in_multiple_projects: check for samples assigned to multiple projects
has_invalid_lines: check for invalid sample sheet lines
has_invalid_characters: check if sample sheet contains invalid characters
Example usage:
Initialise linter:
>>> linter = SampleSheetLinter(sample_sheet_file="SampleSheet.txt")
Get closely-matching names:
>>> linter.close_project_names()
…
- close_project_names()
Return dictionary of closely-matching project names
- Returns:
keys are project names which have at least one close match; the values for each key are lists with the project names which are close matches.
- Return type:
Dictionary
- has_invalid_barcodes()
Return list of lines with invalid barcodes
- Returns:
list of lines which contain invalid barcode sequences in the sample sheet.
- Return type:
List
- has_invalid_characters()
Check if text file contains any ‘invalid’ characters
In this context a character is ‘invalid’ if:
it is non-ASCII (decimal code > 127), or
it is a non-printing ASCII character (code < 32)
- Returns:
True if file contains at least one invalid character, False if all characters are valid.
- Return type:
Boolean
- has_invalid_lines()
Return list of samplesheet lines which are invalid
- Returns:
list of lines which are invalid (i.e. missing required data) in the sample sheet.
- Return type:
List
- samples_in_multiple_projects()
Return dictionary of samples which are in multiple projects
- Returns:
dictionary with sample IDs which appear in multiple projects as keys; the associated values are lists with the project names.
- Return type:
Dictionary
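The grouping described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's implementation: the function name `find_samples_in_multiple_projects` and the input format (a list of `(sample_id, project_name)` pairs) are assumptions made for the example.

```python
from collections import defaultdict


def find_samples_in_multiple_projects(rows):
    """Map each sample ID to its projects, keeping only IDs seen in 2+ projects.

    'rows' is a hypothetical stand-in for sample sheet lines: an iterable
    of (sample_id, project_name) pairs.
    """
    projects_by_sample = defaultdict(list)
    for sample_id, project in rows:
        # Record each project once per sample, preserving first-seen order
        if project not in projects_by_sample[sample_id]:
            projects_by_sample[sample_id].append(project)
    # Keep only the samples associated with more than one project
    return {s: p for s, p in projects_by_sample.items() if len(p) > 1}


rows = [("S1", "ProjA"), ("S2", "ProjA"), ("S1", "ProjB"), ("S2", "ProjA")]
result = find_samples_in_multiple_projects(rows)
print(result)
```

The same pattern, keyed on sample ID with barcodes as the collected values, would cover the multiple-barcodes check.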
- samples_with_multiple_barcodes()
Return dictionary of samples which have multiple associated barcodes
- Returns:
keys are sample IDs which have more than one associated barcode; the values for each key are lists of the associated barcodes.
- Return type:
Dictionary
- walk()
Traverse the list of projects and samples
Generator that yields tuples consisting of (SampleSheetProject,SampleSheetSample) pairs
- Yields:
Tuple – SampleSheetProject, SampleSheetSample pair
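As a rough illustration of the traversal pattern, the sketch below uses a generator over a plain dictionary. The names `walk_projects` and `projects` are hypothetical, and strings stand in for the SampleSheetProject and SampleSheetSample objects that the real method yields.

```python
def walk_projects(projects):
    """Yield (project, sample) pairs, one per sample in each project.

    'projects' is a hypothetical mapping of project name -> list of
    sample names, used here in place of the real project/sample objects.
    """
    for project, samples in projects.items():
        for sample in samples:
            yield (project, sample)


projects = {"ProjA": ["S1", "S2"], "ProjB": ["S3"]}
pairs = list(walk_projects(projects))
print(pairs)
```

Because the function is a generator, callers can iterate over the pairs lazily instead of building the full list first.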
- auto_process_ngs.samplesheet_utils.barcode_is_10xgenomics(s)
Check if sample sheet barcode is 10xGenomics sample set ID
10xGenomics sample set IDs have a form such as ‘SI-P03-C9’ or ‘SI-GA-B3’.
- Parameters:
s (str) – barcode sequence to validate
- Returns:
True if barcode is 10xGenomics sample set ID, False if not.
- Return type:
Boolean
- auto_process_ngs.samplesheet_utils.barcode_is_valid(s)
Check if a sample sheet barcode sequence is valid
Valid barcodes must consist only of the uppercase letters A, T, G or C, in any order.
10xGenomics sample set IDs of the form e.g. ‘SI-P03-C9’ or ‘SI-GA-B3’ are also considered to be valid.
- Parameters:
s (str) – barcode sequence to validate
- Returns:
True if barcode is valid, False if not.
- Return type:
Boolean
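The two rules above could be expressed with regular expressions along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the 10xGenomics pattern is inferred only from the two example IDs given here, so the real check may accept a different set of strings.

```python
import re

# Standard barcodes: one or more of the uppercase letters A, T, G, C
STANDARD_BARCODE = re.compile(r"[ATGC]+")

# Assumed shape of a 10xGenomics sample set ID, generalised from the
# examples 'SI-P03-C9' and 'SI-GA-B3' (not taken from the library)
TENX_SAMPLE_SET_ID = re.compile(r"SI-[A-Z0-9]+-[A-Z0-9]+")


def looks_like_10x_id(s):
    """Return True if 's' matches the assumed 10xGenomics ID pattern."""
    return TENX_SAMPLE_SET_ID.fullmatch(s) is not None


def looks_like_valid_barcode(s):
    """Return True for an A/T/G/C-only barcode or an assumed 10x ID."""
    return STANDARD_BARCODE.fullmatch(s) is not None or looks_like_10x_id(s)


for barcode in ("ATGCATGC", "atgc", "ATGN", "SI-P03-C9", "SI-GA-B3"):
    print(barcode, looks_like_valid_barcode(barcode))
```

Using `fullmatch` rather than `match` ensures the whole string must satisfy the pattern, so trailing characters are rejected.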
- auto_process_ngs.samplesheet_utils.check_and_warn(sample_sheet=None, sample_sheet_file=None)
Check for sample sheet problems and issue warnings
The following checks are performed:
closely matching project names
samples with more than one barcode assigned
samples associated with more than one project
invalid lines
invalid characters
invalid barcodes
- Parameters:
sample_sheet (SampleSheet) – if supplied then must be a populated SampleSheet instance (if None then data will be loaded from file specified by sample_sheet_file)
sample_sheet_file (str) – if sample_sheet is None then read data from the file specified by this argument
- Returns:
True if problems were identified, False otherwise.
- Return type:
Boolean
- auto_process_ngs.samplesheet_utils.get_close_names(names)
Given a list of names, find pairs which are similar
- Returns:
keys are names which have at least one close match; the values for each key are lists with the close matches.
- Return type:
Dictionary
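One way to produce a dictionary of this shape is with `difflib.get_close_matches` from the Python standard library. The sketch below is an assumption about how such matching could be done, not a statement of how `get_close_names` works internally; the function name `find_close_names` and the `cutoff` value of 0.8 are chosen purely for illustration.

```python
import difflib


def find_close_names(names, cutoff=0.8):
    """Return {name: [close matches]} for names with at least one close match.

    'cutoff' is an illustrative similarity threshold between 0 and 1.
    """
    close = {}
    for name in names:
        # Compare each name against all the other names in the list
        others = [n for n in names if n != name]
        if not others:
            continue
        matches = difflib.get_close_matches(
            name, others, n=len(others), cutoff=cutoff
        )
        if matches:
            close[name] = matches
    return close


names = ["Andrew_Smith", "Andrew_Smyth", "Bob_Jones"]
close = find_close_names(names)
print(close)
```

Each member of a similar pair appears as a key with the other as its match, which fits the description of the return value above.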
- auto_process_ngs.samplesheet_utils.has_invalid_characters(filen=None, text=None)
Check if text file contains any ‘invalid’ characters
In this context a character is ‘invalid’ if:
it is non-ASCII (decimal code > 127), or
it is a non-printing ASCII character (code < 32)
- Returns:
True if file contains at least one invalid character, False if all characters are valid.
- Return type:
Boolean
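The character rule above can be sketched directly with `ord`. The function name `text_has_invalid_characters` is hypothetical. The text here does not say whether line endings and tabs are exempt, so the sketch assumes they are (otherwise any multi-line file would be flagged) and exposes that assumption as the `allowed_control` parameter.

```python
def text_has_invalid_characters(text, allowed_control="\n\r\t"):
    """Return True if 'text' has a non-ASCII or non-printing character.

    'allowed_control' lists control characters to ignore; exempting
    newline, carriage return and tab is an assumption of this sketch.
    """
    for c in text:
        if c in allowed_control:
            continue
        code = ord(c)
        # Invalid if non-ASCII (> 127) or non-printing ASCII (< 32)
        if code > 127 or code < 32:
            return True
    return False


print(text_has_invalid_characters("Sample_1,ATGC\n"))
print(text_has_invalid_characters("Sample\u00e9,ATGC\n"))
```

Passing `allowed_control=""` applies the rule strictly, so that every character with a code below 32 counts as invalid.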
- auto_process_ngs.samplesheet_utils.predict_outputs(sample_sheet=None, sample_sheet_file=None)
Generate expected sample sheet output in human-readable form
- Parameters:
sample_sheet (SampleSheet) – if supplied then must be a populated SampleSheet instance (if None then data will be loaded from file specified by sample_sheet_file)
sample_sheet_file (str) – if sample_sheet is None then read data from the file specified by this argument
- Returns:
text describing the expected projects, sample names, indices, barcodes and lanes.
- Return type:
String
- auto_process_ngs.samplesheet_utils.set_samplesheet_column(sample_sheet, column, new_value, lanes=None, where=None)
Set the values in a column (optionally to a subset of lines)
Sets the values in a sample sheet column to a new value, which can be either a string, or the value of another column in the same line (by specifying ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ or ‘SAMPLE_ID’ as the value).
By default the value will be updated for all lines in the sample sheet; however, a subset of lines can be selected by specifying a list of lane numbers (only for sample sheets which have a ‘Lane’ column), and/or by requiring that the value in another column matches a supplied glob-style pattern.
- Parameters:
sample_sheet (str) – either the path to a Sample Sheet file or a ‘SampleSheet’ instance
column (str) – name of the column to update, must be either the actual name of a column or one of the special values ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ or ‘SAMPLE_ID’
new_value (str) – the new value to set the column to, or one of the special values ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ or ‘SAMPLE_ID’ (to use the value from another column)
lanes (list) – if specified then selects the subset of lines where the ‘Lane’ number matches one in the list
where (tuple) – if specified then should be a tuple of the form ‘(col,pattern)’ where ‘col’ is a column name (or special value for a column) and ‘pattern’ is a glob-style pattern, which will be used to select a subset of lines
- Returns:
instance of the ‘SampleSheet’ class with the updated data.
- Return type:
SampleSheet
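The line-selection behaviour of the `lanes` and `where` arguments can be illustrated with a self-contained sketch. It is not the library's code: the function name `set_column` is hypothetical, a list of dictionaries stands in for the sample sheet lines, and the special values ‘SAMPLE_PROJECT’, ‘SAMPLE_NAME’ and ‘SAMPLE_ID’ are deliberately not handled. Glob matching uses `fnmatch.fnmatchcase` from the standard library.

```python
import fnmatch


def set_column(lines, column, new_value, lanes=None, where=None):
    """Set 'column' to 'new_value' on the selected lines (modified in place).

    'lines' is a list of dicts keyed by column name, a hypothetical
    stand-in for the lines of a sample sheet.
    """
    for line in lines:
        # Optional filter: only lines whose 'Lane' is in the list
        if lanes is not None and int(line["Lane"]) not in lanes:
            continue
        # Optional filter: only lines where 'col' matches the glob pattern
        if where is not None:
            col, pattern = where
            if not fnmatch.fnmatchcase(str(line[col]), pattern):
                continue
        line[column] = new_value
    return lines


lines = [
    {"Lane": 1, "Sample_ID": "AB1", "Sample_Project": "Old"},
    {"Lane": 2, "Sample_ID": "AB2", "Sample_Project": "Old"},
    {"Lane": 2, "Sample_ID": "CD1", "Sample_Project": "Old"},
]
set_column(lines, "Sample_Project", "New", lanes=[2], where=("Sample_ID", "AB*"))
print([line["Sample_Project"] for line in lines])
```

When both filters are given, a line must satisfy both to be updated, which is one reading of the "and/or" in the description above.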