auto_process_ngs.analysis
Classes and functions for handling analysis directories and projects.
Classes:
AnalysisFastq: extract information from Fastq name
AnalysisDir: API for sequencing run analysis directory
AnalysisProject: API for project in an analysis directory
AnalysisSample: API for sample in an analysis project
Functions:
run_id: fetch run ID for sequencing run
split_sample_name: split sample name into components
split_sample_reference: split sample reference ID into components
match_run_id: check if directory matches run identifier
locate_run: search for an analysis directory by ID
locate_project: search for an analysis project by ID
locate_project_info_file: search for project metadata file (‘README.info’)
copy_analysis_project: make copy of an AnalysisProject
- class auto_process_ngs.analysis.AnalysisDir(analysis_dir)
Class describing an analysis directory
Conceptually an analysis directory maps onto a sequencing run. It consists of one or more sets of samples from that run, which are represented by subdirectories.
It is also possible to have one or more subdirectories containing outputs from the CASAVA or bclToFastq processing software.
Properties:
analysis_dir: full path to the directory
run_name: the name of the parent sequencing run
metadata: metadata items associated with the run
projects: list of AnalysisProject objects
undetermined: AnalysisProject object for ‘undetermined’
sequencing_data: list of IlluminaData objects
projects_metadata: metadata from the ‘projects.info’ file
datestamp: datestamp extracted from run name
instrument_name: instrument name extracted from run name
instrument_run_number: run number extracted from run name
n_projects: number of projects
n_sequencing_data: number of sequencing data directories
paired_end: whether data are paired ended
- Parameters:
analysis_dir (str) – name (and path) to analysis directory
- property analysis_dir
Return the path to the analysis directory
- get_projects(pattern=None, include_undetermined=True)
Return the analysis projects in a list
By default returns all projects within the analysis
If the ‘pattern’ is not None then it should be a simple pattern used to match against available names to select a subset of projects (see bcf_utils.name_matches).
If ‘include_undetermined’ is True then the undetermined project will also be included; otherwise it will be omitted.
- Parameters:
pattern (str) – (optional) glob-style pattern to match project names against
include_undetermined (bool) – if True (the default) then include the ‘Undetermined’ project
- Returns:
list of AnalysisProject instances.
- Return type:
List
- property n_projects
Return number of projects found
- property n_sequencing_data
Return number of sequencing data dirs found
- property paired_end
Return True if run is paired end, False if single end
- class auto_process_ngs.analysis.AnalysisFastq(fastq)
Class for extracting information about Fastq files
Given the name of a Fastq file, extract data about the sample name, barcode sequence, lane number, read number and set number.
Uses the IlluminaFastqAttrs class to handle Fastq filenames which consist of a valid Fastq name as defined by IlluminaFastqAttrs, but with additional elements appended.
Instances of this class have the following attributes (defined in the base class):
fastq: the original fastq file name
basename: basename with NGS extensions stripped
extension: full extension e.g. ‘.fastq.gz’
sample_name: name of the sample
sample_number: integer (or None if no sample number)
barcode_sequence: barcode sequence (string or None)
lane_number: integer (or None if no lane number)
read_number: integer (or None if no read number)
set_number: integer (or None if no set number)
is_index_read: boolean (True if index read, False if not)
There are four additional attributes:
format: string identifying the format of the Fastq name (‘Illumina’, ‘SRA’, or None)
implicit_read_number: flag indicating whether the read number was implied (i.e. doesn’t appear explicitly in the name)
canonical_name: the ‘canonical’ part of the name (string, or None if no canonical part could be extracted)
extras: the ‘extra’ part of the name (string, or None if there was no trailing extra part)
- Parameters:
fastq (str) – path or name of Fastq file
- property canonical_name
Return the ‘canonical’ part of the name
- class auto_process_ngs.analysis.AnalysisProject(name, dirn=None, user=None, PI=None, library_type=None, single_cell_platform=None, organism=None, run=None, comments=None, platform=None, sequencer_model=None, fastq_attrs=None, fastq_dir=None)
Class describing an analysis project
Conceptually an analysis project consists of a set of samples from a single sequencing experiment, plus associated data e.g. QC results.
Practically an analysis project is represented by a directory with a set of fastq files.
Provides the following properties:
name: name of the project
dirn: associated directory (full path)
fastq_dirs: list of all subdirectories with fastq files (relative to dirn)
fastq_dir: directory with ‘active’ fastq file set (full path)
fastqs: list of fastq files in fastq_dir
samples: list of AnalysisSample objects generated from fastq_dir
multiple_fastqs: True if at least one sample has more than one fastq file per read associated with it
fastq_format: either ‘fastqgz’ or ‘fastq’
There is also an ‘info’ property with the following additional properties:
run: run name
user: user name
PI: PI name
library_type: library type, either None or e.g. ‘RNA-seq’ etc
single_cell_platform: single cell prep platform, either None or ‘ICell8’ etc
number of cells: number of cells in single cell projects
ICELL8 well list: well list file for ICELL8 single cell projects
organism: organism, either None or e.g. ‘Human’ etc
platform: sequencing platform, either None or e.g. ‘miseq’ etc
comments: additional comments, either None or else string of text
paired_end: True if data is paired end, False if not
primary_fastq_dir: subdirectory holding the ‘primary’ fastq set
sequencer_model: model of sequencer used to generate the data
It is possible for a project to have multiple sets of associated fastq files, held within separate subdirectories of the project directory. A list of subdirectory names with fastq sets can be accessed via the ‘fastq_dirs’ property.
The ‘active’ fastq set defaults to the ‘primary’ set (taken from the ‘primary_fastq_dir’ info property). An alternative active set can be specified using the ‘fastq_dir’ argument when instantiating the AnalysisProject; the active fastq set can also be switched for an existing AnalysisProject using the ‘use_fastq_dir’ method.
The directory holding the primary fastq set is taken from the ‘fastq_dir’ argument of the ‘create_project’ method when creating the project directory (by default this is the ‘fastqs’ subdirectory of the project directory). It can be changed using the ‘set_primary_fastq_dir’ method.
- Parameters:
name (str) – name of the project (or path to project directory, if ‘dirn’ not supplied)
dirn (str) – optional, project directory (can be full or relative path)
user (str) – optional, specify name of the user
PI (str) – optional, specify name of the principal investigator(s)
library_type (str) – optional, specify library type e.g. ‘RNA-seq’, ‘miRNA’ etc
single_cell_platform (str) – optional, specify single cell preparation platform e.g. ‘Icell8’, ‘10xGenomics’ etc
organism (str) – optional, specify organism e.g. ‘Human’, ‘Mouse’ etc (separate multiple organisms with ‘;’, use ‘?’ if organism is not known)
platform (str) – optional, specify sequencing platform e.g ‘miseq’
run (str) – optional, name of the run
comments (str) – optional, free text comments associated with the run (separate multiple comments with ‘;’)
fastq_attrs (BaseFastqAttrs) – optional, specify a class to use to get attributes from a Fastq file name (e.g. sample name, read number etc). Defaults to ‘AnalysisFastq’.
fastq_dir (str) – optional, explicitly specify the subdirectory holding the set of Fastq files to load; defaults to ‘fastq’ (if present) or to the top-level of the project directory (if absent).
- create_directory(illumina_project=None, fastqs=None, fastq_dir=None, short_fastq_names=False, link_to_fastqs=False)
Create and populate analysis directory for an IlluminaProject
Creates a new directory corresponding to the AnalysisProject object, and optionally also populates with links to FASTQ files from a supplied IlluminaProject object.
The directory structure it creates is:
dir/
    fastqs/
    logs/
    ScriptCode/
It also creates an info file with metadata about the project.
- Parameters:
illumina_project (IlluminaProject) – (optional) populated IlluminaProject object from which the analysis directory will be populated
fastqs (list) – (optional) list of Fastq files to import
fastq_dir (str) – (optional) name of subdirectory to put Fastq files into; defaults to ‘fastqs’
short_fastq_names (bool) – (optional) if True then transform Fastq file names to be the shortest possible unique names; if False (default) then use the original Fastq names
link_to_fastqs (bool) – (optional) if True then make symbolic links to the Fastq files; if False (default) then make hard links
- determine_fastq_format(fastq)
Return type for Fastq file (‘fastq’ or ‘fastqgz’)
- Parameters:
fastq (str) – path or name of Fastq file
- Returns:
either ‘fastqgz’ or ‘fastq’.
- Return type:
String
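The mapping from file name to format string can be illustrated with a minimal sketch. It assumes the decision is made purely on the '.gz' extension; the function name is hypothetical and this is not the library's implementation.

```python
def fastq_format_sketch(fastq):
    """Return 'fastqgz' for gzipped Fastq names, 'fastq' otherwise.

    Hypothetical helper for illustration only; assumes the format is
    decided from the file extension alone.
    """
    return "fastqgz" if fastq.endswith(".gz") else "fastq"


print(fastq_format_sketch("PJB1_S1_R1_001.fastq.gz"))  # fastqgz
print(fastq_format_sketch("PJB1_S1_R1_001.fastq"))     # fastq
```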
- determine_paired_end()
Return whether or not project has paired end samples
- property exists
Check if analysis project directory already exists
- property fastqs
Return a list of Fastqs
- property fastqs_are_symlinks
Return True if Fastq files are symbolic links, False if not
- find_fastqs(dirn)
Return list of Fastq files found in directory
- Parameters:
dirn (str) – path to directory to search
- Returns:
list of Fastq file names.
- Return type:
List
- get_sample(name)
Return sample that matches ‘name’
- Parameters:
name (str) – name of a sample
- Returns:
sample object with the matching name
- Return type:
AnalysisSample
- Raises:
KeyError – if no match is found.
- get_samples(pattern)
Return list of samples matching pattern
- Parameters:
pattern (str) – simple ‘glob’ style pattern
- Returns:
list of samples with names matching the supplied pattern (or an empty list if no names match).
- Return type:
List
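As a rough illustration of glob-style selection, the sketch below filters a list of names with the standard library's fnmatch module. This is an assumption made for the example: the library uses its own name-matching helper, whose rules may differ, and the function name here is hypothetical.

```python
from fnmatch import fnmatchcase


def select_names_sketch(names, pattern):
    """Return the names matching a glob-style pattern (empty list if none).

    Hypothetical helper; uses fnmatchcase as a stand-in for the
    library's own matching function.
    """
    return [name for name in names if fnmatchcase(name, pattern)]


print(select_names_sketch(["PJB1", "PJB2", "KL3"], "PJB*"))  # ['PJB1', 'PJB2']
```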
- property is_analysis_dir
Determine if directory really is an analysis project
This is a strict test:
the project must contain Fastqs
the project must contain a valid metadata file
- property multiple_fastqs
Determine if there are multiple Fastqs per sample
- populate(fastq_dir=None)
Populate data structure from directory contents
- Parameters:
fastq_dir (str) – (optional) specify the subdirectory with Fastq files to use for populating the ‘AnalysisProject’
- prettyPrintSamples()
Return a nicely formatted string describing the sample names
Wraps a call to ‘pretty_print_names’ function.
- Returns:
pretty description of sample names.
- Return type:
String
- property qc_dir
Return path to default QC outputs directory
- property qc_dirs
List QC output directories
- qc_info(qc_dir)
Fetch the metadata object for the QC dir
- Parameters:
qc_dir (str) – path to QC outputs directory
- Returns:
metadata object with the metadata for the QC directory.
- Return type:
- sample_summary()
Generate a summary of the sample names
Generates a description string which summarises the number and names of samples in the project.
The description is of the form:
2 samples (PJB1, PJB2)
- Returns:
summary of sample names.
- Return type:
String
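The description format can be sketched as follows. This is an illustrative assumption rather than the library's code: it simply joins the names with commas, and the handling of zero or one sample is a guess. The function name is hypothetical.

```python
def sample_summary_sketch(sample_names):
    """Build a string such as '2 samples (PJB1, PJB2)'.

    Hypothetical helper; the singular and empty cases are assumptions.
    """
    n = len(sample_names)
    if n == 0:
        return "No samples"
    noun = "sample" if n == 1 else "samples"
    return "%d %s (%s)" % (n, noun, ", ".join(sample_names))


print(sample_summary_sketch(["PJB1", "PJB2"]))  # 2 samples (PJB1, PJB2)
```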
- set_primary_fastq_dir(new_primary_fastq_dir)
Update the primary fastq directory for the project
This sets the primary fastq directory (aka primary fastq set) to the specified name, which must be a subdirectory of the project directory.
Updating the primary fastq directory also causes the ‘samples’ metadata item for the project to be updated.
Relative paths are assumed to be subdirectories of the project directory.
Note that it doesn’t change the active fastq set; use the ‘use_fastq_dir’ method to do this.
- Parameters:
new_primary_fastq_dir (str) – path to the (sub)directory to be treated as the primary Fastq directory for the project
- Raises:
Exception – if specified directory doesn’t exist.
- setup_qc_dir(qc_dir=None, fastq_dir=None)
Set up a QC outputs directory
Creates a QC outputs directory with a metadata file ‘qc.info’.
- Parameters:
qc_dir (str) – path to QC outputs directory to set up. If a relative path is supplied then it is assumed to be relative to the analysis project directory. If ‘None’ then defaults to the current ‘qc_dir’ for the project.
fastq_dir (str) – set the associated source Fastq directory (optional). If ‘None’ then defaults to the previously associated fastq_dir for the QC dir (or the current ‘fastq_dir’ for the project if that isn’t set).
- Returns:
full path to the QC directory.
- Return type:
String
- Raises:
Exception – if previously stored Fastq source dir doesn’t match the one supplied via ‘fastq_dir’.
- use_fastq_dir(fastq_dir=None, strict=True)
Switch fastq directory and repopulate
Switch to a specified source fastq dir, or to the primary fastq dir if none is supplied.
Relative paths are assumed to be subdirectories of the project directory.
- Parameters:
fastq_dir (str) – path to the fastq dir to switch to; must be a subdirectory of the project, otherwise an exception is raised unless ‘strict’ is set to False
strict (bool) – if True (the default) then ‘fastq_dir’ must resolve to a subdirectory of the project; otherwise an exception is raised. Setting ‘strict’ to False allows the fastq dir to be outside of the project
- Raises:
Exception – if specified directory is not a subdirectory of the project.
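The path handling described for 'fastq_dir' and 'strict' can be sketched with the standard library alone. The function name and the purely lexical subdirectory test are assumptions for illustration, not the library's implementation.

```python
import os


def resolve_fastq_dir_sketch(project_dir, fastq_dir, strict=True):
    """Resolve 'fastq_dir' against the project directory.

    Hypothetical helper: relative paths are treated as subdirectories
    of the project; in strict mode a path outside the project raises
    an exception.
    """
    project_dir = os.path.normpath(project_dir)
    if not os.path.isabs(fastq_dir):
        fastq_dir = os.path.join(project_dir, fastq_dir)
    fastq_dir = os.path.normpath(fastq_dir)
    # Lexical check only (assumption): symlinks are not resolved
    inside = fastq_dir.startswith(project_dir + os.sep)
    if strict and not inside:
        raise Exception("%s: not a subdirectory of %s"
                        % (fastq_dir, project_dir))
    return fastq_dir
```

With 'strict' set to False the same call returns the normalised path even when it lies outside the project.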
- use_qc_dir(qc_dir)
Switch the default QC outputs directory
- Parameters:
qc_dir (str) – path to new default QC outputs directory. If a relative path is supplied then it is assumed to be relative to the analysis project directory.
- class auto_process_ngs.analysis.AnalysisSample(name, fastq_attrs=None)
Class describing an analysis sample
An analysis sample consists of a set of Fastq files corresponding to a single sample.
AnalysisSample has the following properties:
name: name of the sample
fastq: list of Fastq files associated with the sample
paired_end: True if sample is paired end, False if not
Note that the ‘fastq’ list will include any index read fastqs (i.e. I1/I2) as well as R1/R2 fastqs.
- Parameters:
name (str) – sample name
fastq_attrs (BaseFastqAttrs) – optional, specify a class to use to get attributes from a Fastq file name (e.g. sample name, read number etc). Defaults to ‘AnalysisFastq’.
- add_fastq(fastq)
Add a reference to a Fastq file in the sample
- Parameters:
fastq (str) – full path for the Fastq file
- fastq_subset(read_number=None)
Return a subset of Fastq files from the sample
Note that only R1/R2 files will be returned; index read fastqs (i.e. I1/I2) are excluded regardless of read number.
- Parameters:
read_number (int) – select subset based on read_number (1 or 2)
- Returns:
list of full paths to Fastq files matching the selection criteria.
- Return type:
List
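The selection rule (R1/R2 only, index reads always excluded) might look like the sketch below. It assumes Illumina-style names such as 'PJB1_S1_R1_001.fastq.gz' and parses them with a simplified regular expression; the real class delegates name parsing to the 'fastq_attrs' class, and skipping unparseable names is a choice made only for this example.

```python
import os
import re

# Simplified pattern (assumption): '_R1'/'_I1' etc. before the extension
READ_RE = re.compile(r"_([RI])([0-9])(?:_[0-9]+)?\.fastq(?:\.gz)?$")


def fastq_subset_sketch(fastqs, read_number=None):
    """Return R1/R2 Fastqs, optionally restricted to one read number.

    Hypothetical helper; index reads (I1/I2) are always dropped.
    """
    subset = []
    for fq in fastqs:
        match = READ_RE.search(os.path.basename(fq))
        if match is None or match.group(1) == "I":
            continue
        if read_number is None or int(match.group(2)) == read_number:
            subset.append(fq)
    return sorted(subset)
```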
- property fastqs_are_symlinks
Return True if Fastq files are symlinked, False if not
- auto_process_ngs.analysis.copy_analysis_project(project, fastq_dir=None)
Make a copy of an AnalysisProject instance
- Parameters:
project (AnalysisProject) – project instance to copy
fastq_dir (str) – if set then specifies the Fastq subdirectory to use in the new instance
- Returns:
new AnalysisProject instance which is a copy of the one supplied on input.
- Return type:
AnalysisProject
- auto_process_ngs.analysis.locate_project(project_id, start_dir=None, ascend=False)
Locate an analysis project
Searches the file system to locate an analysis project which matches the supplied project identifier.
The identifier can be any one of:
a path to an analysis project (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis/AB’)
a valid project identifier (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis:AB’, ‘HISEQ_201029#123:AB’ etc)
a project name
- Parameters:
project_id (str) – identifier for the project to locate
start_dir (str) – optional path to start searching from (defaults to the current directory)
ascend (bool) – if True then search by ascending into parent directories of ‘start_dir’ (default is to search by descending into its subdirectories)
- Returns:
path to the analysis project, or None if the specified project can’t be located.
- Return type:
String
- auto_process_ngs.analysis.locate_project_info_file(start_dir)
Locate project metadata file
Searches the starting directory and its parents for a project metadata file (‘README.info’), ascending through directory levels until either a valid metadata file is found or the root of the filesystem is reached.
- Parameters:
start_dir (str) – path of directory to start searching from
- Returns:
path to the metadata file, or ‘None’ if no file can be located.
- Return type:
String
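The upward search can be sketched with the standard library as below. This is an illustration under stated assumptions, not the library's code: the function name is hypothetical, and the sketch only checks that the file exists, whereas the real function also requires the metadata file to be valid.

```python
import os


def locate_info_file_sketch(start_dir, filename="README.info"):
    """Search 'start_dir' and its parents for 'filename'.

    Hypothetical helper; returns the full path to the first file
    found, or None if the filesystem root is reached without a match.
    """
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, filename)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the root of the filesystem
            return None
        current = parent
```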
- auto_process_ngs.analysis.locate_run(run, start_dir=None, ascend=False)
Locate an analysis directory
Searches the file system to locate an analysis directory which matches the supplied run identifier.
The identifier can be any one of:
a path to an analysis directory (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis’)
the name of an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis’)
the name of a sequencing run associated with an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX’)
a run identifier (e.g. ‘HISEQ_201029#123’)
If the run identifier is a wildcard (‘*’) then the first valid analysis directory that is encountered on the search path will be matched; note that this may result in non-deterministic behaviour.
- Parameters:
run (str) – identifier for the run to locate
start_dir (str) – optional path to start searching from (defaults to the current directory)
ascend (bool) – if True then search by ascending into parent directories of ‘start_dir’ (default is to search by descending into its subdirectories)
- Returns:
path to the analysis directory, or None if the specified analysis directory can’t be located.
- Return type:
String
- auto_process_ngs.analysis.match_run_id(run, d)
Check if a directory matches a run identifier
The identifier can be any one of:
a path to an analysis directory (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis’)
the name of an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis’)
the name of a sequencing run associated with an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX’)
a run identifier (e.g. ‘HISEQ_201029#123’)
If the identifier is a wildcard (‘*’) then any valid analysis directory will be a match.
- Parameters:
run (str) – run identifier
d (str) – path to a directory to check against the supplied identifier
- Returns:
True if directory matches run ID, False if not.
- Return type:
Boolean
- auto_process_ngs.analysis.run_id(run_name, platform=None, facility_run_number=None, analysis_number=None)
Return a run ID e.g. ‘HISEQ_140701/242#22’
The run ID is a code that identifies the sequencing run, and has the general form:
PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER[.ANALYSIS_NUMBER]
PLATFORM is always uppercased e.g. HISEQ, MISEQ, GA2X
DATESTAMP is the YYMMDD code e.g. 140701
INSTRUMENT_RUN_NUMBER is the run number that forms part of the run directory e.g. for ‘140701_SN0123_0045_000000000-A1BCD’ it is ‘45’
FACILITY_RUN_NUMBER is the run number that has been assigned by the facility
ANALYSIS_NUMBER is an optional number assigned to the analysis to distinguish it from other analysis attempts (for example, if a run is reprocessed at a later date with updated software)
Note that the instrument run number is only used if it differs from the facility run number.
If the platform isn’t supplied then the instrument name is used instead, e.g.:
SN0123_140701/242#22
If the run name can’t be split into components then the general form will be:
[PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]
depending on whether platform and/or facility run number have been supplied. For example for a run called ‘rag_05_2017’:
MISEQ_rag_05_2017#90
- Parameters:
run_name (str) – the run name (can be a path)
platform (str) – the platform name (optional)
facility_run_number (int) – the run number assigned by the local facility (can be different from the instrument run number) (optional)
analysis_number (int) – number assigned to this analysis to distinguish it from other analysis attempts (optional)
- Returns:
run ID.
- Return type:
String
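The rules above can be collected into a short sketch. It is an illustration only: the function name is hypothetical, the run name is split with a simple 'DATESTAMP_INSTRUMENT_RUNNUMBER_...' test, and falling back to the instrument run number when no facility run number is supplied is an assumption not stated in the description.

```python
import os


def run_id_sketch(run_name, platform=None, facility_run_number=None,
                  analysis_number=None):
    """Assemble a run ID following the documented general forms."""
    name = os.path.basename(run_name.rstrip("/"))
    fields = name.split("_")
    if len(fields) >= 3 and fields[0].isdigit() and fields[2].isdigit():
        datestamp, instrument = fields[0], fields[1]
        run_number = int(fields[2])
        prefix = platform.upper() if platform else instrument
        if facility_run_number is None:
            # Assumption: reuse the instrument run number
            facility_run_number = run_number
        rid = "%s_%s" % (prefix, datestamp)
        if run_number != facility_run_number:
            rid += "/%d" % run_number
        rid += "#%d" % facility_run_number
        if analysis_number is not None:
            rid += ".%d" % analysis_number
        return rid
    # Run name can't be split: [PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]
    rid = name
    if platform:
        rid = "%s_%s" % (platform.upper(), rid)
    if facility_run_number is not None:
        rid += "#%d" % facility_run_number
    return rid


print(run_id_sketch("140701_SN0123_0242_000000000-A1BCD",
                    platform="hiseq", facility_run_number=22))
# HISEQ_140701/242#22
```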
- auto_process_ngs.analysis.split_sample_name(s)
Split sample name into numerical and non-numerical parts
Utility function which splits the supplied sample name into numerical (i.e. integer) and non-numerical (i.e. all other types of character) parts, and returns the parts as a list.
For example:
>>> split_sample_name("PJB_01-123_T004")
['PJB_',1,'-',123,'_T',4]
- Parameters:
s (str) – the sample name to be split
- Returns:
list with the numerical and non-numerical parts of the name.
- Return type:
List
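One way to obtain this behaviour is sketched below using a regular expression split. The function name is hypothetical and the approach is an assumption; the library may implement the split differently.

```python
import re


def split_sample_name_sketch(s):
    """Split a name into integer and non-integer parts, in order."""
    parts = []
    # re.split with a capture group puts the digit runs at odd indices
    for i, token in enumerate(re.split(r"([0-9]+)", s)):
        if token == "":
            continue
        parts.append(int(token) if i % 2 else token)
    return parts


print(split_sample_name_sketch("PJB_01-123_T004"))
# ['PJB_', 1, '-', 123, '_T', 4]
```

Converting the numerical parts to integers means that lists produced this way can be compared so that, for example, 'PJB2' sorts before 'PJB10'.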
- auto_process_ngs.analysis.split_sample_reference(s)
Split a ‘[RUN][:PROJECT[/SAMPLE]]’ reference id
Decomposes a reference id of the form:
[RUN][:PROJECT[/SAMPLE]]
where:
RUN is a run identifier (either a run name e.g. ‘201027_SN00284_0000161_AHXXJHJH’, a reference id e.g. ‘HISEQ_201027/161#122’, or a path to an analysis directory)
PROJECT is the name of a project within the run, and
SAMPLE is a sample name.
A subset of elements can be present, in which case the missing components will be returned as None.
- Parameters:
s (str) – sample reference identifier
- Returns:
tuple of the form (RUN,PROJECT,SAMPLE) extracted from the supplied reference id; missing elements are set to None.
- Return type:
Tuple
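The decomposition can be sketched as follows. The function name is hypothetical, and the sketch assumes that the RUN component never contains a ':' character; because RUN may itself contain '/' (for example a path, or an id like 'HISEQ_201027/161#122'), the string is split on ':' first.

```python
def split_sample_reference_sketch(s):
    """Split '[RUN][:PROJECT[/SAMPLE]]' into (RUN, PROJECT, SAMPLE).

    Hypothetical helper; missing components are returned as None.
    """
    run, sep, rest = s.partition(":")
    project = sample = None
    if sep:
        project, _, sample = rest.partition("/")
    return (run or None, project or None, sample or None)


print(split_sample_reference_sketch("HISEQ_201027/161#122:AB/AB1"))
# ('HISEQ_201027/161#122', 'AB', 'AB1')
```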