`auto_process_ngs.analysis`

Classes and functions for handling analysis directories and projects.

Classes:

AnalysisFastq: extract information from Fastq name
AnalysisDir: API for sequencing run analysis directory
AnalysisProject: API for project in an analysis directory
AnalysisSample: API for sample in an analysis project

Functions:

run_id: fetch run ID for sequencing run
split_sample_name: split sample name into components
split_sample_reference: split sample reference ID into components
match_run_id: check if directory matches run identifier
locate_run: search for an analysis directory by ID
locate_project: search for an analysis project by ID
locate_project_info_file: search for ‘project.info’ file
copy_analysis_project: make copy of an AnalysisProject

class auto_process_ngs.analysis.AnalysisDir(analysis_dir)

Class describing an analysis directory

Conceptually an analysis directory maps onto a sequencing run. It consists of one or more sets of samples from that run, which are represented by subdirectories.

It is also possible to have one or more subdirectories containing outputs from the CASAVA or bclToFastq processing software.

Properties:

analysis_dir: full path to the directory
run_name: the name of the parent sequencing run
metadata: metadata items associated with the run
projects: list of AnalysisProject objects
undetermined: AnalysisProject object for ‘undetermined’
sequencing_data: list of IlluminaData objects
projects_metadata: metadata from the ‘projects.info’ file
datestamp: datestamp extracted from run name
instrument_name: instrument name extracted from run name
instrument_run_number: run number extracted from run name
n_projects: number of projects
n_sequencing_data: number of sequencing data directories
paired_end: whether data are paired ended

Parameters:: analysis_dir (str) – name (and path) to analysis directory

property analysis_dir: Return the path to the analysis directory

get_projects(pattern=None, include_undetermined=True)

Return the analysis projects in a list

By default returns all projects within the analysis

If the ‘pattern’ is not None then it should be a simple pattern used to match against available names to select a subset of projects (see bcf_utils.name_matches).

If ‘include_undetermined’ is True then the undetermined project will also be included; otherwise it will be omitted.

Parameters:

pattern (str) – (optional) glob-style pattern to match project names against
include_undetermined (bool) – if True (the default) then include the ‘Undetermined’ project

Returns:

list of AnalysisProject instances.

Return type:

List
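
The selection rules described for `get_projects` can be illustrated with a small standalone sketch. It is not the library's code: `select_project_names` is a hypothetical helper, the standard library's `fnmatchcase` stands in for `bcf_utils.name_matches` (whose exact matching rules may differ), and both the name 'undetermined' and the choice to apply the pattern to the undetermined project are assumptions.

```python
from fnmatch import fnmatchcase

def select_project_names(names, pattern=None, include_undetermined=True,
                         undetermined="undetermined"):
    """Hypothetical helper: filter project names as get_projects is described."""
    selected = []
    for name in names:
        # Drop the undetermined project when the caller asks for that
        if name == undetermined and not include_undetermined:
            continue
        # No pattern means "everything"; otherwise apply a glob-style match
        if pattern is None or fnmatchcase(name, pattern):
            selected.append(name)
    return selected

names = ["AB", "AC", "BD", "undetermined"]
print(select_project_names(names, pattern="A*"))
print(select_project_names(names, include_undetermined=False))
```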

property n_projects: Return number of projects found

property n_sequencing_data: Return number of sequencing data dirs found

property paired_end: Return True if run is paired end, False if single end

class auto_process_ngs.analysis.AnalysisFastq(fastq)

Class for extracting information about Fastq files

Given the name of a Fastq file, extract data about the sample name, barcode sequence, lane number, read number and set number.

Uses the IlluminaFastqAttrs class to handle Fastq filenames which consist of a valid Fastq name as defined by IlluminaFastqAttrs, but with additional elements appended.

Instances of this class have the following attributes (defined in the base class):

fastq: the original fastq file name
basename: basename with NGS extensions stripped
extension: full extension e.g. ‘.fastq.gz’
sample_name: name of the sample
sample_number: integer (or None if no sample number)
barcode_sequence: barcode sequence (string or None)
lane_number: integer (or None if no lane number)
read_number: integer (or None if no read number)
set_number: integer (or None if no set number)
is_index_read: boolean (True if index read, False if not)

There are four additional attributes:

format: string identifying the format of the Fastq name (‘Illumina’, ‘SRA’, or None)
implicit_read_number: flag indicating whether the read number was implied (i.e. doesn’t appear explicitly in the name)
canonical_name: the ‘canonical’ part of the name (string, or None if no canonical part could be extracted)
extras: the ‘extra’ part of the name (string, or None if there was no trailing extra part)

Parameters:: fastq (str) – path or name of Fastq file

property canonical_name: Return the ‘canonical’ part of the name

class auto_process_ngs.analysis.AnalysisProject(name, dirn=None, user=None, PI=None, library_type=None, single_cell_platform=None, organism=None, run=None, comments=None, platform=None, sequencer_model=None, fastq_attrs=None, fastq_dir=None)

Class describing an analysis project

Conceptually an analysis project consists of a set of samples from a single sequencing experiment, plus associated data e.g. QC results.

Practically an analysis project is represented by a directory with a set of fastq files.

Provides the following properties:

name: name of the project
dirn: associated directory (full path)
fastq_dirs: list of all subdirectories with fastq files (relative to dirn)
fastq_dir: directory with ‘active’ fastq file set (full path)
fastqs: list of fastq files in fastq_dir
samples: list of AnalysisSample objects generated from fastq_dir
multiple_fastqs: True if at least one sample has more than one fastq file per read associated with it
fastq_format: either ‘fastqgz’ or ‘fastq’

There is also an ‘info’ property with the following additional properties:

run: run name
user: user name
PI: PI name
library_type: library type, either None or e.g. ‘RNA-seq’ etc
single_cell_platform: single cell prep platform, either None or ‘ICell8’ etc
number of cells: number of cells in single cell projects
ICELL8 well list: well list file for ICELL8 single cell projects
organism: organism, either None or e.g. ‘Human’ etc
platform: sequencing platform, either None or e.g. ‘miseq’ etc
comments: additional comments, either None or else string of text
paired_end: True if data is paired end, False if not
primary_fastq_dir: subdirectory holding the ‘primary’ fastq set
sequencer_model: model of sequencer used to generate the data

It is possible for a project to have multiple sets of associated fastq files, held within separate subdirectories of the project directory. A list of subdirectory names with fastq sets can be accessed via the ‘fastq_dirs’ property.

The ‘active’ fastq set defaults to the ‘primary’ set (taken from the ‘primary_fastq_dir’ info property). An alternative active set can be specified using the ‘fastq_dir’ argument when instantiating the AnalysisProject; the active fastq set can also be switched for an existing AnalysisProject using the ‘use_fastq_dir’ method.

The directory holding the primary fastq set is taken from the ‘fastq_dir’ argument of the ‘create_directory’ method when creating the project directory (by default this is the ‘fastqs’ subdirectory of the project directory). It can be changed using the ‘set_primary_fastq_dir’ method.

Parameters:

name (str) – name of the project (or path to project directory, if ‘dirn’ not supplied)
dirn (str) – optional, project directory (can be full or relative path)
user (str) – optional, specify name of the user
PI (str) – optional, specify name of the principal investigator(s)
library_type (str) – optional, specify library type e.g. ‘RNA-seq’, ‘miRNA’ etc
single_cell_platform (str) – optional, specify single cell preparation platform e.g. ‘Icell8’, ‘10xGenomics’ etc
organism (str) – optional, specify organism e.g. ‘Human’, ‘Mouse’ etc (separate multiple organisms with ‘;’, use ‘?’ if organism is not known)
platform (str) – optional, specify sequencing platform e.g. ‘miseq’
run (str) – optional, name of the run
comments (str) – optional, free text comments associated with the run (separate multiple comments with ‘;’)
fastq_attrs (BaseFastqAttrs) – optional, specify a class to use to get attributes from a Fastq file name (e.g. sample name, read number etc). Defaults to ‘AnalysisFastq’.
fastq_dir (str) – optional, explicitly specify the subdirectory holding the set of Fastq files to load; defaults to ‘fastq’ (if present) or to the top-level of the project directory (if absent).

create_directory(illumina_project=None, fastqs=None, fastq_dir=None, short_fastq_names=False, link_to_fastqs=False)

Create and populate analysis directory for an IlluminaProject

Creates a new directory corresponding to the AnalysisProject object, and optionally also populates with links to FASTQ files from a supplied IlluminaProject object.

The directory structure it creates is:

dir/
   fastqs/
   logs/
   ScriptCode/

It also creates an info file with metadata about the project.

Parameters:

illumina_project (IlluminaProject) – (optional) populated IlluminaProject object from which the analysis directory will be populated
fastqs (list) – (optional) list of Fastq files to import
fastq_dir (str) – (optional) name of subdirectory to put Fastq files into; defaults to ‘fastqs’
short_fastq_names (bool) – (optional) if True then transform Fastq file names to be the shortest possible unique names; if False (default) then use the original Fastq names
link_to_fastqs (bool) – (optional) if True then make symbolic links to the Fastq files; if False (default) then make hard links

determine_fastq_format(fastq)

Return type for Fastq file (‘fastq’ or ‘fastqgz’)

Parameters:: fastq (str) – path or name of Fastq file
Returns:: either ‘fastqgz’ or ‘fastq’.
Return type:: String

determine_paired_end(): Return whether or not project has paired end samples

property exists: Check if analysis project directory already exists

property fastqs: Return a list of Fastqs

property fastqs_are_symlinks: Return True if Fastq files are symbolic links, False if not

find_fastqs(dirn)

Return list of Fastq files found in directory

Parameters:: dirn (str) – path to directory to search
Returns:: list of Fastq file names.
Return type:: List

get_sample(name)

Return sample that matches ‘name’

Parameters:: name (str) – name of a sample
Returns:: sample object with the matching name
Return type:: AnalysisSample

Raises: KeyError: if no match is found.

get_samples(pattern)

Return list of samples matching pattern

Parameters:

pattern (str) – simple ‘glob’ style pattern

Returns:

list of samples with names matching the supplied pattern (or an empty list if no names match).

Return type:

List

property is_analysis_dir

Determine if directory really is an analysis project

This is a strict test:

the project must contain Fastqs
the project must contain a valid metadata file

property multiple_fastqs: Determine if there are multiple Fastqs per sample

populate(fastq_dir=None)

Populate data structure from directory contents

Parameters:: fastq_dir (str) – (optional) specify the subdirectory with Fastq files to use for populating the ‘AnalysisProject’

prettyPrintSamples()

Return a nicely formatted string describing the sample names

Wraps a call to ‘pretty_print_names’ function.

Returns:: pretty description of sample names.
Return type:: String

property qc_dir: Return path to default QC outputs directory

property qc_dirs: List QC output directories

qc_info(qc_dir)

Fetch the metadata object for the QC dir

Parameters:

qc_dir (str) – path to QC outputs directory

Returns:

metadata object with the metadata for the QC directory.

Return type:

AnalysisProjectQCDirInfo

sample_summary()

Generate a summary of the sample names

Generates a description string which summarises the number and names of samples in the project.

The description is of the form:

2 samples (PJB1, PJB2)

Returns:: summary of sample names.
Return type:: String

set_primary_fastq_dir(new_primary_fastq_dir)

Update the primary fastq directory for the project

This sets the primary fastq directory (aka primary fastq set) to the specified name, which must be a subdirectory of the project directory.

Updating the primary fastq directory also causes the ‘samples’ metadata item for the project to be updated.

Relative paths are assumed to be subdirectories of the project directory.

Note that it doesn’t change the active fastq set; use the ‘use_fastq_dir’ method to do this.

Parameters:: new_primary_fastq_dir (str) – path to the (sub)directory to be treated as the primary Fastq directory for the project
Raises:: Exception – if specified directory doesn’t exist.

setup_qc_dir(qc_dir=None, fastq_dir=None)

Set up a QC outputs directory

Creates a QC outputs directory with a metadata file ‘qc.info’.

Parameters:

qc_dir (str) – path to QC outputs directory to set up. If a relative path is supplied then it is assumed to be relative to the analysis project directory. If ‘None’ then defaults to the current ‘qc_dir’ for the project.
fastq_dir (str) – set the associated source Fastq directory (optional). If ‘None’ then defaults to the previously associated fastq_dir for the QC dir (or the current ‘fastq_dir’ for the project if that isn’t set).

Returns:

full path to the QC directory.

Return type:

String

Raises:

Exception – if previously stored Fastq source dir doesn’t match the one supplied via ‘fastq_dir’.

use_fastq_dir(fastq_dir=None, strict=True)

Switch fastq directory and repopulate

Switch to a specified source fastq dir, or to the primary fastq dir if none is supplied.

Relative paths are assumed to be subdirectories of the project directory.

Parameters:

fastq_dir (str) – path to the fastq dir to switch to; must be a subdirectory of the project, otherwise an exception is raised unless ‘strict’ is set to False
strict (bool) – if True (the default) then ‘fastq_dir’ must resolve to a subdirectory of the project; otherwise an exception is raised. Setting ‘strict’ to False allows the fastq dir to be outside of the project

Raises:

Exception – if specified directory is not a subdirectory of the project.
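
The ‘strict’ check can be pictured with the following sketch. It assumes POSIX-style paths; `resolve_fastq_dir` is a hypothetical helper written for illustration, and treating the project directory itself as acceptable is an assumption rather than documented behaviour.

```python
import os

def resolve_fastq_dir(project_dir, fastq_dir, strict=True):
    """Hypothetical helper: resolve 'fastq_dir' against the project directory."""
    project_dir = os.path.abspath(project_dir)
    # os.path.join discards project_dir when fastq_dir is already absolute,
    # so relative paths become subdirectories and absolute paths are kept
    full_path = os.path.normpath(os.path.join(project_dir, fastq_dir))
    # The path is "inside" if the project directory is its common ancestor
    inside = os.path.commonpath([project_dir, full_path]) == project_dir
    if strict and not inside:
        raise Exception("'%s' is not a subdirectory of '%s'"
                        % (fastq_dir, project_dir))
    return full_path

print(resolve_fastq_dir("/data/PJB", "fastqs.trimmed"))
print(resolve_fastq_dir("/data/PJB", "/scratch/fastqs", strict=False))
```

With ‘strict’ left at True, the second call would raise an exception because ‘/scratch/fastqs’ lies outside ‘/data/PJB’.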

use_qc_dir(qc_dir)

Switch the default QC outputs directory

Parameters:: qc_dir (str) – path to new default QC outputs directory. If a relative path is supplied then it is assumed to be relative to the analysis project directory.

class auto_process_ngs.analysis.AnalysisSample(name, fastq_attrs=None)

Class describing an analysis sample

An analysis sample consists of a set of Fastq files corresponding to a single sample.

AnalysisSample has the following properties:

name: name of the sample
fastq: list of Fastq files associated with the sample
paired_end: True if sample is paired end, False if not

Note that the ‘fastq’ list will include any index read fastqs (i.e. I1/I2) as well as R1/R2 fastqs.

Parameters:

name (str) – sample name
fastq_attrs (BaseFastqAttrs) – optional, specify a class to use to get attributes from a Fastq file name (e.g. sample name, read number etc). Defaults to ‘AnalysisFastq’.

add_fastq(fastq)

Add a reference to a Fastq file in the sample

Parameters:: fastq (str) – full path for the Fastq file

fastq_subset(read_number=None)

Return a subset of Fastq files from the sample

Note that only R1/R2 files will be returned; index read fastqs (i.e. I1/I2) are excluded regardless of read number.

Parameters:

read_number (int) – select subset based on read_number (1 or 2)

Returns:

list of full paths to Fastq files matching the selection criteria.

Return type:

List
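
A rough sketch of this kind of selection is shown below. It is illustrative only: `subset_fastqs` and the `READ_TAG` pattern are hypothetical, the pattern assumes Illumina-style names such as ‘PJB1_S1_L001_R1_001.fastq.gz’, and returning all non-index files when ‘read_number’ is None is an assumption.

```python
import os
import re

# Assumed naming: ..._R1_001.fastq.gz for reads, ..._I1_001.fastq.gz for index reads
READ_TAG = re.compile(r"_([RI])(\d)(?:_\d+)?\.fastq(?:\.gz)?$")

def subset_fastqs(fastqs, read_number=None):
    """Hypothetical helper: pick R1/R2 files, always skipping index reads."""
    subset = []
    for fq in fastqs:
        match = READ_TAG.search(os.path.basename(fq))
        if match is None:
            continue  # name not recognised by this sketch
        read_type, number = match.group(1), int(match.group(2))
        if read_type == "I":
            continue  # index reads are excluded regardless of read number
        if read_number is None or number == read_number:
            subset.append(fq)
    return sorted(subset)

fastqs = [
    "/d/PJB1_S1_L001_R2_001.fastq.gz",
    "/d/PJB1_S1_L001_R1_001.fastq.gz",
    "/d/PJB1_S1_L001_I1_001.fastq.gz",
]
print(subset_fastqs(fastqs, read_number=1))
```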

property fastqs_are_symlinks: Return True if Fastq files are symlinked, False if not

auto_process_ngs.analysis.copy_analysis_project(project, fastq_dir=None)

Make a copy of an AnalysisProject instance

Parameters:

project (AnalysisProject) – project instance to copy
fastq_dir (str) – if set then specifies the Fastq subdirectory to use in the new instance

Returns:

new AnalysisProject instance which is a copy of the one supplied on input.

Return type:

AnalysisProject

auto_process_ngs.analysis.locate_project(project_id, start_dir=None, ascend=False)

Locate an analysis project

Searches the file system to locate an analysis project which matches the supplied project identifier.

The identifier can be either:

a path to an analysis project (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis/AB’),
a valid project identifier (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis:AB’, ‘HISEQ_201029#123:AB’ etc), or
a project name

Parameters:

project_id (str) – identifier for the project to locate
start_dir (str) – optional path to start searching from (defaults to the current directory)
ascend (bool) – if True then search by ascending into parent directories of ‘start_dir’ (default is to search by descending into its subdirectories)

Returns:

path to the analysis project, or None if the specified project can’t be located.

Return type:

AnalysisProject

auto_process_ngs.analysis.locate_project_info_file(start_dir)

Locate project metadata file

Searches the starting directory and its parents for a project metadata file (‘README.info’), ascending directory levels until either a valid metadata file is found or the root of the filesystem is reached.

Parameters:

start_dir (str) – path of directory to start searching from

Returns:

path to the metadata file, or ‘None’ if no file can be located.

Return type:

String
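
The upward search can be sketched as follows. `find_info_file` is a hypothetical stand-in, not the library function; it only checks that a file with the given name exists, whereas the documented function also requires the metadata file to be valid.

```python
import os

def find_info_file(start_dir, filename="README.info"):
    """Hypothetical helper: ascend from start_dir until 'filename' is found."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, filename)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:
            return None  # reached the filesystem root without a match
        current = parent
```

Calling `find_info_file` on a ‘fastqs’ subdirectory of a project would return the ‘README.info’ path in the project directory, if one exists there.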

auto_process_ngs.analysis.locate_run(run, start_dir=None, ascend=False)

Locate an analysis directory

Searches the file system to locate an analysis directory which matches the supplied run identifier.

The identifier can be any one of:

a path to an analysis directory (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis’)
the name of an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis’)
the name of a sequencing run associated with an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX’)
a run identifier (e.g. ‘HISEQ_201029#123’)

If the run identifier is a wildcard (‘*’) then the first valid analysis directory that is encountered on the search path will be matched; note that this may result in non-deterministic behaviour.

Parameters:

run (str) – identifier for the run to locate
start_dir (str) – optional path to start searching from (defaults to the current directory)
ascend (bool) – if True then search by ascending into parent directories of ‘start_dir’ (default is to search by descending into its subdirectories)

Returns:

path to the analysis directory, or None if the specified analysis directory can’t be located.

Return type:

String

auto_process_ngs.analysis.match_run_id(run, d)

Check if a directory matches a run identifier

The identifier can be any one of:

a path to an analysis directory (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis’)
the name of an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis’)
the name of a sequencing run associated with an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX’)
a run identifier (e.g. ‘HISEQ_201029#123’)

If the identifier is a wildcard (‘*’) then any valid analysis directory will be a match.

Parameters:

run (str) – run identifier
d (str) – path to a run to check against the supplied identifier

Returns:

True if directory matches run ID, False if not.

Return type:

Boolean

auto_process_ngs.analysis.run_id(run_name, platform=None, facility_run_number=None, analysis_number=None)

Return a run ID e.g. ‘HISEQ_140701/242#22’

The run ID is a code that identifies the sequencing run, and has the general form:

PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER[.ANALYSIS_NUMBER]

PLATFORM is always uppercased e.g. HISEQ, MISEQ, GA2X
DATESTAMP is the YYMMDD code e.g. 140701
INSTRUMENT_RUN_NUMBER is the run number that forms part of the run directory e.g. for ‘140701_SN0123_0045_000000000-A1BCD’ it is ‘45’
FACILITY_RUN_NUMBER is the run number that has been assigned by the facility
ANALYSIS_NUMBER is an optional number assigned to the analysis to distinguish it from other analysis attempts (for example, if a run is reprocessed at a later date with updated software)

Note that the instrument run number is only used if it differs from the facility run number.

If the platform isn’t supplied then the instrument name is used instead, e.g.:

SN0123_140701/242#22

If the run name can’t be split into components then the general form will be:

[PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]

depending on whether platform and/or facility run number have been supplied. For example for a run called ‘rag_05_2017’:

MISEQ_rag_05_2017#90

Parameters:

run_name (str) – the run name (can be a path)
platform (str) – the platform name (optional)
facility_run_number (int) – the run number assigned by the local facility (can be different from the instrument run number) (optional)
analysis_number (int) – number assigned to this analysis to distinguish it from other analysis attempts (optional)

Returns:

run ID.

Return type:

String
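
The documented forms can be assembled as in the sketch below. `format_run_id` and the `RUN_NAME` pattern are hypothetical: the run name layout DATESTAMP_INSTRUMENT_RUNNUMBER_FLOWCELL is inferred from the examples above, and the handling of a missing facility run number in the general form (the ‘#’ part is simply omitted) is an assumption.

```python
import os
import re

# Assumed run name layout: DATESTAMP_INSTRUMENT_RUNNUMBER_FLOWCELL
RUN_NAME = re.compile(r"^(\d{6})_([^_]+)_(\d+)_(.+)$")

def format_run_id(run_name, platform=None, facility_run_number=None,
                  analysis_number=None):
    """Hypothetical helper: assemble a run ID following the documented forms."""
    name = os.path.basename(os.path.normpath(run_name))
    match = RUN_NAME.match(name)
    if match is None:
        # Fallback form: [PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]
        run_id = name if platform is None else "%s_%s" % (platform.upper(), name)
        if facility_run_number is not None:
            run_id += "#%s" % facility_run_number
        return run_id
    datestamp, instrument = match.group(1), match.group(2)
    instrument_run_number = int(match.group(3))
    # Use the instrument name when no platform is supplied
    prefix = platform.upper() if platform is not None else instrument
    run_id = "%s_%s" % (prefix, datestamp)
    # Instrument run number only appears if it differs from the facility one
    if instrument_run_number != facility_run_number:
        run_id += "/%d" % instrument_run_number
    if facility_run_number is not None:
        run_id += "#%s" % facility_run_number
    if analysis_number is not None:
        run_id += ".%s" % analysis_number
    return run_id

print(format_run_id("140701_SN0123_0242_000000000-A1BCD",
                    platform="hiseq", facility_run_number=22))
print(format_run_id("rag_05_2017", platform="miseq", facility_run_number=90))
```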

auto_process_ngs.analysis.split_sample_name(s)

Split sample name into numerical and non-numerical parts

Utility function which splits the supplied sample name into numerical (i.e. integer) and non-numerical (i.e. all other types of character) parts, and returns the parts as a list.

For example:

>>> split_sample_name("PJB_01-123_T004")
['PJB_',1,'-',123,'_T',4]

Parameters:

s (str) – the sample name to be split

Returns:

list with the numerical and non-numerical parts of the name.

Return type:

List
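
One way to obtain this kind of split is with a regular expression, as in the sketch below. `split_name_sketch` is a hypothetical re-implementation for illustration and may not match the library function in edge cases.

```python
import re

def split_name_sketch(s):
    """Hypothetical helper: split a name into integer and non-integer parts."""
    parts = []
    # Each match is a (digits, other) pair in which exactly one item is non-empty
    for number, text in re.findall(r"([0-9]+)|([^0-9]+)", s):
        parts.append(int(number) if number else text)
    return parts

print(split_name_sketch("PJB_01-123_T004"))
```

Converting the digit runs to integers means that names can be compared numerically part by part, so for example a sample ending in 2 sorts before one ending in 10.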

auto_process_ngs.analysis.split_sample_reference(s)

Split a ‘[RUN][:PROJECT[/SAMPLE]]’ reference id

Decomposes a reference id of the form:

[RUN][:PROJECT[/SAMPLE]]

where:

RUN is a run identifier (either a run name e.g. ‘201027_SN00284_0000161_AHXXJHJH’, a reference id e.g. ‘HISEQ_201027/161#122’, or a path to an analysis directory)
PROJECT is the name of a project within the run, and
SAMPLE is a sample name.

A subset of elements can be present, in which case the missing components will be returned as None.

Parameters:

s (str) – sample reference identifier

Returns:

tuple of the form (RUN,PROJECT,SAMPLE) extracted from the supplied reference id; missing elements are set to None.

Return type:

Tuple
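
The decomposition can be sketched as below. `split_reference_sketch` is a hypothetical helper, not the library function. It splits on the last ‘:’ before looking for ‘/’, because the RUN part may itself contain ‘/’ (for example ‘HISEQ_201027/161#122’ or a path); how the real function treats unusual inputs is not covered here.

```python
def split_reference_sketch(s):
    """Hypothetical helper: split '[RUN][:PROJECT[/SAMPLE]]' into a tuple."""
    run, sep, rest = s.rpartition(":")
    if not sep:
        # No ':' present, so the whole string is the run identifier
        run, rest = s, ""
    project, _, sample = rest.partition("/")
    # Empty components are reported as None
    return (run or None, project or None, sample or None)

print(split_reference_sketch("HISEQ_201027/161#122:AB/AB1"))
print(split_reference_sketch(":AB"))
```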
