auto_process_ngs.analysis

Classes and functions for handling analysis directories and projects.

Classes:

  • AnalysisFastq: extract information from Fastq name

  • AnalysisDir: API for sequencing run analysis directory

  • AnalysisProject: API for project in an analysis directory

  • AnalysisSample: API for sample in an analysis project

Functions:

  • run_id: fetch run ID for sequencing run

  • split_sample_name: split sample name into components

  • split_sample_reference: split sample reference ID into components

  • match_run_id: check if directory matches run identifier

  • locate_run: search for an analysis directory by ID

  • locate_project: search for an analysis project by ID

  • locate_project_info_file: search for project metadata file (‘README.info’)

  • copy_analysis_project: make copy of an AnalysisProject

class auto_process_ngs.analysis.AnalysisDir(analysis_dir)

Class describing an analysis directory

Conceptually an analysis directory maps onto a sequencing run. It consists of one or more sets of samples from that run, which are represented by subdirectories.

It is also possible to have one or more subdirectories containing outputs from the CASAVA or bclToFastq processing software.

Properties:

  • analysis_dir: full path to the directory

  • run_name: the name of the parent sequencing run

  • metadata: metadata items associated with the run

  • projects: list of AnalysisProject objects

  • undetermined: AnalysisProject object for ‘undetermined’

  • sequencing_data: list of IlluminaData objects

  • projects_metadata: metadata from the ‘projects.info’ file

  • datestamp: datestamp extracted from run name

  • instrument_name: instrument name extracted from run name

  • instrument_run_number: run number extracted from run name

  • n_projects: number of projects

  • n_sequencing_data: number of sequencing data directories

  • paired_end: whether data are paired ended

Parameters:

analysis_dir (str) – name of (and path to) the analysis directory

property analysis_dir

Return the path to the analysis directory

get_projects(pattern=None, include_undetermined=True)

Return the analysis projects in a list

By default returns all projects within the analysis

If the ‘pattern’ is not None then it should be a simple pattern used to match against available names to select a subset of projects (see bcf_utils.name_matches).

If ‘include_undetermined’ is True then the undetermined project will also be included; otherwise it will be omitted.

Parameters:
  • pattern (str) – (optional) glob-style pattern to match project names against

  • include_undetermined (bool) – if True (the default) then include the ‘Undetermined’ project

Returns:

list of AnalysisProject instances.

Return type:

List
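The selection rules described for ‘get_projects’ can be illustrated with a minimal, self-contained sketch. It uses the standard library's fnmatch module as a stand-in for bcf_utils.name_matches (whose exact matching rules are not reproduced here), and ‘select_project_names’ is a hypothetical helper rather than part of the library. It also assumes that the undetermined project is named ‘undetermined’ and is subject to the pattern like any other project.

```python
from fnmatch import fnmatchcase


def select_project_names(names, pattern=None, include_undetermined=True):
    """Hypothetical helper mimicking the selection rules described above."""
    selected = []
    for name in names:
        # Assumption: the undetermined project is called 'undetermined'
        if name == "undetermined" and not include_undetermined:
            continue
        # No pattern means every project is selected
        if pattern is None or fnmatchcase(name, pattern):
            selected.append(name)
    return selected


names = ["AB", "CD", "CE", "undetermined"]
print(select_project_names(names))
print(select_project_names(names, pattern="C*"))
print(select_project_names(names, include_undetermined=False))
```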

property n_projects

Return number of projects found

property n_sequencing_data

Return number of sequencing data dirs found

property paired_end

Return True if run is paired end, False if single end

class auto_process_ngs.analysis.AnalysisFastq(fastq)

Class for extracting information about Fastq files

Given the name of a Fastq file, extract data about the sample name, barcode sequence, lane number, read number and set number.

Uses the IlluminaFastqAttrs class to handle Fastq filenames which consist of a valid Fastq name as defined by IlluminaFastqAttrs, but with additional elements appended.

Instances of this class have the following attributes (defined in the base class):

  • fastq: the original fastq file name

  • basename: basename with NGS extensions stripped

  • extension: full extension e.g. ‘.fastq.gz’

  • sample_name: name of the sample

  • sample_number: integer (or None if no sample number)

  • barcode_sequence: barcode sequence (string or None)

  • lane_number: integer (or None if no lane number)

  • read_number: integer (or None if no read number)

  • set_number: integer (or None if no set number)

  • is_index_read: boolean (True if index read, False if not)

There are four additional attributes:

  • format: string identifying the format of the Fastq name (‘Illumina’, ‘SRA’, or None)

  • implicit_read_number: flag indicating whether the read number was implied (i.e. doesn’t appear explicitly in the name)

  • canonical_name: the ‘canonical’ part of the name (string, or None if no canonical part could be extracted)

  • extras: the ‘extra’ part of the name (string, or None if there was no trailing extra part)

Parameters:

fastq (str) – path or name of Fastq file

property canonical_name

Return the ‘canonical’ part of the name
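To make the attribute list concrete, the following sketch extracts the same kinds of components from a bcl2fastq-style Fastq name. It is a simplified illustration only: ‘parse_illumina_fastq_name’ and the regular expression are hypothetical, they cover just one common Illumina naming layout (with no barcode sequence), and they do not reproduce the handling of SRA names, implicit read numbers, or trailing ‘extras’ that the class provides.

```python
import os
import re

# Simplified pattern for names such as PJB1_S1_L001_R1_001
ILLUMINA_NAME = re.compile(
    r"^(?P<sample_name>.+?)"
    r"_S(?P<sample_number>\d+)"
    r"(?:_L(?P<lane_number>\d+))?"
    r"_(?P<read_type>[RI])(?P<read_number>\d+)"
    r"_(?P<set_number>\d+)$"
)


def parse_illumina_fastq_name(fastq):
    """Hypothetical parser: return a dict of name components, or None."""
    name = os.path.basename(fastq)
    extension = ""
    # Strip the NGS extension to get the basename
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name, extension = name[: -len(ext)], ext
            break
    match = ILLUMINA_NAME.match(name)
    if match is None:
        return None
    lane = match.group("lane_number")
    return {
        "basename": name,
        "extension": extension,
        "sample_name": match.group("sample_name"),
        "sample_number": int(match.group("sample_number")),
        "lane_number": int(lane) if lane is not None else None,
        "read_number": int(match.group("read_number")),
        "set_number": int(match.group("set_number")),
        # Index reads use 'I1'/'I2' in place of 'R1'/'R2'
        "is_index_read": match.group("read_type") == "I",
    }


info = parse_illumina_fastq_name("/data/PJB1_S1_L001_R1_001.fastq.gz")
print(info)
```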

class auto_process_ngs.analysis.AnalysisProject(name, dirn=None, user=None, PI=None, library_type=None, single_cell_platform=None, organism=None, run=None, comments=None, platform=None, sequencer_model=None, fastq_attrs=None, fastq_dir=None)

Class describing an analysis project

Conceptually an analysis project consists of a set of samples from a single sequencing experiment, plus associated data e.g. QC results.

Practically an analysis project is represented by a directory with a set of fastq files.

Provides the following properties:

  • name: name of the project

  • dirn: associated directory (full path)

  • fastq_dirs: list of all subdirectories with fastq files (relative to dirn)

  • fastq_dir: directory with ‘active’ fastq file set (full path)

  • fastqs: list of fastq files in fastq_dir

  • samples: list of AnalysisSample objects generated from fastq_dir

  • multiple_fastqs: True if at least one sample has more than one fastq file per read associated with it

  • fastq_format: either ‘fastqgz’ or ‘fastq’

There is also an ‘info’ property with the following additional properties:

  • run: run name

  • user: user name

  • PI: PI name

  • library_type: library type, either None or e.g. ‘RNA-seq’ etc

  • single_cell_platform: single cell prep platform, either None or ‘ICell8’ etc

  • number of cells: number of cells in single cell projects

  • ICELL8 well list: well list file for ICELL8 single cell projects

  • organism: organism, either None or e.g. ‘Human’ etc

  • platform: sequencing platform, either None or e.g. ‘miseq’ etc

  • comments: additional comments, either None or else string of text

  • paired_end: True if data is paired end, False if not

  • primary_fastq_dir: subdirectory holding the ‘primary’ fastq set

  • sequencer_model: model of sequencer used to generate the data

It is possible for a project to have multiple sets of associated fastq files, held within separate subdirectories of the project directory. A list of subdirectory names with fastq sets can be accessed via the ‘fastq_dirs’ property.

The ‘active’ fastq set defaults to the ‘primary’ set (taken from the ‘primary_fastq_dir’ info property). An alternative active set can be specified using the ‘fastq_dir’ argument when instantiating the AnalysisProject; the active fastq set can also be switched for an existing AnalysisProject using the ‘use_fastq_dir’ method.

The directory holding the primary fastq set is taken from the ‘fastq_dir’ argument of the ‘create_project’ method when creating the project directory (by default this is the ‘fastqs’ subdirectory of the project directory). It can be changed using the ‘set_primary_fastq_dir’ method.

Parameters:
  • name (str) – name of the project (or path to project directory, if ‘dirn’ not supplied)

  • dirn (str) – optional, project directory (can be full or relative path)

  • user (str) – optional, specify name of the user

  • PI (str) – optional, specify name of the principal investigator(s)

  • library_type (str) – optional, specify library type e.g. ‘RNA-seq’, ‘miRNA’ etc

  • single_cell_platform (str) – optional, specify single cell preparation platform e.g. ‘Icell8’, ‘10xGenomics’ etc

  • organism (str) – optional, specify organism e.g. ‘Human’, ‘Mouse’ etc (separate multiple organisms with ‘;’, use ‘?’ if organism is not known)

  • platform (str) – optional, specify sequencing platform e.g. ‘miseq’

  • run (str) – optional, name of the run

  • comments (str) – optional, free text comments associated with the run (separate multiple comments with ‘;’)

  • fastq_attrs (BaseFastqAttrs) – optional, specify a class to use to get attributes from a Fastq file name (e.g. sample name, read number etc). Defaults to ‘AnalysisFastq’.

  • fastq_dir (str) – optional, explicitly specify the subdirectory holding the set of Fastq files to load; defaults to ‘fastq’ (if present) or to the top-level of the project directory (if absent).

create_directory(illumina_project=None, fastqs=None, fastq_dir=None, short_fastq_names=False, link_to_fastqs=False)

Create and populate analysis directory for an IlluminaProject

Creates a new directory corresponding to the AnalysisProject object, and optionally also populates with links to FASTQ files from a supplied IlluminaProject object.

The directory structure it creates is:

dir/
   fastqs/
   logs/
   ScriptCode/

It also creates an info file with metadata about the project.

Parameters:
  • illumina_project (IlluminaProject) – (optional) populated IlluminaProject object from which the analysis directory will be populated

  • fastqs (list) – (optional) list of Fastq files to import

  • fastq_dir (str) – (optional) name of subdirectory to put Fastq files into; defaults to ‘fastqs’

  • short_fastq_names (bool) – (optional) if True then transform Fastq file names to be the shortest possible unique names; if False (default) then use the original Fastq names

  • link_to_fastqs (bool) – (optional) if True then make symbolic links to the Fastq files; if False (default) then make hard links

determine_fastq_format(fastq)

Return type for Fastq file (‘fastq’ or ‘fastqgz’)

Parameters:

fastq (str) – path or name of Fastq file

Returns:

either ‘fastqgz’ or ‘fastq’.

Return type:

String

determine_paired_end()

Return whether or not project has paired end samples

property exists

Check if analysis project directory already exists

property fastqs

Return a list of Fastqs

Return True if Fastq files are symbolic links, False if not

find_fastqs(dirn)

Return list of Fastq files found in directory

Parameters:

dirn (str) – path to directory to search

Returns:

list of Fastq file names.

Return type:

List

get_sample(name)

Return sample that matches ‘name’

Parameters:

name (str) – name of a sample

Returns:

sample object with the matching name

Return type:

AnalysisSample

Raises:

KeyError – if no match is found.

get_samples(pattern)

Return list of samples matching pattern

Parameters:

pattern (str) – simple ‘glob’ style pattern

Returns:

list of samples with names matching the supplied pattern (or an empty list if no names match).

Return type:

List

property is_analysis_dir

Determine if directory really is an analysis project

This is a strict test:

  • the project must contain Fastqs

  • the project must contain a valid metadata file

property multiple_fastqs

Determine if there are multiple Fastqs per sample

populate(fastq_dir=None)

Populate data structure from directory contents

Parameters:

fastq_dir (str) – (optional) specify the subdirectory with Fastq files to use for populating the ‘AnalysisProject’

prettyPrintSamples()

Return a nicely formatted string describing the sample names

Wraps a call to ‘pretty_print_names’ function.

Returns:

pretty description of sample names.

Return type:

String

property qc_dir

Return path to default QC outputs directory

property qc_dirs

List QC output directories

qc_info(qc_dir)

Fetch the metadata object for the QC dir

Parameters:

qc_dir (str) – path to QC outputs directory

Returns:

metadata object with the metadata for the QC directory.

Return type:

AnalysisProjectQCDirInfo

sample_summary()

Generate a summary of the sample names

Generates a description string which summarises the number and names of samples in the project.

The description is of the form:

2 samples (PJB1, PJB2)

Returns:

summary of sample names.

Return type:

String

set_primary_fastq_dir(new_primary_fastq_dir)

Update the primary fastq directory for the project

This sets the primary fastq directory (aka primary fastq set) to the specified name, which must be a subdirectory of the project directory.

Updating the primary fastq directory also causes the ‘samples’ metadata item for the project to be updated.

Relative paths are assumed to be subdirectories of the project directory.

Note that it doesn’t change the active fastq set; use the ‘use_fastq_dir’ method to do this.

Parameters:

new_primary_fastq_dir (str) – path to the (sub)directory to be treated as the primary Fastq directory for the project

Raises:

Exception – if specified directory doesn’t exist.

setup_qc_dir(qc_dir=None, fastq_dir=None)

Set up a QC outputs directory

Creates a QC outputs directory with a metadata file ‘qc.info’.

Parameters:
  • qc_dir (str) – path to QC outputs directory to set up. If a relative path is supplied then it is assumed to be relative to the analysis project directory. If ‘None’ then defaults to the current ‘qc_dir’ for the project.

  • fastq_dir (str) – set the associated source Fastq directory (optional). If ‘None’ then defaults to the previously associated fastq_dir for the QC dir (or the current ‘fastq_dir’ for the project if that isn’t set).

Returns:

full path to the QC directory.

Return type:

String

Raises:

Exception – if previously stored Fastq source dir doesn’t match the one supplied via ‘fastq_dir’.

use_fastq_dir(fastq_dir=None, strict=True)

Switch fastq directory and repopulate

Switch to a specified source fastq dir, or to the primary fastq dir if none is supplied.

Relative paths are assumed to be subdirectories of the project directory.

Parameters:
  • fastq_dir (str) – path to the fastq dir to switch to; must be a subdirectory of the project, otherwise an exception is raised unless ‘strict’ is set to False

  • strict (bool) – if True (the default) then ‘fastq_dir’ must resolve to a subdirectory of the project; otherwise an exception is raised. Setting ‘strict’ to False allows the fastq dir to be outside of the project

Raises:

Exception – if specified directory is not a subdirectory of the project.
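The ‘strict’ check can be pictured as a path-containment test. The sketch below is an assumption about how such a test might be written using only os.path; ‘resolve_fastq_dir’ is a hypothetical helper, not the method's actual implementation, and it raises ValueError where the documentation above only specifies a generic Exception.

```python
import os


def resolve_fastq_dir(project_dir, fastq_dir, strict=True):
    """Hypothetical helper: resolve 'fastq_dir' against 'project_dir'."""
    project_dir = os.path.abspath(project_dir)
    # Relative paths are treated as subdirectories of the project
    if not os.path.isabs(fastq_dir):
        fastq_dir = os.path.join(project_dir, fastq_dir)
    fastq_dir = os.path.normpath(fastq_dir)
    if strict:
        # The resolved path must sit at or below the project directory
        common = os.path.commonpath([project_dir, fastq_dir])
        if common != project_dir:
            raise ValueError(
                "'%s' is not a subdirectory of '%s'" % (fastq_dir, project_dir)
            )
    return fastq_dir


print(resolve_fastq_dir("/data/AB", "fastqs"))
print(resolve_fastq_dir("/data/AB", "/data/other", strict=False))
```

With ‘strict’ left at its default, a path such as ‘../other’ is rejected because it normalises to a location outside the project directory.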

use_qc_dir(qc_dir)

Switch the default QC outputs directory

Parameters:

qc_dir (str) – path to new default QC outputs directory. If a relative path is supplied then it is assumed to be relative to the analysis project directory.

class auto_process_ngs.analysis.AnalysisSample(name, fastq_attrs=None)

Class describing an analysis sample

An analysis sample consists of a set of Fastq files corresponding to a single sample.

AnalysisSample has the following properties:

  • name: name of the sample

  • fastq: list of Fastq files associated with the sample

  • paired_end: True if sample is paired end, False if not

Note that the ‘fastq’ list will include any index read fastqs (i.e. I1/I2) as well as R1/R2 fastqs.

Parameters:
  • name (str) – sample name

  • fastq_attrs (BaseFastqAttrs) – optional, specify a class to use to get attributes from a Fastq file name (e.g. sample name, read number etc). Defaults to ‘AnalysisFastq’.

add_fastq(fastq)

Add a reference to a Fastq file in the sample

Parameters:

fastq (str) – full path for the Fastq file

fastq_subset(read_number=None)

Return a subset of Fastq files from the sample

Note that only R1/R2 files will be returned; index read fastqs (i.e. I1/I2) are excluded regardless of read number.

Parameters:

read_number (int) – select subset based on read_number (1 or 2)

Returns:

list of full paths to Fastq files matching the selection criteria.

Return type:

List

Return True if Fastq files are symlinked, False if not

auto_process_ngs.analysis.copy_analysis_project(project, fastq_dir=None)

Make a copy of an AnalysisProject instance

Parameters:
  • project (AnalysisProject) – project instance to copy

  • fastq_dir (str) – if set then specifies the Fastq subdirectory to use in the new instance

Returns:

new AnalysisProject instance which is a copy of the one supplied on input.

Return type:

AnalysisProject

auto_process_ngs.analysis.locate_project(project_id, start_dir=None, ascend=False)

Locate an analysis project

Searches the file system to locate an analysis project which matches the supplied project identifier.

The identifier can be any one of:

  • a path to an analysis project (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis/AB’)

  • a valid project identifier (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis:AB’, ‘HISEQ_201029#123:AB’ etc)

  • a project name

Parameters:
  • project_id (str) – identifier for the project to locate

  • start_dir (str) – optional path to start searching from (defaults to the current directory)

  • ascend (bool) – if True then search by ascending into parent directories of ‘start_dir’ (default is to search by descending into its subdirectories)

Returns:

path to the analysis project, or None if the specified project can’t be located.

Return type:

AnalysisProject

auto_process_ngs.analysis.locate_project_info_file(start_dir)

Locate project metadata file

Searches the starting directory and its parents for a project metadata file (‘README.info’), ascending through directory levels until either a valid metadata file is found or the root of the filesystem is reached.

Parameters:

start_dir (str) – path of directory to start searching from

Returns:

path to the metadata file, or ‘None’ if no file can be located.

Return type:

String
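The ascending search can be sketched with the standard library alone. ‘find_info_file’ is a hypothetical stand-in for the function described above; in particular it treats the mere existence of the file as sufficient, whereas the real function is documented as looking for a valid metadata file.

```python
import os


def find_info_file(start_dir, filename="README.info"):
    """Hypothetical sketch: ascend from 'start_dir' looking for 'filename'."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, filename)
        # Assumption: existence stands in for the validity check
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the root of the filesystem without a match
            return None
        current = parent


print(find_info_file(os.getcwd()))
```

The loop terminates because the parent of the filesystem root is the root itself, which is the condition used to stop the ascent.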

auto_process_ngs.analysis.locate_run(run, start_dir=None, ascend=False)

Locate an analysis directory

Searches the file system to locate an analysis directory which matches the supplied run identifier.

The identifier can be any one of:

  • a path to an analysis directory (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis’)

  • the name of an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis’)

  • the name of a sequencing run associated with an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX’)

  • a run identifier (e.g. ‘HISEQ_201029#123’)

If the run identifier is a wildcard (‘*’) then the first valid analysis directory that is encountered on the search path will be matched; note that this may result in non-deterministic behaviour.

Parameters:
  • run (str) – identifier for the run to locate

  • start_dir (str) – optional path to start searching from (defaults to the current directory)

  • ascend (bool) – if True then search by ascending into parent directories of ‘start_dir’ (default is to search by descending into its subdirectories)

Returns:

path to the analysis directory, or None if the specified analysis directory can’t be located.

Return type:

String

auto_process_ngs.analysis.match_run_id(run, d)

Check if a directory matches a run identifier

The identifier can be any one of:

  • a path to an analysis directory (e.g. ‘/path/to/201029_SN01234_0000123_AHXXXX_analysis’)

  • the name of an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX_analysis’)

  • the name of a sequencing run associated with an analysis directory (e.g. ‘201029_SN01234_0000123_AHXXXX’)

  • a run identifier (e.g. ‘HISEQ_201029#123’)

If the identifier is a wildcard (‘*’) then any valid analysis directory will be a match.

Parameters:
  • run (str) – run identifier

  • d (str) – path to a run to check against the supplied identifier

Returns:

True if directory matches run ID, False if not.

Return type:

Boolean

auto_process_ngs.analysis.run_id(run_name, platform=None, facility_run_number=None, analysis_number=None)

Return a run ID e.g. ‘HISEQ_140701/242#22’

The run ID is a code that identifies the sequencing run, and has the general form:

PLATFORM_DATESTAMP[/INSTRUMENT_RUN_NUMBER]#FACILITY_RUN_NUMBER[.ANALYSIS_NUMBER]

  • PLATFORM is always uppercased e.g. HISEQ, MISEQ, GA2X

  • DATESTAMP is the YYMMDD code e.g. 140701

  • INSTRUMENT_RUN_NUMBER is the run number that forms part of the run directory e.g. for ‘140701_SN0123_0045_000000000-A1BCD’ it is ‘45’

  • FACILITY_RUN_NUMBER is the run number that has been assigned by the facility

  • ANALYSIS_NUMBER is an optional number assigned to the analysis to distinguish it from other analysis attempts (for example, if a run is reprocessed at a later date with updated software)

Note that the instrument run number is only used if it differs from the facility run number.

If the platform isn’t supplied then the instrument name is used instead, e.g.:

SN0123_140701/242#22

If the run name can’t be split into components then the general form will be:

[PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]

depending on whether platform and/or facility run number have been supplied. For example for a run called ‘rag_05_2017’:

MISEQ_rag_05_2017#90

Parameters:
  • run_name (str) – the run name (can be a path)

  • platform (str) – the platform name (optional)

  • facility_run_number (int) – the run number assigned by the local facility (can be different from the instrument run number) (optional)

  • analysis_number (int) – number assigned to this analysis to distinguish it from other analysis attempts (optional)

Returns:

run ID.

Return type:

String
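The rules for assembling a run ID can be followed step by step in the sketch below. ‘build_run_id’ and the regular expression are hypothetical and only illustrate the documented format; they are not the library's implementation. Two behaviours are assumptions not covered by the description above: when no facility run number is supplied the instrument run number alone is appended, and the analysis number is ignored when the run name cannot be split into components.

```python
import os
import re

# Assumed run name layout: DATESTAMP_INSTRUMENT_RUNNUMBER_REST
RUN_NAME = re.compile(
    r"^(?P<datestamp>\d{6})_(?P<instrument>[^_]+)_(?P<run_number>\d+)_.+$"
)


def build_run_id(run_name, platform=None, facility_run_number=None,
                 analysis_number=None):
    """Hypothetical sketch of the run ID format described above."""
    # The run name can be a path, so keep only the final component
    name = os.path.basename(os.path.normpath(run_name))
    match = RUN_NAME.match(name)
    if match is None:
        # Fallback form: [PLATFORM_]RUN_NAME[#FACILITY_RUN_NUMBER]
        run_id = name if platform is None else "%s_%s" % (platform.upper(), name)
        if facility_run_number is not None:
            run_id += "#%s" % facility_run_number
        return run_id
    # Use the instrument name when no platform is supplied
    prefix = platform.upper() if platform is not None else match.group("instrument")
    run_id = "%s_%s" % (prefix, match.group("datestamp"))
    instrument_run_number = int(match.group("run_number"))
    if facility_run_number is None:
        # Assumption: fall back to the instrument run number alone
        run_id += "/%d" % instrument_run_number
    else:
        # Instrument run number only appears if it differs
        if instrument_run_number != facility_run_number:
            run_id += "/%d" % instrument_run_number
        run_id += "#%d" % facility_run_number
    if analysis_number is not None:
        run_id += ".%d" % analysis_number
    return run_id


run = "140701_SN0123_0045_000000000-A1BCD"
print(build_run_id(run, platform="hiseq", facility_run_number=22))
print(build_run_id("rag_05_2017", platform="miseq", facility_run_number=90))
```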

auto_process_ngs.analysis.split_sample_name(s)

Split sample name into numerical and non-numerical parts

Utility function which splits the supplied sample name into numerical (i.e. integer) and non-numerical (i.e. all other types of character) parts, and returns the parts as a list.

For example:

>>> split_sample_name("PJB_01-123_T004")
['PJB_',1,'-',123,'_T',4]

Parameters:

s (str) – the sample name to be split

Returns:

list with the numerical and non-numerical parts of the name.

Return type:

List
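The splitting behaviour can be reproduced with a short regular expression. ‘split_name’ below is a hypothetical re-implementation written for illustration, assuming that runs of digits become integers and everything else is kept as strings; it is not the library's own code.

```python
import re


def split_name(s):
    """Hypothetical re-implementation of the splitting behaviour."""
    parts = []
    # Alternate between runs of digits and runs of non-digits
    for token in re.findall(r"\d+|\D+", s):
        parts.append(int(token) if token.isdigit() else token)
    return parts


print(split_name("PJB_01-123_T004"))

# One plausible use: sort names so that 'PJB2' comes before 'PJB10'
print(sorted(["PJB10", "PJB2", "PJB1"], key=split_name))
```

Sorting with such a key only works when the names being compared share the same alternation of text and numbers; otherwise Python would be asked to compare an integer with a string.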

auto_process_ngs.analysis.split_sample_reference(s)

Split a ‘[RUN][:PROJECT[/SAMPLE]]’ reference id

Decomposes a reference id of the form:

[RUN][:PROJECT[/SAMPLE]]

where:

  • RUN is a run identifier (either a run name e.g. ‘201027_SN00284_0000161_AHXXJHJH’, a reference id e.g. ‘HISEQ_201027/161#122’, or a path to an analysis directory)

  • PROJECT is the name of a project within the run, and

  • SAMPLE is a sample name.

A subset of elements can be present, in which case the missing components will be returned as None.

Parameters:

s (str) – sample reference identifier

Returns:

tuple of the form (RUN,PROJECT,SAMPLE) extracted from the supplied reference id; missing elements are set to None.

Return type:

Tuple
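A literal reading of the ‘[RUN][:PROJECT[/SAMPLE]]’ form can be sketched as follows. ‘split_reference’ is a hypothetical parser, not the library function: it assumes the project part never contains ‘:’ and so splits on the last ‘:’ in the string, which lets the RUN part contain ‘/’ characters (as in ‘HISEQ_201027/161#122’ or a path).

```python
def split_reference(s):
    """Hypothetical parser for '[RUN][:PROJECT[/SAMPLE]]' references."""
    # Split on the last ':' so that RUN may itself contain '/' characters
    run, sep, rest = s.rpartition(":")
    if not sep:
        # No ':' present, so the whole string is the RUN part
        run, rest = s, ""
    # Whatever follows the ':' is PROJECT, optionally followed by /SAMPLE
    project, _, sample = rest.partition("/")
    # Empty components are reported as None
    return (run or None, project or None, sample or None)


print(split_reference("HISEQ_201027/161#122:AB/AB1"))
print(split_reference(":AB"))
print(split_reference("201027_SN00284_0000161_AHXXJHJH"))
```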