auto_process_ngs.metadata

metadata

Classes for storing, accessing and updating metadata for analysis directories, projects and so on.

Classes:

  • MetadataDict:

  • AnalysisDirParameters:

  • AnalysisDirMetadata:

  • AnalysisProjectInfo:

  • ProjectMetadataFile:

  • AnalysisProjectQCDirInfo:

class auto_process_ngs.metadata.AnalysisDirMetadata(filen=None)

Class for storing metadata about an analysis directory

Provides a set of data items representing metadata about the current analysis, which are loaded from and saved to an external file.

The metadata items are:

  • run_name: name of the run

  • run_number: run number assigned by local facility

  • analysis_number: arbitrary number assigned to analysis (to distinguish it from other analysis attempts)

  • source: source of the data (e.g. local facility)

  • platform: sequencing platform e.g. ‘miseq’

  • processing_software: dictionary of software packages used in the processing

  • bcl2fastq_software: info on the Bcl conversion software used (deprecated)

  • cellranger_software: info on the 10xGenomics cellranger software used (deprecated)

  • instrument_name: name/i.d. for the sequencing instrument

  • instrument_datestamp: datestamp from the sequencing instrument

  • instrument_run_number: the run number from the sequencing instrument

  • instrument_flow_cell_id: the flow cell ID from the sequencing instrument

  • sequencer_model: the model of the sequencing instrument

  • flow_cell_mode: the flow cell configuration (if present in the run parameters)

  • run_configuration: read names & lengths derived from RunInfo.xml

  • default_bases_mask: default bases mask derived from RunInfo.xml

class auto_process_ngs.metadata.AnalysisDirParameters(filen=None)

Class for storing parameters in an analysis directory

Provides a set of data items representing parameters for the current analysis, which are loaded from and saved to an external file.

The parameter data items are:

  • analysis_dir: path to the analysis directory

  • data_dir: path to the directory holding the raw sequencing data

  • platform: sequencing platform e.g. ‘miseq’

  • sample_sheet: path to the customised SampleSheet.csv file

  • bases_mask: bases mask string

  • project_metadata: name of the project metadata file

  • primary_data_dir: directory used to hold copies of primary data

  • acquired_primary_data: whether primary data has been copied

  • unaligned_dir: output directory for bcl2fastq conversion

  • barcode_analysis_dir: directory holding barcode analysis outputs

  • stats_file: name of file with per-fastq statistics

  • per_lane_stats_file: name of file with per-lane statistics

class auto_process_ngs.metadata.AnalysisProjectInfo(filen=None)

Class for storing metadata in an analysis project

Provides a set of metadata items which are loaded from and saved to an external file.

The data items are:

  • name: the project name

  • run: the name of the sequencing run

  • platform: the sequencing platform name e.g. ‘miseq’

  • sequencer_model: the sequencer model e.g. ‘MiSeq’

  • user: the user associated with the project

  • PI: the principal investigator associated with the project

  • organism: the organism associated with the project

  • library_type: the library type e.g. ‘RNA-seq’

  • single_cell_platform: the single cell preparation platform

  • number of cells: number of cells in single cell projects

  • ICELL8 well list: well list file for ICELL8 single cell projects

  • paired_end: True if the data is paired end, False if not

  • primary_fastq_dir: the primary subdir with FASTQ files

  • samples: textual description of the samples in the project

  • biological_samples: comma-separated sample names with biological data

  • multiplexed_samples: comma-separated names of multiplexed samples

  • comments: free-text comments

class auto_process_ngs.metadata.AnalysisProjectQCDirInfo(filen=None)

Class for storing metadata for a QC output directory

Provides a set of metadata items which are loaded from and saved to an external file.

The data items are:

  • fastq_dir: the name of the associated Fastq subdirectory

  • fastqs: list of the Fastq files (without leading paths)

  • protocol: name of the QC protocol used

  • organism: the organism(s) that the QC was run with

  • seq_data_samples: samples with sequence (i.e. biological) data

  • cellranger_version: version of cellranger/10x software used

  • cellranger_refdata: reference datasets used with cellranger

  • cellranger_probeset: probe set used with cellranger

  • fastq_screens: names of panels used with fastq_screen

  • star_index: index used by STAR

  • annotation_bed: BED file with gene annotation

  • annotation_gtf: GTF file with gene annotation

  • protocol_summary: free-text summary of the QC protocol

  • protocol_specification: full QC protocol specification

class auto_process_ngs.metadata.MetadataDict(attributes={}, order=None, filen=None)

Class for storing metadata in an analysis project

Provides storage for arbitrary data items in the form of key-value pairs, which can be saved to and loaded from an external file.

The data items are defined on instantiation via a dictionary supplied to the ‘attributes’ argument. For example:

Create a new metadata object:

    >>> metadata = MetadataDict(attributes={'salutation': 'Salutation',
    ...                                     'valediction': 'Valediction'})

The dictionary keys correspond to the keys in the MetadataDict object; the corresponding values are the keys that are used when saving and loading the data to and from a file.

Set attributes:

    >>> metadata['salutation'] = 'hello'
    >>> metadata['valediction'] = 'goodbye'

Retrieve values:

    >>> print("Salutation is %s" % metadata.salutation)

Save to file:

    >>> metadata.save('metadata.tsv')

Load data from a file:

    >>> metadata = MetadataDict(filen='metadata.tsv')

or

    >>> metadata = MetadataDict()
    >>> metadata.load('metadata.tsv')

List items with ‘null’ values:

    >>> metadata.null_items()

The external file storage is intended to be readable by humans so longer names are used to describe the keys; also Python None values are stored as ‘.’, and True and False values are stored as ‘Y’ and ‘N’ respectively. These values are automatically converted back to the Python equivalents on reload.
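The value conversions described above can be illustrated with a small standalone sketch. The helper names `encode_value` and `decode_value` are hypothetical and are not part of `auto_process_ngs`; the library's own implementation may handle further cases.

```python
def encode_value(value):
    """Convert a Python value to its on-file string form (sketch)."""
    if value is None:
        return '.'
    if value is True:
        return 'Y'
    if value is False:
        return 'N'
    return str(value)


def decode_value(text):
    """Convert an on-file string back to a Python value (sketch)."""
    if text == '.':
        return None
    if text == 'Y':
        return True
    if text == 'N':
        return False
    return text


# Round-trip a few representative values
stored = [encode_value(v) for v in (None, True, False, 'miseq')]
restored = [decode_value(s) for s in stored]
```

In this sketch a literal string ‘Y’, ‘N’ or ‘.’ would also be decoded to the corresponding Python value, which is a natural consequence of using plain-text placeholders.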

keys_in_file()

Return a list of the key names found explicitly in the file

load(filen, strict=True, fail_on_error=False, enable_fallback=False)

Load key-value pairs from a tab-delimited file

Loads the key-value pairs from a previously created tab-delimited file written by the ‘save’ method.

Note that this overwrites any existing values already assigned to keys within the metadata object.

Parameters:
  • filen (str) – name of the tab-delimited file with key-value pairs

  • strict (bool) – if True (default) then discard items in the input file which are missing from the definition; if False then add them to the definition.

  • fail_on_error (bool) – if True then raise an exception if the file contains invalid content (if ‘strict’ is also specified then this includes any unrecognised keys); default is to warn and then ignore these errors.
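To make the effect of ‘strict’ concrete, the following standalone sketch parses tab-delimited key-value lines against a known set of file keys. It is not the library's code: `load_pairs` and its arguments are hypothetical names, and warnings and ‘fail_on_error’ handling are left out.

```python
def load_pairs(lines, definition, strict=True):
    """Parse 'key<TAB>value' lines into a dict (illustrative sketch).

    'definition' maps on-file key names to attribute names; this
    layout is an assumption made for the example.
    """
    definition = dict(definition)  # copy so the caller's mapping is untouched
    data = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        file_key, _, value = line.partition('\t')
        if file_key not in definition:
            if strict:
                # strict: discard items missing from the definition
                continue
            # non-strict: add the unrecognised key to the definition
            definition[file_key] = file_key
        data[definition[file_key]] = value
    return data, definition


lines = ["Salutation\thello\n", "Unknown\tx\n"]
known = {'Salutation': 'salutation'}
strict_data, _ = load_pairs(lines, known, strict=True)
loose_data, loose_def = load_pairs(lines, known, strict=False)
```

With ‘strict’ enabled only the recognised key survives; with it disabled the unrecognised key is kept and becomes part of the definition.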

null_items()

Return a list of data items with ‘null’ values

save(filen=None)

Save metadata to tab-delimited file

Writes key-value pairs to a tab-delimited file. The data can be recovered using the ‘load’ method.

Note that if the specified file already exists then it will be overwritten.

Parameters:

filen – name of the tab-delimited file with key-value pairs; if None then the file specified when the object was instantiated will be used instead.

class auto_process_ngs.metadata.ProjectMetadataFile(filen=None)

File containing metadata about multiple projects in analysis dir

The file consists of a header line plus one line per project with the following tab-delimited fields:

  • Project: name of the project

  • Samples: list/description of sample names

  • User: name(s) of the associated user(s)

  • Library: the library type

  • Single cell platform: single-cell preparation platform (e.g. ‘ICELL8’)

  • Organism: name(s) of the organism(s)

  • PI: name(s) of the associated principal investigator(s)

  • Comments: free text containing additional information about the project

Any fields set to None will be written to file with a ‘.’ placeholder.
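As an illustration of this layout, the sketch below builds a header line and one project line, using ‘.’ for unset fields. It is a standalone approximation rather than the class's own writer: the `format_project_line` helper, the exact header text and the comma-joined sample list are all assumptions.

```python
# Field names in the order listed above
FIELDS = ('Project', 'Samples', 'User', 'Library',
          'Single cell platform', 'Organism', 'PI', 'Comments')


def format_project_line(values):
    """Return one tab-delimited line, writing '.' for missing/None fields."""
    cells = []
    for field in FIELDS:
        value = values.get(field)
        cells.append('.' if value is None else str(value))
    return '\t'.join(cells)


header = '\t'.join(FIELDS)
line = format_project_line({'Project': 'PJB',
                            'Samples': 'PJB1,PJB2',
                            'Library': 'RNA-seq'})
```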

add_project(project_name, sample_names, **kws)

Add information about a project into the file

Parameters:
  • project_name (str) – name of the new project

  • sample_names (list) – Python list of sample names

  • user (str) – (optional) user name(s)

  • library_type (str) – (optional) library type

  • sc_platform (str) – (optional) single-cell prep platform

  • organism (str) – (optional) organism(s)

  • PI (str) – (optional) principal investigator name(s)

  • comments (str) – (optional) additional information about the project

lookup(project_name)

Return data for line with specified project name

Leading comment characters (i.e. ‘#’) are ignored when performing the lookup.
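The matching rule can be sketched as follows. This is a hypothetical standalone function, not the method itself; in particular, returning None when nothing matches is an assumption, and the real method may signal a miss differently.

```python
def lookup_project(lines, project_name):
    """Return the fields of the first line whose project name matches."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Ignore leading '#' characters when comparing names
        if fields[0].lstrip('#') == project_name:
            return fields
    return None  # assumption: behaviour on a miss is not documented here


lines = ["PJB\tPJB1,PJB2\t.\n", "#AB\tAB1\t.\n"]
found = lookup_project(lines, 'AB')
missing = lookup_project(lines, 'XY')
```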

save(filen=None)

Save the data back to file

Parameters:

filen – name of the file to save to (if not specified then defaults to the same file as data was read in from)

update_project(project_name, **kws)

Update information about a project in the file

Parameters:
  • project_name (str) – name of the project to update

  • sample_names (list) – (optional) Python list of new sample names

  • user (str) – (optional) new user name(s)

  • library_type (str) – (optional) new library type

  • sc_platform (str) – (optional) single-cell prep platform

  • organism (str) – (optional) new organism(s)

  • PI (str) – (optional) new principal investigator name(s)

  • comments (str) – (optional) new additional information about the project