auto_process_ngs.utils

Utility classes and functions to support the auto_process_ngs module.

Classes:

  • OutputFiles:

  • BufferedOutputFiles:

  • ZipArchive:

  • ProgressChecker:

  • FileLock:

  • FileLockError:

  • Location: extracts information from a location specifier

Functions:

  • fetch_file:

  • bases_mask_is_paired_end:

  • get_organism_list:

  • normalise_organism_name:

  • split_user_host_dir:

  • get_numbered_subdir:

  • find_executables:

  • parse_version:

  • parse_samplesheet_spec:

  • pretty_print_rows:

  • sort_sample_names:

  • write_script_file:

  • edit_file:

  • paginate:

class auto_process_ngs.utils.BufferedOutputFiles(base_dir=None, bufsize=8192, max_open_files=100)

Class for managing multiple output files with buffering

Version of the ‘OutputFiles’ class which buffers writing of data, to reduce the number of underlying write operations.

Usage is similar to OutputFiles, with additional ‘bufsize’ argument which can be used to set the buffer size to use.
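
The buffering idea can be illustrated with a minimal, self-contained sketch. This is not the class's actual implementation: `BufferedWriter`, its `flushes` counter and its internals are hypothetical names, and the sketch only assumes that content is held in memory and written out once roughly `bufsize` characters are pending.

```python
import os
import tempfile


class BufferedWriter:
    """Hypothetical stand-in that buffers lines in memory per handle."""

    def __init__(self, bufsize=8192):
        self.bufsize = bufsize
        self._paths = {}    # handle -> file path
        self._pending = {}  # handle -> list of buffered lines
        self.flushes = 0    # number of real write operations performed

    def open(self, name, filen):
        self._paths[name] = filen
        self._pending[name] = []
        # Create (or truncate) the file up front
        open(filen, "wt").close()

    def write(self, name, s):
        # Buffer the newline-terminated string; flush once enough accumulates
        self._pending[name].append(s + "\n")
        if sum(len(x) for x in self._pending[name]) >= self.bufsize:
            self._flush(name)

    def _flush(self, name):
        if self._pending[name]:
            with open(self._paths[name], "at") as fp:
                fp.write("".join(self._pending[name]))
            self._pending[name] = []
            self.flushes += 1

    def close(self):
        # Write out anything still buffered
        for name in self._paths:
            self._flush(name)


# Demo with a deliberately tiny buffer
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "out.txt")
writer = BufferedWriter(bufsize=10)
writer.open("f", demo_path)
writer.write("f", "abc")    # 4 characters pending: below the threshold
writer.write("f", "defgh")  # 10 characters pending: triggers a flush
writer.write("f", "x")
writer.close()              # flushes whatever is left
with open(demo_path) as fp:
    demo_text = fp.read()
```

Three calls to `write` result in only two underlying write operations here; with a realistic buffer size the reduction would be much larger.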

close(name=None)

Close one or all open files

If a ‘name’ is specified then only the file matching that handle will be closed; with no arguments all open files will be closed.

open(name, filen=None, append=False)

Open a new output file

‘name’ is the handle used to reference the file when using the ‘write’ and ‘close’ methods.

‘filen’ is the name of the file, and is unrelated to the handle. If not supplied then ‘name’ must be associated with a previously closed file (which will be reopened).

If the filename ends with ‘.gz’ then the associated file will automatically be written as a gzip-compressed file.

If ‘append’ is True then append to an existing file rather than overwriting (i.e. use mode ‘at’ instead of ‘wt’).

write(name, s)

Write content to file (newline-terminated)

Writes ‘s’ as a newline-terminated string to the file that is referenced with the handle ‘name’.

class auto_process_ngs.utils.FileLock(f, timeout=None)

Class for locking filesystem objects across processes

Usage:

>>> # Make new FileLock instance for cwd
>>> lock = FileLock('.')
>>> # Check that instance doesn't hold the lock
>>> lock.has_lock
False
>>> # Acquire the lock
>>> lock.acquire()
>>> lock.has_lock
True
>>> # Release the lock
>>> lock.release()

If another FileLock instance holds the lock (within this process, or within another) then FileLock.acquire() will immediately raise a FileLockError. If a timeout period is specified, the instance will instead keep retrying until it either acquires the lock or the timeout period is exceeded (in which case a FileLockError is also raised).

The FileLock class can also be used as a context manager, e.g.

>>> with FileLock('.'):
...   # Lock pwd while you do stuff

Uses the fcntl module (see https://docs.python.org/3.6/library/fcntl.html, https://stackoverflow.com/a/32650956 and https://stackoverflow.com/a/55011593)
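
The fail-fast and retry-until-timeout behaviour described above can be sketched directly with the standard library. This is an illustration only, not the FileLock source: it assumes a non-blocking `fcntl.flock` call is the underlying mechanism, and `try_lock`, `unlock` and the `interval` parameter are hypothetical names.

```python
import fcntl
import os
import tempfile
import time


def try_lock(path, timeout=None, interval=0.05):
    """Return an open descriptor holding an exclusive lock on 'path'.

    Raises BlockingIOError if the lock can't be acquired: immediately
    when no timeout is given, otherwise once the timeout has elapsed.
    """
    fd = os.open(path, os.O_RDONLY)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock fail straight away instead of blocking
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if deadline is None or time.monotonic() >= deadline:
                os.close(fd)
                raise
            time.sleep(interval)


def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


# Demo: a second attempt fails while the first lock is held
tmp_fd, lock_target = tempfile.mkstemp()
os.close(tmp_fd)
first = try_lock(lock_target)
try:
    try_lock(lock_target)
    second_failed = False
except BlockingIOError:
    second_failed = True
unlock(first)

# Once released, the lock can be taken again
again = try_lock(lock_target, timeout=1.0)
unlock(again)
os.remove(lock_target)
```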

Parameters:
  • f (str) – path to filesystem object (file or directory) to lock

  • timeout (float) – optional, specifies a timeout period after which failure to acquire the lock raises an exception

acquire(timeout=None)

Acquire the lock on the filesystem object

Parameters:

timeout (float) – optional, specifies a timeout period after which failure to acquire the lock raises an exception (NB overrides timeout set on instantiation).

property has_lock

Check if the FileLock instance holds the lock

release()

Release the lock on the filesystem object

exception auto_process_ngs.utils.FileLockError

Exceptions associated with the FileLock class

class auto_process_ngs.utils.Location(location)

Class for examining a file-system location specifier

A location specifier can be a local or a remote file or directory. The general form is:

[[user@]server:]path

For a local location, only the ‘path’ component needs to be supplied.

For a remote location, ‘server’ and ‘path’ must be supplied, while ‘user’ is optional.

Alternatively the location can be a URL identifier of the form:

protocol://server/path

The following properties are available:

  • user: the user name (or None if not specified)

  • server: the server name (or None if not specified)

  • path: the path component

  • is_remote: True if the location is on a remote host, False if it is local (or if it is a URL)

  • is_url: True if the location points to a URL

  • url: the URL identifier, if the location points to a URL (or None if not a URL)

  • protocol: the URL protocol (or None if not a URL)
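
As a rough illustration of how the two specifier forms break down into these components, the sketch below splits a string by hand. It does not use the Location class: `split_location` is a hypothetical helper, and the way URL components are assigned is an assumption for the example.

```python
import re


def split_location(location):
    """Split a specifier into user/server/path/protocol components.

    Hypothetical helper; absent components are returned as None.
    """
    url = re.match(r"^([a-zA-Z][a-zA-Z0-9+.-]*)://([^/]+)(/.*)?$", location)
    if url:
        # 'protocol://server/path' form
        return {"user": None, "server": url.group(2),
                "path": url.group(3) or "", "protocol": url.group(1),
                "is_url": True}
    # '[[user@]server:]path' form
    user = server = None
    path = location
    if ":" in location:
        server, path = location.split(":", 1)
        if "@" in server:
            user, server = server.split("@", 1)
    return {"user": user, "server": server, "path": path,
            "protocol": None, "is_url": False}


remote = split_location("jbloggs@example.org:/data/run1")
local = split_location("/data/run1")
web = split_location("https://example.org/data/run1.tgz")
```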

Parameters:

location (str) – location specifier of the form ‘[[user@]server:]path’

property is_remote

Check if location is on a remote server

property is_url

Check if location is a URL

property path

Return ‘path’ part of ‘[[user@]server:]path’

property protocol

Return URL protocol (or None if not a URL)

property server

Return ‘server’ part of ‘[[user@]server:]path’

property url

Return path as a URL (or None if not a URL)

property user

Return ‘user’ part of ‘[[user@]server:]path’

class auto_process_ngs.utils.OutputFiles(base_dir=None)

Class for managing multiple output files

Usage:

Create a new OutputFiles instance:

>>> fp = OutputFiles()

Set up files against keys:

>>> fp.open('file1','first_file.txt')
>>> fp.open('file2','second_file.txt')

Write content to files:

>>> fp.write('file1','some content for first file')
>>> fp.write('file2','content for second file')

Append content to an existing file:

>>> fp.open('file3','third_file.txt',append=True)
>>> fp.write('file3','appended content')

Check if key exists and associated file handle is available for writing:

>>> 'file1' in fp
True
>>> 'file5' in fp
False

Finish and close all open files:

>>> fp.close()

Reopen and append to a previously opened and closed file:

>>> fp.open('file4','fourth_file.txt')
>>> fp.write('file4','some content')
>>> fp.close('file4')
>>> fp.open('file4',append=True)
>>> fp.write('file4','more content')

close(name=None)

Close one or all open files

If a ‘name’ is specified then only the file matching that handle will be closed; with no arguments all open files will be closed.

file_name(name)

Get the file name associated with a handle

NB the file name will be available even if the file has been closed.

Raises KeyError if the key doesn’t exist.

open(name, filen=None, append=False)

Open a new output file

‘name’ is the handle used to reference the file when using the ‘write’ and ‘close’ methods.

‘filen’ is the name of the file, and is unrelated to the handle. If not supplied then ‘name’ must be associated with a previously closed file (which will be reopened).

If ‘append’ is True then append to an existing file rather than overwriting (i.e. use mode ‘at’ instead of ‘wt’).

write(name, s)

Write content to file (newline-terminated)

Writes ‘s’ as a newline-terminated string to the file that is referenced with the handle ‘name’.

class auto_process_ngs.utils.ProgressChecker(every=None, percent=None, total=None)

Check if an index is a multiple of a value or percentage

Utility class to help with reporting progress of iterations over large numbers of items.

Typically progress would only be reported after a certain number or percentage of items have been consumed; the ProgressChecker can be used to check if this number or percentage has been reached.

Example usage: to report after every 100th item:

>>> progress = ProgressChecker(every=100)
>>> for i in range(10000):
...    if progress.check(i):
...       print("Item %d" % i)

To report every 5% of items:

>>> nitems = 10000
>>> progress = ProgressChecker(percent=5,total=nitems)
>>> for i in range(nitems):
...    if progress.check(i):
...       print("Item %d (%.2f%%)" % (i,progress.percent(i)))

check(i)

Check index to see if it matches the interval

Parameters:

i (int) – index to check

Returns:

True if index matches the interval, False if not.

Return type:

Boolean

percent(i)

Convert index to a percentage

Parameters:

i (int) – index to convert

Returns:

index expressed as a percentage of the total number of items.

Return type:

Float

class auto_process_ngs.utils.ZipArchive(zip_file, contents=None, relpath=None, prefix=None)

Utility class for creating .zip archive files

Example usage:

>>> z = ZipArchive('test.zip',relpath='/data')
>>> z.add('/data/file1') # Add a single file
>>> z.add('/data/dir2/') # Add a directory and all contents
>>> z.close()  # to write the archive

add(item)

Add an item (file or directory) to the zip archive

add_dir(dirn)

Recursively add a directory and its contents

add_file(filen)

Add a file to the zip archive

auto_process_ngs.utils.edit_file(filen, editor='vi', append=None)

Send a file to an editor

Creates a temporary copy of a file and opens an editor to allow the user to make changes. Any edits are saved back to the original file.

Parameters:
  • filen (str) – path to the file to be edited

  • editor (str) – optional, editor command to be used (will be overridden by the user’s EDITOR environment variable, if that is set). Defaults to ‘vi’.

  • append (str) – optional, if set then append the supplied text to the end of the file before editing. NB the text will only be kept if the user saves a change to the file in the editor.

auto_process_ngs.utils.fetch_file(src, dest=None)

Fetch a copy of a file from an arbitrary location

Gets a copy of a file which can be specified as either a local or a remote path (i.e. using the syntax [[USER@]HOST:]PATH) or as a URL.

If a destination file name is not supplied then the destination file name will be that of the source file and will be copied to the current working directory; if the supplied destination is a directory then the destination file name will be that of the source file and will be copied to that directory.

The destination must be a local file or directory.
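
The destination rules above can be sketched as a small path-resolution helper. This covers only how the final file name is chosen, not the copying itself, and `resolve_destination` is a hypothetical name rather than part of the module.

```python
import os
import tempfile


def resolve_destination(src, dest=None):
    """Work out the local path that a copy of 'src' would be written to."""
    name = os.path.basename(src.rstrip("/"))
    if dest is None:
        # No destination: keep the source name, use the current directory
        return os.path.join(os.getcwd(), name)
    if os.path.isdir(dest):
        # Destination is a directory: keep the source name inside it
        return os.path.join(dest, name)
    # Otherwise the destination is taken as the file name itself
    return dest


target_dir = tempfile.mkdtemp()
no_dest = resolve_destination("user@example.org:/data/file.txt")
into_dir = resolve_destination("https://example.org/files/file.txt",
                               target_dir)
explicit = resolve_destination("/data/file.txt",
                               os.path.join(target_dir, "copy.txt"))
```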

Parameters:
  • src (str) – path or URL of the file to be copied

  • dest (str) – optional, local destination file name or directory to copy src to

Returns:

path to the copy.

Return type:

String

auto_process_ngs.utils.find_executables(names, info_func, reqs=None, paths=None)

List available executables matching list of names

By default searches the PATH for the executables listed in ‘names’, using the supplied ‘info_func’ to acquire the package name and version of each, and returns a list of executables with the full path, package and version.

‘info_func’ is a function that must be supplied by the calling subprogram. Its signature should look like:

>>> def info_func(p):
...   # Determine full_path, package_name and
...   # version
...   # Then return these as a tuple
...   return (full_path,package_name,version)

The ‘reqs’ argument allows a specific version or range of versions to be requested; in this case the returned list will only contain those packages which satisfy the requested versions.

A range of version specifications can be requested by separating multiple specifiers with a comma - for example ‘>1.8.3,<2.16’.

The full set of operators is:

  • ==, >, >=, <=, <

If no versions are requested then the packages will be returned in PATH order; otherwise they will be returned in version order (highest to lowest).
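
To make the requirement syntax concrete, here is a minimal sketch of how a comma-separated specification could be evaluated against a version string. It is an independent illustration rather than the function's own logic: `satisfies` and `as_tuple` are hypothetical helpers, and only purely numeric dotted versions are handled.

```python
import operator
import re

# Supported comparison operators
OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def as_tuple(version):
    """Convert e.g. '1.8.4' to (1, 8, 4) so versions compare numerically."""
    return tuple(int(x) if x.isdigit() else x for x in version.split("."))


def satisfies(version, reqs):
    """Check 'version' against every specifier in a comma-separated list."""
    for spec in reqs.split(","):
        m = re.match(r"^\s*(==|>=|<=|>|<)?\s*(.+?)\s*$", spec)
        # A missing operator is treated as '=='
        compare = OPS[m.group(1) or "=="]
        if not compare(as_tuple(version), as_tuple(m.group(2))):
            return False
    return True


versions = ["1.8.3", "1.8.4", "2.2", "2.16", "2.17"]
# Keep matching versions, highest first
matching = sorted((v for v in versions if satisfies(v, ">1.8.3,<2.16")),
                  key=as_tuple, reverse=True)
```

Comparing tuples of integers rather than raw strings is what makes "2.2" sort below "2.16" here.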

Parameters:
  • names (list) – list of executable names to look for. These can be full paths or executables with no leading paths

  • info_func (function) – function to use to get tuples of (full_path,package_name,version) for an executable

  • reqs (str) – optional version requirement expression (for example ‘>=1.8.4’). If supplied then only executables fulfilling the requirement will be returned. If no operator is supplied then ‘==’ is implied.

  • paths (list) – optional set of directory paths to search when looking for executables. If not supplied then the set of paths specified in the PATH environment variable will be searched.

Returns:

full paths to executables matching specified criteria.

Return type:

List

auto_process_ngs.utils.get_numbered_subdir(name, parent_dir=None, full_path=False)

Return a name for a new numbered log subdirectory

Generates the name for a numbered subdirectory.

Subdirectories are named as NNN_<name> e.g. 001_setup, 002_make_fastqs etc.

‘Gaps’ are ignored, so the number associated with the new name will be one plus the highest index that already exists.

Note that a directory is not created - this must be done by the calling subprogram. As a result there is the possibility of a race condition.
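
The numbering rule can be sketched as follows. This is an illustration under assumptions, not the function itself: `next_numbered_name` is a hypothetical helper that takes the existing entry names as a list, and the three-digit zero padding is inferred from the examples above.

```python
import re


def next_numbered_name(name, existing):
    """Return 'NNN_<name>' using one plus the highest existing index."""
    highest = 0
    for entry in existing:
        m = re.match(r"^(\d+)_", entry)
        if m:
            # Gaps are ignored: only the highest index matters
            highest = max(highest, int(m.group(1)))
    return "%03d_%s" % (highest + 1, name)


after_gap = next_numbered_name("qc", ["001_setup", "003_make_fastqs", "notes"])
first_name = next_numbered_name("setup", [])
```

Because nothing is created on disk, two processes computing a name at the same moment would get the same result, which is the race condition noted above.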

Parameters:
  • name (str) – name for the subdirectory (typically the name of the processing stage that will produce logs to be written to the subdirs)

  • parent_dir (str) – path to the parent directory where the indexed directory would be created; defaults to CWD if not set

  • full_path (bool) – if True then return the full path for the new subdirectory; default is to return the name relative to the parent directory

Returns:

name for the new log subdirectory (will be the full path if ‘full_path’ was specified).

Return type:

String

auto_process_ngs.utils.get_organism_list(organisms)

Return a list of normalised organism names

Normalisation consists of converting names to lower case and spaces to underscores.

E.g.

“Human,Mouse” -> [‘human’,’mouse’]
“Xenopus tropicalis” -> [‘xenopus_tropicalis’]

Parameters:

organisms (str) – string with organism names separated by commas

Returns:

list of normalised organism names

Return type:

List

auto_process_ngs.utils.normalise_organism_name(name)

Return normalised organism name

Normalisation consists of converting names to lower case and spaces to underscores.

E.g.

“Human” -> ‘human’
“Xenopus tropicalis” -> ‘xenopus_tropicalis’

Parameters:

name (str) – organism name

Returns:

normalised organism name

Return type:

String
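
The normalisation rule stated for this function and for get_organism_list is simple enough to sketch in a few lines. The helper names `normalise` and `organism_list` are hypothetical, and any handling beyond lower-casing and replacing spaces is not covered.

```python
def normalise(name):
    """Lower-case the name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def organism_list(organisms):
    """Normalise each name in a comma-separated string."""
    return [normalise(x) for x in organisms.split(",")]


single = normalise("Xenopus tropicalis")
several = organism_list("Human,Mouse")
```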

auto_process_ngs.utils.paginate(text)

Send text to stdout with pagination

If the function detects that the stdout is an interactive terminal then the supplied text will be piped via a paginator command.

The pager command will be the default for pydoc, but can be overridden by the PAGER environment variable.

If stdout is not a terminal (for example if it’s being sent to a file, or piped to another command) then the pagination is skipped.

Parameters:

text (str) – text to be printed using pagination
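
The terminal check described above can be sketched with the standard library alone. This is an assumption-laden illustration rather than the function's source: `paginate_sketch`, its `stream` parameter and its boolean return value are hypothetical and exist only to make the behaviour easy to observe.

```python
import io
import pydoc
import sys


def paginate_sketch(text, stream=None):
    """Page 'text' if the stream is a terminal, else write it directly.

    Returns True if the pager was used, False otherwise.
    """
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        # pydoc chooses a pager, taking the PAGER variable into account
        pydoc.pager(text)
        return True
    stream.write(text)
    return False


# An in-memory stream is never a terminal, so the pager is skipped
buf = io.StringIO()
used_pager = paginate_sketch("line 1\nline 2\n", stream=buf)
```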

auto_process_ngs.utils.parse_samplesheet_spec(s)

Split sample sheet line specification into components

Given a sample sheet line specification of the form “[LANES:][COL=PATTERN:]VALUE”, split into the component parts and return as a tuple:

(VALUE,LANES,COL,PATTERN)

where ‘LANES’ is a list of lanes (or ‘None’, if not present), and the others will be strings (or ‘None’, if not present).
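
A rough sketch of the splitting is shown below. The source does not say how lanes are written inside the specification, so comma-separated integers are assumed here; `split_spec` is a hypothetical helper and the sample values are invented for illustration.

```python
def split_spec(s):
    """Split '[LANES:][COL=PATTERN:]VALUE' into (VALUE, LANES, COL, PATTERN)."""
    lanes = col = pattern = None
    fields = s.split(":")
    # The final field is always the value
    value = fields[-1]
    for field in fields[:-1]:
        if "=" in field:
            col, pattern = field.split("=", 1)
        else:
            # Assumption: lanes are given as comma-separated integers
            lanes = [int(x) for x in field.split(",")]
    return (value, lanes, col, pattern)


full_spec = split_spec("1,2:SampleProject=ABC*:NewProject")
value_only = split_spec("NewProject")
```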

Parameters:

s (str) – specification string

Returns:

the extracted components as (VALUE,LANES,COL,PATTERN).

Return type:

Tuple

auto_process_ngs.utils.parse_version(s)

Split a version string into a tuple for comparison

Given a version string of the form e.g. “X.Y.Z”, return a tuple of the components e.g. (X,Y,Z)

Where possible components will be converted to integers.

If the version string is empty then the version number will be set to an arbitrary negative integer.

Typically the result from this function would not be used directly, instead it is used to compare two versions, for example:

>>> parse_version("2.17") < parse_version("1.8")
False

Parameters:

s (str) – version string

Returns:

tuple of the version string

Return type:

Tuple

auto_process_ngs.utils.pretty_print_rows(data, prepend=False)

Format row-wise data into ‘pretty’ lines

Given ‘row-wise’ data (in the form of a list of lists), for example:

[[‘hello’,’A salutation’],[‘goodbye’,’The End’]]

formats into a string of text where lines are newline-separated and the ‘fields’ are padded with spaces so that they line up left-justified in columns, for example:

hello   A salutation
goodbye The End

Parameters:
  • data – row-wise data as a list of lists

  • prepend – (optional), if True then columns are right-justified (i.e. padding is added before each value).
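
The column padding can be sketched with `str.ljust` and `str.rjust`. This is an illustrative reimplementation under assumptions (a single space between columns, every cell padded to its column width), and `format_rows` is a hypothetical name.

```python
def format_rows(data, prepend=False):
    """Pad each field to its column width and join rows with newlines."""
    ncols = max(len(row) for row in data)
    # Width of each column is that of its longest value
    widths = [max(len(str(row[i])) for row in data if i < len(row))
              for i in range(ncols)]
    lines = []
    for row in data:
        cells = []
        for i, item in enumerate(row):
            text = str(item)
            # Right-justify if 'prepend' is set, otherwise left-justify
            cells.append(text.rjust(widths[i]) if prepend
                         else text.ljust(widths[i]))
        lines.append(" ".join(cells))
    return "\n".join(lines)


rows = [["hello", "A salutation"], ["goodbye", "The End"]]
left = format_rows(rows).split("\n")
right = format_rows(rows, prepend=True).split("\n")
```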

auto_process_ngs.utils.sort_sample_names(samples)

Given a list of sample names, sort into human-friendly order
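
The source does not spell out what "human-friendly" means. One common interpretation is natural ordering, where embedded numbers compare numerically, and the sketch below shows that approach as an assumption rather than as this function's actual rule; `natural_key` is a hypothetical helper.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks for natural sorting."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


samples = ["S10", "S2", "S1", "S21"]
natural = sorted(samples, key=natural_key)
plain = sorted(samples)
```

A plain string sort would place "S10" before "S2"; the key function avoids that by comparing the numeric parts as integers.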

auto_process_ngs.utils.write_script_file(script_file, contents, append=False, shell=None)

Write command to file

Parameters:
  • script_file (str) – path of file to write command to

  • contents (str) – content to write to the file

  • append (bool) – optional, if True and script_file exists then append content (default is to overwrite existing contents)

  • shell – optional, if set then defines the shell to specify after ‘#!’
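
A minimal sketch of the behaviour these parameters describe follows. It is not the function's source: `write_script` is a hypothetical stand-in, and it assumes that ‘shell’ supplies the interpreter path for a shebang (‘#!’) line which is only written when the file is not being appended to.

```python
import os
import tempfile


def write_script(script_file, contents, append=False, shell=None):
    """Write 'contents' to a script, optionally with a shebang line."""
    mode = "at" if append else "wt"
    with open(script_file, mode) as fp:
        # Assumption: the shebang is only written for a fresh file
        if shell is not None and not append:
            fp.write("#!%s\n" % shell)
        fp.write("%s\n" % contents)


tmp_fd, script_path = tempfile.mkstemp(suffix=".sh")
os.close(tmp_fd)
write_script(script_path, "echo hello", shell="/bin/bash")
write_script(script_path, "echo again", append=True)
with open(script_path) as fp:
    script_text = fp.read()
os.remove(script_path)
```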