auto_process_ngs.utils
Utility classes and functions to support auto_process_ngs module.
Classes:
OutputFiles:
BufferedOutputFiles:
ZipArchive:
ProgressChecker:
FileLock:
FileLockError:
Location: extracts information from a location specifier
Functions:
fetch_file:
bases_mask_is_paired_end:
get_organism_list:
normalise_organism_name:
split_user_host_dir:
get_numbered_subdir:
find_executables:
parse_version:
parse_samplesheet_spec:
pretty_print_rows:
sort_sample_names:
write_script_file:
edit_file:
paginate:
- class auto_process_ngs.utils.BufferedOutputFiles(base_dir=None, bufsize=8192, max_open_files=100)
Class for managing multiple output files with buffering
Version of the ‘OutputFiles’ class which buffers writing of data, to reduce the number of underlying write operations.
Usage is similar to OutputFiles, with additional ‘bufsize’ argument which can be used to set the buffer size to use.
- close(name=None)
Close one or all open files
If a ‘name’ is specified then only the file matching that handle will be closed; with no arguments all open files will be closed.
- open(name, filen=None, append=False)
Open a new output file
‘name’ is the handle used to reference the file when using the ‘write’ and ‘close’ methods.
‘filen’ is the name of the file, and is unrelated to the handle. If not supplied then ‘name’ must be associated with a previously closed file (which will be reopened).
If the filename ends with ‘.gz’ then the associated file will automatically be written as a gzip-compressed file.
If ‘append’ is True then append to an existing file rather than overwriting (i.e. use mode ‘at’ instead of ‘wt’).
- write(name, s)
Write content to file (newline-terminated)
Writes ‘s’ as a newline-terminated string to the file that is referenced with the handle ‘name’.
- class auto_process_ngs.utils.FileLock(f, timeout=None)
Class for locking filesystem objects across processes
Usage:
>>> # Make new FileLock instance for cwd
>>> lock = FileLock('.')
>>> # Check that instance doesn't hold the lock
>>> lock.has_lock
False
>>> # Acquire the lock
>>> lock.acquire()
>>> lock.has_lock
True
>>> # Release the lock
>>> lock.release()
If another FileLock instance holds the lock (within this process, or within another) then FileLock.acquire() will immediately raise a FileLockError; however, if a timeout period is specified then the instance will keep retrying until either it acquires the lock or the timeout period is exceeded (in which case a FileLockError exception is also raised).
The FileLock class can also be used as a context manager, e.g.
>>> with FileLock('.'):
...     # Lock pwd while you do stuff
Uses the fcntl module (see https://docs.python.org/3.6/library/fcntl.html, https://stackoverflow.com/a/32650956 and https://stackoverflow.com/a/55011593)
- Parameters:
f (str) – path to filesystem object (file or directory) to lock
timeout (float) – optional, specifies a timeout period after which failure to acquire the lock raises an exception
- acquire(timeout=None)
Acquire the lock on the filesystem object
- Parameters:
timeout (float) – optional, specifies a timeout period after which failure to acquire the lock raises an exception (NB overrides timeout set on instantiation).
- property has_lock
Check if the FileLock instance holds the lock
- release()
Release the lock on the filesystem object
- exception auto_process_ngs.utils.FileLockError
Exceptions associated with the FileLock class
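The retry-until-timeout behaviour described above can be sketched with the standard fcntl module. This is an illustration only and not the FileLock implementation: it assumes a non-blocking fcntl.flock call, and the names try_lock, unlock, LockUnavailable and the poll interval are hypothetical.

```python
import fcntl
import os
import time


class LockUnavailable(Exception):
    """Hypothetical stand-in for FileLockError."""


def try_lock(path, timeout=None, poll=0.05):
    """Return an open descriptor holding an exclusive lock on 'path'.

    Raises LockUnavailable immediately if no timeout is given, or once
    the timeout (in seconds) has elapsed.
    """
    fd = os.open(path, os.O_RDONLY)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock() fail straight away instead of blocking
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError:
            if deadline is None or time.monotonic() >= deadline:
                os.close(fd)
                raise LockUnavailable("unable to lock %s" % path)
            time.sleep(poll)


def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Polling with a short sleep keeps the sketch simple; the poll interval trades responsiveness against the number of wasted system calls.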
- class auto_process_ngs.utils.Location(location)
Class for examining a file-system location specifier
A location specifier can be a local or a remote file or directory. The general form is:
[[user@]server:]path
For a local location, only the ‘path’ component needs to be supplied.
For a remote location, ‘server’ and ‘path’ must be supplied, while ‘user’ is optional.
Alternatively the location can be a URL identifier of the form:
protocol://server/path
The following properties are available:
user: the user name (or None if not specified)
server: the server name (or None if not specified)
path: the path component
is_remote: True if the location is on a remote host, False if it is local (or if it is a URL)
is_url: True if the location points to a URL
url: the URL identifier, if the location points to a URL (or None if not a URL)
protocol: the URL protocol (or None if not a URL)
- Parameters:
location (str) – location specifier of the form ‘[[user@]server:]path’
- property is_remote
Check if location is on a remote server
- property is_url
Check if location is a URL
- property path
Return ‘path’ part of ‘[[user@]server:]path’
- property protocol
Return URL protocol (or None if not a URL)
- property server
Return ‘server’ part of ‘[[user@]server:]path’
- property url
Return path as a URL (or None if not a URL)
- property user
Return ‘user’ part of ‘[[user@]server:]path’
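As a rough illustration of how the two specifier forms break down into the properties listed above, the following standalone sketch splits a location string into its components. It does not use the Location class; split_location is a hypothetical helper and its handling of edge cases (for example paths containing colons) is an assumption.

```python
from urllib.parse import urlsplit


def split_location(location):
    """Split a location specifier into a dict of components."""
    info = dict(user=None, server=None, path=location,
                protocol=None, is_url=False, is_remote=False)
    if "://" in location:
        # URL form: protocol://server/path (not treated as remote)
        parts = urlsplit(location)
        info.update(protocol=parts.scheme, server=parts.netloc,
                    path=parts.path, is_url=True)
    elif ":" in location:
        # Remote form: [user@]server:path
        head, path = location.split(":", 1)
        user, at, server = head.rpartition("@")
        info.update(user=user if at else None, server=server,
                    path=path, is_remote=True)
    return info
```

For example, split_location("jo@example.org:/data/run1") gives user "jo", server "example.org" and path "/data/run1", while a plain "/data/run1" leaves user and server as None.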
- class auto_process_ngs.utils.OutputFiles(base_dir=None)
Class for managing multiple output files
Usage:
Create a new OutputFiles instance:
>>> fp = OutputFiles()
Set up files against keys:
>>> fp.open('file1','first_file.txt')
>>> fp.open('file2','second_file.txt')
Write content to files:
>>> fp.write('file1','some content for first file')
>>> fp.write('file2','content for\nsecond file')
Append content to an existing file:
>>> fp.open('file3','third_file.txt',append=True)
>>> fp.write('file2','appended content')
Check if key exists and associated file handle is available for writing:
>>> 'file1' in fp
True
>>> 'file3' in fp
False
Finish and close all open files:
>>> fp.close()
Reopen and append to a previously opened and closed file:
>>> fp.open('file4','fourth_file.txt')
>>> fp.write('file4','some content')
>>> fp.close('file4')
>>> fp.open('file4',append=True)
>>> fp.write('file4','more content')
- close(name=None)
Close one or all open files
If a ‘name’ is specified then only the file matching that handle will be closed; with no arguments all open files will be closed.
- file_name(name)
Get the file name associated with a handle
NB the file name will be available even if the file has been closed.
Raises KeyError if the key doesn’t exist.
- open(name, filen=None, append=False)
Open a new output file
‘name’ is the handle used to reference the file when using the ‘write’ and ‘close’ methods.
‘filen’ is the name of the file, and is unrelated to the handle. If not supplied then ‘name’ must be associated with a previously closed file (which will be reopened).
If ‘append’ is True then append to an existing file rather than overwriting (i.e. use mode ‘at’ instead of ‘wt’).
- write(name, s)
Write content to file (newline-terminated)
Writes ‘s’ as a newline-terminated string to the file that is referenced with the handle ‘name’.
- class auto_process_ngs.utils.ProgressChecker(every=None, percent=None, total=None)
Check if an index is a multiple of a value or percentage
Utility class to help with reporting progress of iterations over large numbers of items.
Typically progress would only be reported after a certain number or percentage of items have been consumed; the ProgressChecker can be used to check if this number or percentage has been reached.
Example usage: to report after every 100th item:
>>> progress = ProgressChecker(every=100)
>>> for i in range(10000):
...     if progress.check(i):
...         print("Item %d" % i)
To report every 5% of items:
>>> nitems = 10000
>>> progress = ProgressChecker(percent=5,total=nitems)
>>> for i in range(nitems):
...     if progress.check(i):
...         print("Item %d (%.2f%%)" % (i,progress.percent(i)))
- check(i)
Check index to see if it matches the interval
- Parameters:
i (int) – index to check
- Returns:
True if index matches the interval, False if not.
- Return type:
Boolean
- percent(i)
Convert index to a percentage
- Parameters:
i (int) – index to convert
- Returns:
index expressed as a percentage of the total number of items.
- Return type:
Float
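The interval check can be pictured as a simple modulus test. The sketch below is one plausible reading rather than the class's actual logic: it assumes that index 0 counts as a match and that a percentage is converted to a whole number of items; make_checker and as_percent are hypothetical names.

```python
def make_checker(every=None, percent=None, total=None):
    """Return a function reporting whether index 'i' falls on the interval."""
    if every is None:
        # Convert the percentage into a whole number of items (at least 1)
        every = max(int(total * percent / 100.0), 1)
    return lambda i: i % every == 0


def as_percent(i, total):
    """Express index 'i' as a percentage of 'total'."""
    return float(i) / float(total) * 100.0
```

With percent=5 and total=10000 the interval works out at 500 items, so a loop over all 10000 indices would report 20 times.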
- class auto_process_ngs.utils.ZipArchive(zip_file, contents=None, relpath=None, prefix=None)
Utility class for creating .zip archive files
Example usage:
>>> z = ZipArchive('test.zip',relpath='/data')
>>> z.add('/data/file1') # Add a single file
>>> z.add('/data/dir2/') # Add a directory and all contents
>>> z.close() # to write the archive
- add(item)
Add an item (file or directory) to the zip archive
- add_dir(dirn)
Recursively add a directory and its contents
- add_file(filen)
Add a file to the zip archive
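For comparison, the same kind of archive can be built directly with the standard zipfile module. The sketch below is not the ZipArchive implementation; zip_items is a hypothetical function, and it assumes that ‘relpath’ is stripped from the stored names and that ‘prefix’ is prepended to them, which the documentation above does not spell out.

```python
import os
import zipfile


def zip_items(zip_file, items, relpath=None, prefix=None):
    """Write files and directories in 'items' to a new zip archive."""
    def arcname(path):
        # Strip 'relpath' from the stored name, then prepend 'prefix'
        name = os.path.relpath(path, relpath) if relpath else path
        return os.path.join(prefix, name) if prefix else name

    with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_DEFLATED) as z:
        for item in items:
            if os.path.isdir(item):
                # Recursively add every file below the directory
                for dirpath, _, filenames in os.walk(item):
                    for f in sorted(filenames):
                        path = os.path.join(dirpath, f)
                        z.write(path, arcname(path))
            else:
                z.write(item, arcname(item))
```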
- auto_process_ngs.utils.edit_file(filen, editor='vi', append=None)
Send a file to an editor
Creates a temporary copy of a file and opens an editor to allow the user to make changes. Any edits are saved back to the original file.
- Parameters:
filen (str) – path to the file to be edited
editor (str) – optional, editor command to be used (will be overridden by the user’s EDITOR environment variable, even if this argument is set). Defaults to ‘vi’.
append (str) – optional, if set then append the supplied text to the end of the file before editing. NB the text will only be kept if the user saves a change to the file in the editor.
- auto_process_ngs.utils.fetch_file(src, dest=None)
Fetch a copy of a file from an arbitrary location
Gets a copy of a file which can be specified as either a local or a remote path (i.e. using the syntax [[USER@]HOST:]PATH) or as a URL.
If a destination file name is not supplied then the file is copied to the current working directory under the same name as the source file; if the supplied destination is a directory then the file is copied into that directory, again under the source file name.
The destination must be a local file or directory.
- Parameters:
src (str) – path or URL of the file to be copied
dest (str) – optional, local destination file name or directory to copy src to
- Returns:
path to the copy.
- Return type:
String
- auto_process_ngs.utils.find_executables(names, info_func, reqs=None, paths=None)
List available executables matching list of names
By default searches the PATH for the executables listed in ‘names’, using the supplied ‘info_func’ to acquire the package name and version of each, and returns a list of executables with the full path, package and version.
‘info_func’ is a function that must be supplied by the calling subprogram. Its signature should look like:
>>> def info_func(p):
...     # Determine full_path, package_name and
...     # version
...     # Then return these as a tuple
...     return (full_path,package_name,version)
The ‘reqs’ argument allows a specific version or range of versions to be requested; in this case the returned list will only contain those packages which satisfy the requested versions.
A range of version specifications can be requested by separating multiple specifiers with a comma - for example ‘>1.8.3,<2.16’.
The full set of operators is:
==, >, >=, <=, <
If no versions are requested then the packages will be returned in PATH order; otherwise they will be returned in version order (highest to lowest).
- Parameters:
names (list) – list of executable names to look for. These can be full paths or executables with no leading paths
info_func (function) – function to use to get tuples of (full_path,package_name,version) for an executable
reqs (str) – optional version requirement expression (for example ‘>=1.8.4’). If supplied then only executables fulfilling the requirement will be returned. If no operator is supplied then ‘==’ is implied.
paths (list) – optional set of directory paths to search when looking for executables. If not supplied then the set of paths specified in the PATH environment variable will be searched.
- Returns:
full paths to executables matching specified criteria.
- Return type:
List
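The version requirement syntax described above (comma-separated specifiers, with ‘==’ implied when no operator is given) could be evaluated along the following lines. This is a sketch of the matching rules only, not the code used by find_executables; satisfies and as_tuple are hypothetical helpers, and purely numeric dotted versions are assumed.

```python
import operator

OPERATORS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le,
             ">": operator.gt, "<": operator.lt}


def as_tuple(version):
    """Convert 'X.Y.Z' to a tuple of integers (numeric versions only)."""
    return tuple(int(x) for x in version.split("."))


def satisfies(version, reqs):
    """Check 'version' against a comma-separated requirement string."""
    for spec in reqs.split(","):
        spec = spec.strip()
        # Try two-character operators first so '>=' is not read as '>'
        for op in ("==", ">=", "<=", ">", "<"):
            if spec.startswith(op):
                required = spec[len(op):]
                break
        else:
            op, required = "==", spec  # no operator implies '=='
        if not OPERATORS[op](as_tuple(version), as_tuple(required)):
            return False
    return True
```

Filtering a list of candidate versions with satisfies(v, ">1.8.3,<2.16") and then sorting with key=as_tuple, reverse=True gives the highest-to-lowest ordering mentioned above.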
- auto_process_ngs.utils.get_numbered_subdir(name, parent_dir=None, full_path=False)
Return a name for a new numbered log subdirectory
Generates the name for a numbered subdirectory.
Subdirectories are named as NNN_<name> e.g. 001_setup, 002_make_fastqs etc.
‘Gaps’ are ignored, so the number associated with the new name will be one plus the highest index that already exists.
Note that a directory is not created - this must be done by the calling subprogram. As a result there is the possibility of a race condition.
- Parameters:
name (str) – name for the subdirectory (typically the name of the processing stage that will produce logs to be written to the subdirs)
parent_dir (str) – path to the parent directory where the indexed directory would be created; defaults to CWD if not set
full_path (bool) – if True then return the full path for the new subdirectory; default is to return the name relative to the parent directory
- Returns:
name for the new log subdirectory (will be the full path if ‘full_path’ was specified).
- Return type:
String
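The numbering rule (one plus the highest existing index, ignoring gaps) can be sketched as below. This is an approximation rather than the function's actual code: next_numbered_name is a hypothetical name, and it assumes that any directory whose name starts with digits followed by an underscore counts towards the index.

```python
import os
import re


def next_numbered_name(name, parent_dir=None):
    """Return the next 'NNN_<name>' directory name under 'parent_dir'."""
    parent_dir = parent_dir or os.getcwd()
    highest = 0
    for entry in os.listdir(parent_dir):
        # Only directories named like '001_setup' are counted
        match = re.match(r"^(\d+)_", entry)
        if match and os.path.isdir(os.path.join(parent_dir, entry)):
            highest = max(highest, int(match.group(1)))
    return "%03d_%s" % (highest + 1, name)
```

Because nothing is created, two processes calling this at the same time could be handed the same name, which is the race condition noted above.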
- auto_process_ngs.utils.get_organism_list(organisms)
Return a list of normalised organism names
Normalisation consists of converting names to lower case and spaces to underscores.
E.g.
"Human,Mouse" -> ['human','mouse']
"Xenopus tropicalis" -> ['xenopus_tropicalis']
- Parameters:
organisms (str) – string with organism names separated by commas
- Returns:
list of normalised organism names
- Return type:
List
- auto_process_ngs.utils.normalise_organism_name(name)
Return normalised organism name
Normalisation consists of converting names to lower case and spaces to underscores.
E.g.
"Human" -> 'human'
"Xenopus tropicalis" -> 'xenopus_tropicalis'
- Parameters:
name (str) – organism name
- Returns:
normalised organism name
- Return type:
String
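The normalisation rule is simple enough to express in a couple of lines. The sketch below reproduces the documented examples using hypothetical standalone functions (normalise and organism_list); the real functions may handle extra cases such as surrounding whitespace differently.

```python
def normalise(name):
    """Lower-case an organism name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def organism_list(organisms):
    """Split a comma-separated string and normalise each name."""
    return [normalise(name) for name in organisms.split(",")]
```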
- auto_process_ngs.utils.paginate(text)
Send text to stdout with pagination
If the function detects that the stdout is an interactive terminal then the supplied text will be piped via a paginator command.
The pager command will be the default for pydoc, but can be overridden by the PAGER environment variable.
If stdout is not a terminal (for example if it’s being redirected to a file, or piped to another command) then the pagination is skipped.
- Parameters:
text (str) – text to be printed using pagination
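A minimal sketch of this behaviour, assuming the terminal check is done with isatty() and the paging is delegated to pydoc.pager (which consults the PAGER environment variable). The function name page_text and its ‘stream’ argument are hypothetical and exist only to make the sketch easy to exercise.

```python
import pydoc
import sys


def page_text(text, stream=None):
    """Page 'text' if writing to a terminal, otherwise write it directly."""
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        # Interactive terminal: hand over to pydoc's pager
        pydoc.pager(text)
    else:
        # File or pipe: skip pagination
        stream.write(text + "\n")
```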
- auto_process_ngs.utils.parse_samplesheet_spec(s)
Split sample sheet line specification into components
Given a sample sheet line specification of the form “[LANES:][COL=PATTERN:]VALUE”, split into the component parts and return as a tuple:
(VALUE,LANES,COL,PATTERN)
where ‘LANES’ is a list of lanes (or ‘None’, if not present), and the others will be strings (or ‘None’, if not present).
- Parameters:
s (str) – specification string
- Returns:
the extracted components as (VALUE,LANES,COL,PATTERN).
- Return type:
Tuple
- auto_process_ngs.utils.parse_version(s)
Split a version string into a tuple for comparison
Given a version string of the form e.g. “X.Y.Z”, return a tuple of the components e.g. (X,Y,Z)
Where possible components will be converted to integers.
If the version string is empty then the version number will be set to an arbitrary negative integer.
Typically the result from this function would not be used directly, instead it is used to compare two versions, for example:
>>> parse_version("2.17") < parse_version("1.8")
False
- Parameters:
s (str) – version string
- Returns:
tuple of the version string
- Return type:
Tuple
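To show why versions are compared as tuples rather than as strings, here is a small sketch of the conversion. It is not the parse_version implementation: version_tuple is a hypothetical name, and the value returned for an empty string is a placeholder, since the documentation only says it is an arbitrary negative integer.

```python
def version_tuple(s):
    """Split a dotted version string into a tuple for comparison."""
    if not s:
        return (-1,)  # placeholder for the "arbitrary negative integer"
    items = []
    for part in s.split("."):
        try:
            items.append(int(part))
        except ValueError:
            # Leave non-numeric components (e.g. '4a') as strings
            items.append(part)
    return tuple(items)
```

Comparing "1.10" and "1.8" as plain strings puts "1.10" first, whereas the tuples (1, 10) and (1, 8) compare in the expected numeric order.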
- auto_process_ngs.utils.pretty_print_rows(data, prepend=False)
Format row-wise data into ‘pretty’ lines
Given ‘row-wise’ data (in the form of a list of lists), for example:
[['hello','A salutation'],['goodbye','The End']]
formats into a string of text where lines are newline-separated and the ‘fields’ are padded with spaces so that they line up left-justified in columns, for example:
hello   A salutation
goodbye The End
- Parameters:
data – row-wise data as a list of lists
prepend – (optional), if True then columns are right-justified (i.e. padding is added before each value).
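The column alignment can be sketched as follows. This is an illustration of the padding idea and not the function's actual code: format_rows is a hypothetical name, and the single-space separator and removal of trailing whitespace are assumptions.

```python
def format_rows(data, prepend=False):
    """Pad fields so that columns line up, one row per line."""
    # Find the widest value in each column
    widths = [0] * max(len(row) for row in data)
    for row in data:
        for i, item in enumerate(row):
            widths[i] = max(widths[i], len(str(item)))
    lines = []
    for row in data:
        fields = []
        for i, item in enumerate(row):
            text = str(item)
            # Pad before the value (right-justify) or after it (left-justify)
            fields.append(text.rjust(widths[i]) if prepend
                          else text.ljust(widths[i]))
        lines.append(" ".join(fields).rstrip())
    return "\n".join(lines)
```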
- auto_process_ngs.utils.sort_sample_names(samples)
Given a list of sample names, sort into human-friendly order
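One common way to get a human-friendly ordering is a "natural sort", in which runs of digits are compared as numbers. The sketch below shows that technique; whether sort_sample_names works exactly this way is an assumption, and natural_key is a hypothetical helper.

```python
import re


def natural_key(name):
    """Build a sort key where digit runs compare numerically."""
    # re.split with a capture group alternates text and digit runs,
    # so odd positions always hold digits
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]
```

For example, sorted(["S10", "S2", "S1"], key=natural_key) places "S2" before "S10", unlike a plain string sort.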
- auto_process_ngs.utils.write_script_file(script_file, contents, append=False, shell=None)
Write command to file
- Parameters:
script_file (str) – path of file to write command to
contents (str) – content to write to the file
append (bool) – optional, if True and script_file exists then append content (default is to overwrite existing contents)
shell – optional, if set then defines the shell to specify after ‘#!’
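A minimal sketch of writing such a script file, assuming that the ‘#!’ line is only written when the file is created or overwritten (not when appending) and that the contents are newline-terminated. The name write_script is hypothetical and details such as file permissions are left out.

```python
def write_script(script_file, contents, append=False, shell=None):
    """Write 'contents' to 'script_file', with an optional '#!' line."""
    mode = "a" if append else "w"
    with open(script_file, mode) as fp:
        if shell is not None and not append:
            # The interpreter line must be the first line of the file
            fp.write("#!%s\n" % shell)
        fp.write("%s\n" % contents)
```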