Managing and sharing data
These additional tasks are not really part of the automated processing, but the utilities for performing them are currently part of the same package so they are outlined here.
There are no specific utilities for exporting Fastqs to a Galaxy data library, but suggestions on how to do this can be found in the section Exporting Fastqs to a data library in a local Galaxy instance.
fetch_data.py: import files and directories
The fetch_data.py utility can be used to copy arbitrary files and
directories onto the local system. As part of this it will also replace
spaces in path names with underscores (for example if copying data
from a Windows-based system to a Linux system).
Note
The initial use case for this utility is to copy image files for 10x
Genomics Visium datasets into the Visium_images subdirectory of
the project directory.
General usage is:
fetch_data.py SOURCE_DIR LOCAL_DIR
which will import the contents of the directory SOURCE_DIR and
put them into LOCAL_DIR. By default the directory structure of the
imported data is preserved, but the --flatten option can be
used to copy files without the intermediate subdirectories.
If an existing file is found in the destination with the same name as
an imported file then the copy is skipped for that file; specifying the
--overwrite option means that the existing file will be replaced
with the imported version.
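The import behaviour described above (space replacement, optional flattening, skipping or overwriting existing files) can be illustrated with a minimal Python sketch. This is not the utility's actual implementation: the function name import_tree and its arguments are hypothetical, and it only handles local directories.

```python
import shutil
from pathlib import Path


def import_tree(source_dir, local_dir, flatten=False, overwrite=False):
    """Copy files from source_dir into local_dir (hypothetical helper).

    Spaces in file and directory names are replaced with underscores.
    Returns the list of destination paths that were actually written.
    """
    source_dir, local_dir = Path(source_dir), Path(local_dir)
    copied = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        # Keep only the file name when flattening, else the full relative path
        parts = [rel.name] if flatten else list(rel.parts)
        # Replace spaces in every path component
        dest = local_dir.joinpath(*(part.replace(" ", "_") for part in parts))
        if dest.exists() and not overwrite:
            continue  # existing file: skip unless overwriting was requested
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Note that when flattening, files with the same name in different subdirectories would collide; in this sketch the later one is skipped unless overwrite is set.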
SOURCE_DIR can either be a local directory, or a directory on a
remote server. The syntax [USER@]HOST:DIR can be used to specify
a remote source.
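As an illustration of the [USER@]HOST:DIR syntax, the following Python sketch splits a location into its user, host and directory parts. The function parse_location is hypothetical and is not part of the package; how the utility itself parses locations is not described here.

```python
import re

# Optional "USER@", then "HOST:", then the directory part
_REMOTE = re.compile(r"^(?:(?P<user>[^@/:]+)@)?(?P<host>[^@/:]+):(?P<path>.*)$")


def parse_location(location):
    """Return (user, host, path); user and host are None for local paths."""
    match = _REMOTE.match(location)
    if match is None:
        return (None, None, location)
    return (match.group("user"), match.group("host"), match.group("path"))
```

For example, parse_location("me@remote.org:/data/run1") gives the user "me", the host "remote.org" and the path "/data/run1", while a plain path such as "/local/dir" is returned unchanged with no user or host.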
A single file can also be imported:
fetch_data.py SOURCE_FILE DEST
which will copy SOURCE_FILE into directory DEST (if DEST
is an existing directory), or copy and rename as file DEST (if
DEST doesn’t exist).
manage_fastqs.py: managing and copying Fastq files
The manage_fastqs.py utility can be used to explore and copy data from
an analysis directory to another location on a per-project basis.
For example, to get a list of projects within a run:
manage_fastqs.py ANALYSIS_DIR
Note
If a project contains multiple sets of FASTQs then these will be listed under the project name, with the default “primary” project marked with an asterisk.
To see a list of the FASTQ files associated with a particular project:
manage_fastqs.py ANALYSIS_DIR PROJECTNAME
To copy the FASTQs to a local or remote directory, use the copy command:
manage_fastqs.py ANALYSIS_DIR PROJECTNAME copy /path/to/local/dir
manage_fastqs.py ANALYSIS_DIR PROJECTNAME copy me@remote.org:/path/to/remote/dir
This will also generate an MD5 checksum file for the transferred files; to
generate the MD5 sums on their own, use the md5 command:
manage_fastqs.py ANALYSIS_DIR PROJECTNAME md5
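For reference, MD5 sums of this kind can be computed with Python's standard hashlib module. The sketch below is only an illustration: the helper names are hypothetical, and the "checksum, two spaces, file name" layout is an assumption based on the format used by the Linux md5sum program rather than a statement about the file that manage_fastqs.py writes.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def write_checksums(fastqs, checksum_file):
    """Write one 'DIGEST  NAME' line per Fastq (assumed md5sum-style layout)."""
    with open(checksum_file, "w") as out:
        for fastq in fastqs:
            out.write(f"{md5sum(fastq)}  {Path(fastq).name}\n")
```

Reading in chunks keeps memory use constant, which matters for large Fastq files.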
Finally to make a zip file containing the Fastqs, use the zip command:
manage_fastqs.py ANALYSIS_DIR PROJECTNAME zip
Note
The zip command works best if the Fastqs are relatively small.
Working with multiple Fastq sets
If a project has more than one Fastq set associated with it then by
default the operations described above will use the “primary” set
(typically, the set of Fastq files in the fastqs subdirectory
of the project).
To operate on an alternative set, use the --fastq_dir option to
switch e.g.:
manage_fastqs.py ANALYSIS_DIR PROJECTNAME --fastq_dir=ALT_FASTQS_DIR
Handling subsets of files
Use the --filter option to work with a subset of files - this allows a
‘glob’-style pattern to be specified so that only files with matching names
will be included.
For example to only copy R1 files:
manage_fastqs.py ANALYSIS_DIR PROJECTNAME copy /path/to/local/dir --filter "*_R1_*"
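To show how a glob-style pattern selects files by name, here is a small Python sketch using the standard fnmatch module. The function filter_fastqs is hypothetical; whether the utility uses fnmatch internally is an assumption.

```python
from fnmatch import fnmatchcase


def filter_fastqs(names, pattern):
    """Return only the file names matching the glob-style pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


fastqs = ["PB_S1_R1_001.fastq.gz", "PB_S1_R2_001.fastq.gz"]
r1_only = filter_fastqs(fastqs, "*_R1_*")
```

On the command line, quoting the pattern prevents the shell from expanding it against files in the current directory before the utility sees it.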
transfer_data.py: copying data for transfer to end users
Overview
The transfer_data.py utility can be used to copy data from analysis
projects to different destinations, typically to transfer copies of
data to end users.
A destination is a local or remote directory to which files will be copied. For example, in its most basic mode:
transfer_data.py /mnt/data/shared PROJECT_DIR
will copy Fastq files from the project referenced by PROJECT_DIR to
the local directory /mnt/data/shared.
Destinations can also be defined in the configuration file (see Data transfer destinations) and then referred to by their name when copying the Fastqs.
For example:
transfer_data.py webserver PROJECT_DIR
where webserver is a pre-defined destination.
A job runner can be specified in the configuration file to use for
computationally-intensive operations (for example copying large files
or creating archives), by setting the transfer_data runner (see
Job Runners). If this isn’t set then the rsync runner will
be used instead.
The --dry-run option can be used to check what the utility will
do without actually performing any operations.
Schemes for dynamic subdirectory specification
By default the data are copied directly to the specified directory. However it is possible to specify a scheme for dynamic subdirectory assignment, which can be useful for example if copying to a webserver.
The scheme can be specified via either the --subdir command line
option or the subdir parameter in the configuration file.
The following schemes are available:
| Scheme name | Behaviour |
|---|---|
| | Locates an empty pre-existing subdirectory (aka ‘bin’) at random |
| | Creates a new subdirectory named |
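The first behaviour in the table (choosing an empty pre-existing subdirectory at random) could be sketched in Python as follows. The function pick_random_bin is hypothetical and is only meant to make the idea concrete; it is not the package's own code.

```python
import os
import random


def pick_random_bin(dest_dir, rng=random):
    """Return the path of a randomly chosen empty subdirectory of dest_dir."""
    empty_bins = sorted(
        entry.path
        for entry in os.scandir(dest_dir)
        if entry.is_dir() and not os.listdir(entry.path)
    )
    if not empty_bins:
        raise RuntimeError(f"no empty subdirectories found in {dest_dir}")
    return rng.choice(empty_bins)
```

Using pre-existing bins with unguessable names is one way to share data via a webserver without exposing a browsable directory structure.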
Generating a README file from a template
It is possible to generate a README for the copied data by
specifying a template file via either the --readme command line
option or the readme_template parameter in the configuration
file.
The template should be a plain text file but it can also contain
placeholders for ‘template variables’ which will be substituted with
the appropriate values when the README file is generated:
| Placeholder | Value |
|---|---|
| | Run platform (uppercase) |
| | Run number |
| | Run datestamp |
| | Name of project being copied |
| | Base URL for the webserver |
| | Name of the subdirectory, if any |
| | Directory data were copied to |
| | Today’s date |
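The general idea of template substitution can be sketched with Python's string.Template. Everything specific here is an assumption: the $NAME placeholder syntax, the placeholder names (PLATFORM, RUN_NUMBER, PROJECT, TODAY) and the function render_readme are hypothetical stand-ins, and the utility's real placeholder names and syntax may differ.

```python
from datetime import date
from string import Template


def render_readme(template_text, values):
    """Fill placeholders in template_text (hypothetical $NAME syntax)."""
    # Today's date is supplied automatically unless the caller overrides it
    substitutions = {"TODAY": date.today().isoformat(), **values}
    # safe_substitute leaves unknown placeholders untouched instead of failing
    return Template(template_text).safe_substitute(substitutions)


readme = render_readme(
    "Fastqs for project $PROJECT from $PLATFORM run $RUN_NUMBER",
    {"PROJECT": "PB", "PLATFORM": "MISEQ", "RUN_NUMBER": "26"},
)
```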
Including downloader, QC reports and 10xGenomics pipeline outputs
By default transfer_data.py only copies Fastqs; however, it
is possible to include additional files:
- A standalone downloader script (see download_fastqs.py: fetch Fastqs from a webserver in batch) (specify the --include_downloader option or set the include_downloader parameter in the configuration)
- The zipped QC reports for the project (specify the --include_qc_report option or set the include_qc_report parameter)
- Outputs from 10xGenomics pipelines (e.g. cellranger count) packaged into a tgz archive (specify the --include_10x_outputs option)
- .cloupe files from 10xGenomics pipeline outputs collected into a .zip archive (specify the --include_cloupe_files option)
- Image files in the Visium_images subdirectory of 10xGenomics Visium datasets packaged into a tgz archive (specify the --include_visium_images option)
Hard linking Fastqs
When sharing Fastqs via a local directory which is on the same file
system as the original files, it is possible to make hard links to
the Fastqs rather than making copies by specifying the --link
option (or setting the hard_links parameter).
Linking Fastqs is quicker than copying and saves space as hard links reference the same copy of the file’s data on the file system.
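The effect of hard linking can be demonstrated with Python's os.link. This is a sketch only: the function link_or_copy is hypothetical, and the fallback to copying when linking fails (for example across file systems) is an assumption about sensible behaviour, not a description of what the --link option does.

```python
import os
import shutil


def link_or_copy(src, dest):
    """Hard link src to dest, falling back to a copy if linking fails."""
    try:
        os.link(src, dest)  # both names now refer to the same data on disk
        return "linked"
    except OSError:
        # e.g. src and dest are on different file systems
        shutil.copy2(src, dest)
        return "copied"
```

After a successful link, os.path.samefile(src, dest) returns True and no additional space is used for the file's contents.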
Bundling Fastqs into ZIP archives
For datasets which contain very large numbers of Fastq files it may be undesirable to share the individual Fastqs (for example when downloading from a web server, or uploading to a file transfer service such as ZendTo).
In these cases the following options can be used:
- --zip_fastqs will bundle the Fastqs into one or more ZIP archives (instead of copying each Fastq individually)
- If specified then the --max_zip_size option additionally sets the maximum size for each ZIP archive, resulting in multiple ZIPs if the dataset cannot be put into a single archive of this size.
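One simple way to split a dataset across several size-limited archives is to fill each archive in turn until the next file would exceed the limit. The Python sketch below illustrates this; the function names, the greedy strategy, the archive naming and the use of input file sizes (rather than compressed sizes) are all assumptions and may not match how transfer_data.py does it.

```python
import os
import zipfile


def batch_by_size(files_with_sizes, max_size):
    """Group (name, size) pairs into batches of total size <= max_size.

    A single file larger than max_size is placed in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files_with_sizes:
        if current and current_size + size > max_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_zips(batches, prefix):
    """Write one ZIP archive per batch, named PREFIX.00.zip, PREFIX.01.zip, ..."""
    zip_paths = []
    for index, batch in enumerate(batches):
        zip_path = f"{prefix}.{index:02d}.zip"
        # ZIP_STORED: gzipped Fastqs gain little from being compressed again
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
            for path in batch:
                zf.write(path, arcname=os.path.basename(path))
        zip_paths.append(zip_path)
    return zip_paths
```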
download_fastqs.py: fetch Fastqs from a webserver in batch
Fastq files pushed to a webserver using manage_fastqs.py can be retrieved
in batch using the download_fastqs.py utility:
download_fastqs.py http://example.com/fastqs/
This fetches the checksum file from the URL and then uses that to get a
list of Fastq files to download. Once the files are downloaded it runs
the Linux md5sum program to verify the integrity of the downloads.
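The verification step can be approximated in pure Python, which may be useful on systems without md5sum. This sketch assumes the checksum file uses the md5sum layout (digest, whitespace, file name); the function names are hypothetical and the download step itself is omitted.

```python
import hashlib
import os


def read_checksums(checksum_text):
    """Parse md5sum-style lines into a {file name: digest} dictionary."""
    checksums = {}
    for line in checksum_text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(None, 1)
        # md5sum marks binary-mode entries with a leading '*'
        checksums[name.strip().lstrip("*")] = digest
    return checksums


def verify_downloads(checksums, download_dir):
    """Return the names of files that are missing or fail the MD5 check."""
    failed = []
    for name, expected in checksums.items():
        path = os.path.join(download_dir, name)
        if not os.path.isfile(path):
            failed.append(name)
            continue
        md5 = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                md5.update(chunk)
        if md5.hexdigest() != expected:
            failed.append(name)
    return failed
```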
Note
This utility is stand-alone so it can be sent to end users and used independently of other components of the autoprocess package.
update_project_metadata.py: manage metadata associated with a project
The projects within a run each have a file called README.info which is
used to hold metadata about that project (for example, user, PI, organism,
library type and so on).
Use the update_project_metadata.py utility to check and update the
metadata associated with a project, for example to update the PI:
update_project_metadata.py ANALYSIS_DIR PROJECT -u PI="Andrew Jones"
Note
Project directories created using very old versions of auto_process,
or predating the automated processing system, might not have metadata
files. To create one use:
update_project_metadata.py ANALYSIS_DIR PROJECT -i
before using -u to populate the fields.
audit_projects.py: auditing disk usage for multiple runs
Collections of runs that are copied to an ‘archive’ location via the
archive function of auto_process.py will form a directory structure
of the form:
ARCHIVE_DIR/
    |
    +--- 2015/
          |
          +--- hiseq/
                 |
                 +--- 150429_HISEQ_XXYYY_12345BB_analysis/
                 |
                 +--- 150408_HISEQ_XXYYY_67890CC_analysis/
                 |
                 .
Within each run dir there will be one or more project directories.
The projects can be audited according to PI and disk usage using the
audit_projects.py utility, for example:
audit_projects.py ARCHIVE_DIR/2015/hiseq/
Multiple directories can be specified, e.g.:
audit_projects.py ARCHIVE_DIR/2015/hiseq/ ARCHIVE_DIR/2014/hiseq/
This will print out a summary of usage for each PI, e.g.:
Summary (PI, # of projects, total usage):
=========================================
Peter Brooks 12 3.7T
Trevor Smith 8 2.3T
Donald Raymond 6 2.2T
...
Total usage 164 22.3T
plus a breakdown of the usage for each of the projects belonging to each PI, for example:
Breakdown by PI/project:
========================
Peter Brooks:
150121_HISEQ001_0123_ABCD123XX: SteveAustin 128.1G
150306_HISEQ001_0234_ABCD123XX: MartinLouis 159.7G
150415_HISEQ001_0345_ABCD123XX: MartinLouis 72.8G
...
There is also a summary of the amount of space used for storing the ‘undetermined’ read data, for each run.
Note
The disk usage for each file is calculated by using Python’s os.lstat
function to get the number of 512-byte blocks per file. The total usage
is then the sum of all the files and directories.
However these values can differ from the sizes returned by the Linux
du program, for various reasons including using a different block
size (e.g. du uses 1024-byte blocks). So the returned values should
not be treated as absolutes.
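The block-based calculation described in this note can be sketched as follows. This is an illustrative reimplementation, not the code used by audit_projects.py: the function disk_usage is hypothetical, it relies on st_blocks (available on Unix-like systems only), and it counts hard-linked files once per link.

```python
import os


def disk_usage(path):
    """Return the usage of path in bytes, based on 512-byte blocks.

    Directories are walked recursively; symbolic links are not followed.
    """
    # st_blocks is the number of 512-byte blocks allocated (Unix only)
    total = os.lstat(path).st_blocks * 512
    if os.path.isdir(path) and not os.path.islink(path):
        for entry in os.scandir(path):
            total += disk_usage(entry.path)
    return total
```

Because the result reflects allocated blocks rather than file lengths, it can be larger or smaller than the apparent size of the files.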
Exporting Fastqs to a data library in a local Galaxy instance
Upload of Fastq files from a run into a data library on a Galaxy instance
can be performed using the nebulizer utility.
Note
You will need access to an admin account on the target Galaxy server to create and add to the data libraries.
The create_library and create_library_folder commands can be used
to make the target data library and folder, if these don’t already exist -
for example:
nebulizer create_library MyGalaxy "MISEQ_190626#26" \
--description "Data from MISEQ run 26 datestamp 190626"
nebulizer create_library_folder MyGalaxy "MISEQ_190626#26/Fastqs"
would create a data library called MISEQ_190626#26 on the MyGalaxy instance, and a new folder called Fastqs within that library.
Then the add_library_datasets command can be used to upload Fastqs
to the library.
To upload files from the local system to the server:
nebulizer add_library_datasets MyGalaxy /path/to/fastqs/PB_S1_R1_001.fastq.gz ...
If the files are on the same system as the Galaxy server then the
--server option can be used, for example:
nebulizer add_library_datasets mygalaxy --server Data_Library/Fastqs /path/to/fastqs/on/server/PB_S1_R1_001.fastq.gz ...
It is possible in this case to get Galaxy to create links to the Fastqs
(rather than making copies) which can potentially save time and disk
space, by including the --link option:
nebulizer add_library_datasets mygalaxy --server --link Data_Library/Fastqs /path/to/fastqs/on/server/PB_S1_R1_001.fastq
Warning
Making links only seems to work for uncompressed Fastq files.
For information on nebulizer see
https://nebulizer.readthedocs.io/en/latest/