Managing and sharing data
=========================

These additional tasks are not really part of the automated processing,
but the utilities for performing them are currently part of the same
package so they are outlined here.

* :ref:`fetch_data`
* :ref:`manage_fastqs`
* :ref:`transfer_data`
* :ref:`download_fastqs`
* :ref:`update_project_metadata`
* :ref:`audit_projects`

There are no specific utilities for exporting Fastqs to a Galaxy data
library, but suggestions on how to do this can be found in the section
:ref:`exporting_to_galaxy`.

.. _fetch_data:

``fetch_data.py``: import files and directories
***********************************************

The ``fetch_data.py`` utility can be used to copy arbitrary files and
directories onto the local system. As part of this it will also replace
spaces in path names with underscores (for example if copying data from
a Windows-based system to a Linux system).

.. note::

   The initial use case for this utility is to copy image files for
   10x Genomics Visium datasets into the ``Visium_images`` subdirectory
   of the project directory.

General usage is:

::

    fetch_data.py SOURCE_DIR LOCAL_DIR

which will import the contents of the directory ``SOURCE_DIR`` and put
them into ``LOCAL_DIR``.

By default the directory structure of the imported data is preserved,
but the ``--flatten`` option can be used to copy files without the
intermediate subdirectories.

If an existing file is found in the destination with the same name as
an imported file then the copy is skipped for that file; specifying the
``--overwrite`` option means that the existing file will be replaced
with the imported version.

``SOURCE_DIR`` can either be a local directory, or a directory on a
remote server. The syntax ``[USER@]HOST:DIR`` can be used to specify a
remote source.

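
The renaming and ``--flatten`` behaviour described above can be pictured
with a small Python sketch. This is an illustration only and is not taken
from ``fetch_data.py``: the function name ``destination_path`` is
hypothetical, and it assumes that spaces are replaced in every component
of the relative path and that flattening keeps only the file name.

.. code-block:: python

    from pathlib import PurePosixPath


    def destination_path(rel_path, dest_dir, flatten=False):
        """Map a path relative to the source onto a path under dest_dir."""
        # Assumption: spaces are replaced in directory names as well as
        # in file names
        parts = [part.replace(" ", "_")
                 for part in PurePosixPath(rel_path).parts]
        if flatten:
            # Drop the intermediate subdirectories, keep the file name
            parts = parts[-1:]
        return str(PurePosixPath(dest_dir, *parts))

Under these assumptions,
``destination_path("Slide 1/scan 01.tif", "Visium_images")`` gives
``Visium_images/Slide_1/scan_01.tif``, and passing ``flatten=True`` gives
``Visium_images/scan_01.tif``.
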
A single file can also be imported:

::

    fetch_data.py SOURCE_FILE DEST

which will copy ``SOURCE_FILE`` into directory ``DEST`` (if ``DEST`` is
an existing directory), or copy and rename as file ``DEST`` (if ``DEST``
doesn't exist).

.. _manage_fastqs:

``manage_fastqs.py``: managing and copying Fastq files
******************************************************

The ``manage_fastqs.py`` utility can be used to explore and copy data
from an analysis directory to another location on a per-project basis.

For example, to get a list of projects within a run::

    manage_fastqs.py ANALYSIS_DIR

.. note::

   If a project contains multiple sets of FASTQs then these will be
   listed under the project name, with the default "primary" set
   marked with an asterisk.

To see a list of the FASTQ files associated with a particular project::

    manage_fastqs.py ANALYSIS_DIR PROJECTNAME

To copy the FASTQs to a local or remote directory, use the ``copy``
command::

    manage_fastqs.py ANALYSIS_DIR PROJECTNAME copy /path/to/local/dir
    manage_fastqs.py ANALYSIS_DIR PROJECTNAME copy me@remote.org:/path/to/remote/dir

This will also generate an MD5 checksum file for the transferred files;
to generate the MD5 sums on their own, use the ``md5`` command::

    manage_fastqs.py ANALYSIS_DIR PROJECTNAME md5

Finally, to make a zip file containing the Fastqs, use the ``zip``
command::

    manage_fastqs.py ANALYSIS_DIR PROJECTNAME zip

.. note::

   The ``zip`` command works best if the Fastqs are relatively small.

Working with multiple Fastq sets
--------------------------------

If a project has more than one Fastq set associated with it then by
default the operations described above will use the "primary" set
(typically, the set of Fastq files in the ``fastqs`` subdirectory of
the project).
To operate on an alternative set, use the ``--fastq_dir`` option to
switch, e.g.::

    manage_fastqs.py ANALYSIS_DIR PROJECTNAME --fastq_dir=ALT_FASTQS_DIR

Handling subsets of files
-------------------------

Use the ``--filter`` option to work with a subset of files - this allows
a 'glob'-style pattern to be specified so that only files with matching
names will be included.

For example, to only copy ``R1`` files (the pattern is quoted so that
the shell passes it through unexpanded)::

    manage_fastqs.py ANALYSIS_DIR PROJECTNAME copy /path/to/local/dir --filter "*_R1_*"

.. _transfer_data:

``transfer_data.py``: copying data for transfer to end users
************************************************************

Overview
--------

The ``transfer_data.py`` utility can be used to copy data from analysis
projects to different destinations, typically to transfer copies of
data to end users.

A destination is defined as a local or remote directory where files
will be copied. For example, in its most basic mode:

::

    transfer_data.py /mnt/data/shared PROJECT_DIR

will copy Fastq files from the project referenced by ``PROJECT_DIR`` to
the local directory ``/mnt/data/shared``.

Destinations can also be defined in the configuration file (see
:ref:`data_transfer_destinations`) and then referred to by their name
when copying the Fastqs. For example:

::

    transfer_data.py webserver PROJECT_DIR

where ``webserver`` is a pre-defined destination.

A job runner can be specified in the configuration file to use for
computationally-intensive operations (for example copying large files
or creating archives), by setting the ``transfer_data`` runner (see
:ref:`job_runners`). If this isn't set then the ``rsync`` runner will
be used instead.

The ``--dry-run`` option can be used to check what the utility will do
without actually performing any operations.

Schemes for dynamic subdirectory specification
----------------------------------------------

By default the data are copied directly to the specified directory.
However it is possible to specify a scheme for dynamic subdirectory
assignment, which can be useful for example if copying to a webserver.

The scheme can be specified via either the ``--subdir`` command line
option or the ``subdir`` parameter in the configuration file.

The following schemes are available:

============== ==========================================
Scheme name    Behaviour
============== ==========================================
``random_bin`` Locates an empty pre-existing subdirectory
               (aka 'bin') at random
``run_id``     Creates a new subdirectory named
               ``PLATFORM_DATESTAMP.RUN_NUMBER-PROJECT``
               (must not already exist)
============== ==========================================

Generating a README file from a template
----------------------------------------

It is possible to generate a ``README`` for the copied data by
specifying a template file via either the ``--readme`` command line
option or the ``readme_template`` parameter in the configuration file.

The template should be a plain text file but it can also contain
placeholders for 'template variables' which will be substituted with
the appropriate values when the ``README`` file is generated:

================ =================================
Placeholder      Value
================ =================================
``%PLATFORM%``   Run platform (uppercase)
``%RUN_NUMBER%`` Run number
``%DATESTAMP%``  Run datestamp
``%PROJECT%``    Name of project being copied
``%WEBURL%``     Base URL for the webserver
``%BIN%``        Name of the subdirectory, if any
``%DIR%``        Directory data were copied to
``%TODAY%``      Today's date
================ =================================

Including downloader, QC reports and 10xGenomics pipeline outputs
-----------------------------------------------------------------

By default only Fastqs are copied by ``transfer_data.py``, however it
is possible to include additional files:

* A standalone downloader script (see :ref:`download_fastqs`)
  (specify the ``--include_downloader`` option or set the
  ``include_downloader`` parameter in the configuration);
* The zipped QC reports for the project (specify the
  ``--include_qc_report`` option or set the ``include_qc_report``
  parameter);
* Outputs from 10xGenomics pipelines (e.g. ``cellranger count``)
  packaged into a ``tgz`` archive (specify the
  ``--include_10x_outputs`` option);
* ``.cloupe`` files from 10xGenomics pipeline outputs collected into
  a ``.zip`` archive (specify the ``--include_cloupe_files`` option);
* Image files in the ``Visium_images`` subdirectory of 10xGenomics
  Visium datasets packaged into a ``tgz`` archive (specify the
  ``--include_visium_images`` option).

Hard linking Fastqs
-------------------

When sharing Fastqs via a local directory which is on the same file
system as the original files, it is possible to make hard links to the
Fastqs rather than making copies by specifying the ``--link`` option
(or setting the ``hard_links`` parameter).

Linking Fastqs is quicker than copying and saves space as hard links
reference the same copy of the file's data on the file system.

Bundling Fastqs into ZIP archives
---------------------------------

For datasets which contain very large numbers of Fastq files it may be
undesirable to share the individual Fastqs (for example when
downloading from a web server, or uploading to a file transfer service
such as ZendTo).

In these cases the following options can be used:

* ``--zip_fastqs`` will bundle the Fastqs into one or more ZIP archives
  (instead of copying each Fastq individually)
* If specified then the ``--max_zip_size`` option additionally sets the
  maximum size for each ZIP archive, resulting in multiple ZIPs if the
  dataset cannot be put into a single archive of this size.

.. _download_fastqs:

``download_fastqs.py``: fetch Fastqs from a webserver in batch
**************************************************************

Fastq files pushed to a webserver using ``manage_fastqs.py`` can be
retrieved in batch using the ``download_fastqs.py`` utility::

    download_fastqs.py http://example.com/fastqs/

This fetches the checksum file from the URL and then uses that to get a
list of Fastq files to download. Once the files are downloaded it runs
the Linux ``md5sum`` program to verify the integrity of the downloads.

.. note::

   This utility is stand-alone so it can be sent to end users and used
   independently of other components of the autoprocess package.

.. _update_project_metadata:

``update_project_metadata.py``: manage metadata associated with a project
*************************************************************************

The projects within a run each have a file called ``README.info`` which
is used to hold metadata about that project (for example, user, PI,
organism, library type and so on).

Use the ``update_project_metadata.py`` utility to check and update the
metadata associated with a project, for example to update the PI::

    update_project_metadata.py ANALYSIS_DIR PROJECT -u PI="Andrew Jones"

.. note::

   Project directories created using very old versions of
   ``auto_process``, or predating the automated processing system,
   might not have metadata files. To create one use::

       update_project_metadata.py ANALYSIS_DIR PROJECT -i

   before using ``-u`` to populate the fields.

.. _audit_projects:

``audit_projects.py``: auditing disk usage for multiple runs
************************************************************

Collections of runs that are copied to an 'archive' location via the
``archive`` function of ``auto_process.py`` will form a directory
structure of the form::

    ARCHIVE_DIR/
         |
         +--- 2015/
                |
                +--- hiseq/
                       |
                       +--- 150429_HISEQ_XXYYY_12345BB_analysis/
                       |
                       +--- 150408_HISEQ_XXYYY_67890CC_analysis/
                       |
                       .

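
The layout shown above can also be walked programmatically, for example
to collect the run directories under an archive before auditing them.
The following is a minimal sketch and is not part of the package: the
function name ``find_run_dirs`` is hypothetical, and it assumes that run
directories always sit two levels below the archive root (year, then
platform) and have names ending in ``_analysis``, as in the tree above.

.. code-block:: python

    from pathlib import Path


    def find_run_dirs(archive_dir):
        """List run directories matching ARCHIVE_DIR/YEAR/PLATFORM/*_analysis."""
        return sorted(
            path
            for path in Path(archive_dir).glob("*/*/*_analysis")
            # Skip any plain files whose names happen to match
            if path.is_dir()
        )

Calling ``find_run_dirs("ARCHIVE_DIR")`` returns a sorted list of
``pathlib.Path`` objects, one per run directory found.
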

Within each run dir there will be one or more project directories.

The projects can be audited according to PI and disk usage using the
``audit_projects.py`` utility, for example::

    audit_projects.py ARCHIVE_DIR/2015/hiseq/

Multiple directories can be specified, e.g.::

    audit_projects.py ARCHIVE_DIR/2015/hiseq/ ARCHIVE_DIR/2014/hiseq/

This will print out a summary of usage for each PI, e.g.::

    Summary (PI, # of projects, total usage):
    =========================================
    Peter Brooks    12   3.7T
    Trevor Smith    8    2.3T
    Donald Raymond  6    2.2T
    ...
    Total usage     164  22.3T

plus a breakdown of the usage for each of the projects belonging to
each PI, for example::

    Breakdown by PI/project:
    ========================
    Peter Brooks:
        150121_HISEQ001_0123_ABCD123XX: SteveAustin  128.1G
        150306_HISEQ001_0234_ABCD123XX: MartinLouis  159.7G
        150415_HISEQ001_0345_ABCD123XX: MartinLouis  72.8G
    ...

There is also a summary of the amount of space used for storing the
'undetermined' read data, for each run.

.. note::

   The disk usage for each file is calculated by using Python's
   ``os.lstat`` function to get the number of 512-byte blocks per file.
   The total usage is then the sum of all the files and directories.

   However these values can differ from the sizes returned by the Linux
   ``du`` program, for various reasons including using a different
   block size (e.g. ``du`` uses 1024-byte blocks). So the returned
   values should not be treated as absolutes.

.. _exporting_to_galaxy:

Exporting Fastqs to a data library in a local Galaxy instance
*************************************************************

Upload of Fastq files from a run into a data library on a Galaxy
instance can be performed using the ``nebulizer`` utility.

.. note::

   You will need access to an admin account on the target Galaxy server
   to create and add to the data libraries.

The ``create_library`` and ``create_library_folder`` commands can be
used to make the target data library and folder, if these don't already
exist - for example:

::

    nebulizer create_library MyGalaxy "MISEQ_190626#26" \
        --description "Data from MISEQ run 26 datestamp 190626"
    nebulizer create_library_folder MyGalaxy "MISEQ_190626#26/Fastqs"

would create a data library called *MISEQ_190626#26* on the *MyGalaxy*
instance, and a new folder called *Fastqs* within that library.

Then the ``add_library_datasets`` command can be used to upload Fastqs
to the library. To upload files from the local system to the server:

::

    nebulizer add_library_datasets MyGalaxy /path/to/fastqs/PB_S1_R1_001.fastq.gz ...

If the files are on the same system as the Galaxy server then the
``--server`` option can be used, for example:

::

    nebulizer add_library_datasets mygalaxy --server Data_Library/Fastqs /path/to/fastqs/on/server/PB_S1_R1_001.fastq.gz ...

It is possible in this case to get Galaxy to create links to the Fastqs
(rather than making copies) which can potentially save time and disk
space, by including the ``--link`` option:

::

    nebulizer add_library_datasets mygalaxy --server --link Data_Library/Fastqs /path/to/fastqs/on/server/PB_S1_R1_001.fastq

.. warning::

   Making links only seems to work for uncompressed Fastq files.

For information on ``nebulizer`` see
https://nebulizer.readthedocs.io/en/latest/