Setting up project directories using `auto_process setup_analysis_dirs`

Following Fastq generation using the make_fastqs command, typically the next step is to create and populate the project directories ready for subsequent QC and analysis.

This is done using the setup_analysis_dirs command, for example:

auto_process.py setup_analysis_dirs

This reads the projects.info metadata file (initially created by make_fastqs) and creates a new subdirectory for each listed project.

Before running setup_analysis_dirs, the projects.info file should be edited to fill in the following information for each project:

User: the name(s) of the user(s) associated with the project
PI: the name(s) of the principal investigator(s) (PIs) associated with the project
Library: the library or application type (for example “RNA-seq”, “ChIP-seq” etc)
Organism: the organism(s) that the samples in the project originally came from (for example “Human”, “Mouse”, “D. Melanogaster” etc)
SC_Platform: the single-cell platform used to prepare the samples (if appropriate).

See Projects metadata file: projects.info for more information on the format of the projects.info file and the allowed values for each field.

Note

setup_analysis_dirs checks that, at minimum, each project has non-null values for the user, PI, library and organism fields. To skip this check, specify the --ignore-missing-metadata option.
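The check described in the note can be approximated before running the command, so that missing values are caught while projects.info is still being edited. The following is a minimal sketch, not the check that setup_analysis_dirs itself performs. The function name check_projects_info is hypothetical, and the sketch assumes that projects.info is tab-delimited, that its first line is a header beginning with # which names the columns (including User, PI, Library and Organism), and that a null value is written as "." or left empty.

```shell
# Hypothetical helper: report projects with missing metadata.
# Assumptions (not confirmed by this page): the file is tab-delimited,
# the first line is a "#Project ..." header naming the columns, and a
# null value is written as "." or left empty.
check_projects_info() {
    awk -F'\t' '
        NR == 1 {
            # Map each column name in the header to its position
            sub(/^#/, "", $1)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NF == 0 { next }    # skip blank lines
        /^#/    { next }    # skip commented-out lines
        {
            n = split("User PI Library Organism", required, " ")
            for (j = 1; j <= n; j++) {
                name = required[j]
                if (!(name in col)) continue
                value = $(col[name])
                if (value == "" || value == ".") {
                    printf "%s: missing %s\n", $1, name
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Under these assumptions, running check_projects_info projects.info prints one line per missing field and returns a non-zero status if anything is missing, so it could be used to guard the call to setup_analysis_dirs.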

Note

In addition to the projects listed in projects.info, if the outputs from make_fastqs included ‘undetermined’ Fastqs then these will be copied to an undetermined “project”.

Once the project directories have been created, the next step is to run the QC pipeline - see Running the QC.

Additional files and directories

For certain types of data there are additional files that should be added to the project directory manually after running setup_analysis_dirs.

Note that for some of these files setup_analysis_dirs will generate partially-populated template versions, with the .template extension. These can be edited and renamed before use in downstream processing stages (e.g. the QC pipeline).
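As an illustration of the edit-and-rename step, the sketch below creates a working copy of each template with the .template extension removed. It is an assumption-laden example rather than part of auto_process: the function name instantiate_templates is hypothetical, and the sketch copies rather than renames so that the generated template is kept for reference. Existing files are never overwritten.

```shell
# Hypothetical helper: copy each *.template file in a project directory
# to the same name without the .template extension, ready for manual
# editing. Files that already exist are left untouched.
instantiate_templates() {
    project_dir=$1
    for template in "$project_dir"/*.template; do
        [ -e "$template" ] || continue    # no templates in this directory
        target=${template%.template}
        if [ -e "$target" ]; then
            echo "skipping $target (already exists)"
        else
            cp "$template" "$target"
            echo "created $target"
        fi
    done
}
```

The copied files still need to be completed by hand before they are used by later stages.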

For 10x Genomics Visium data an empty Visium_images subdirectory will also be created (see Processing 10x Genomics Visium spatial transcriptomics data).

Adding an identifier to project directory names

Sometimes it can be useful to create multiple projects in parallel from the same projects.info data (for example, when a run has been processed in several different ways - see Processing a single run multiple times).

In this case the --id option of the setup_analysis_dirs command can be used to create project directories where each name is taken from the projects.info file but has an identifier (e.g. no_trimming) appended.

For example:

auto_process.py setup_analysis_dirs --id=no_trimming

would produce projects named <PROJECT>_no_trimming.

When multiple bcl2fastq output directories exist in the same analysis directory, the --id option can be paired with the --unaligned-dir option to produce sets of projects derived from specific bcl2fastq outputs.

For example:

auto_process.py setup_analysis_dirs \
   --unaligned-dir=bcl2fastq_no_trimming --id=no_trimming
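When there are several bcl2fastq output directories, the same pairing of options can be generated in a loop. The sketch below only prints the commands so that they can be reviewed before being run. It assumes that each output directory is named bcl2fastq_<ID>, following the example above; the function name print_setup_commands and the directory bcl2fastq_trimmed are hypothetical.

```shell
# Print (rather than run) one setup_analysis_dirs command for each
# bcl2fastq output directory given as an argument.
# Assumption: each directory is named bcl2fastq_<ID>, so the identifier
# is obtained by stripping the "bcl2fastq_" prefix.
print_setup_commands() {
    for unaligned_dir in "$@"; do
        id=${unaligned_dir#bcl2fastq_}
        echo "auto_process.py setup_analysis_dirs --unaligned-dir=$unaligned_dir --id=$id"
    done
}

# bcl2fastq_trimmed is a made-up directory name used for illustration
print_setup_commands bcl2fastq_no_trimming bcl2fastq_trimmed
```

Each printed line uses only the --unaligned-dir and --id options described in this section, and would produce projects named <PROJECT>_<ID> for the corresponding bcl2fastq output.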
