Setting up project directories using auto_process setup_analysis_dirs

Following Fastq generation using the make_fastqs command, typically the next step is to create and populate the project directories ready for subsequent QC and analysis.

This is done using the setup_analysis_dirs command, for example:

auto_process.py setup_analysis_dirs

This reads the projects.info metadata file (initially created by make_fastqs) and creates a new subdirectory for each listed project.

Before running setup_analysis_dirs, edit the projects.info file to fill in the following information for each project:

  • User: the name(s) of the user(s) associated with the project

  • PI: the name(s) of the principal investigator(s) (PIs) associated with the project

  • Library: the library or application type (for example “RNA-seq”, “ChIP-seq” etc)

  • Organism: the organism(s) that the samples in the project originally came from (for example “Human”, “Mouse”, “D. Melanogaster” etc)

  • SC_Platform: the single-cell platform used to prepare the samples (if appropriate).

See Projects metadata file: projects.info for more information on the format of the projects.info file and the allowed values for each field.

Note

setup_analysis_dirs checks that, at minimum, each project has non-null values for the user, PI, library and organism fields. To skip this check, specify the --ignore-missing-metadata option.
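
The metadata check described above can be approximated before running the command with a short shell sketch. It assumes that projects.info is tab-delimited, that its first line is a header beginning with # which names the columns (including Project, User, PI, Library and Organism), that commented-out lines also begin with #, and that a missing value appears as an empty field or a single "." character. These layout details are assumptions and should be confirmed against the projects.info documentation. The function name check_projects_info is hypothetical and is not part of auto_process.

```sh
# Hypothetical helper (not part of auto_process): report projects in a
# projects.info file that lack User, PI, Library or Organism values.
# ASSUMPTIONS: tab-delimited file; header line starts with '#' and names
# the columns; a missing value is an empty field or a single '.'.
check_projects_info() {
    # $1 = path to projects.info
    awk -F'\t' '
        NR == 1 {
            sub(/^#/, "", $1)                 # strip leading "#" from header
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        /^#/ || NF == 0 { next }              # skip commented-out or blank lines
        {
            n = split("User PI Library Organism", required, " ")
            for (j = 1; j <= n; j++) {
                name = required[j]
                value = (name in col) ? $(col[name]) : ""
                if (value == "" || value == ".") {
                    printf "%s: missing %s\n", $(col["Project"]), name
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }            # non-zero status if anything is missing
    ' "$1"
}
```

Running check_projects_info projects.info prints one line per missing value and returns a non-zero status if any were found, so it can be used as a guard before invoking setup_analysis_dirs.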

Note

In addition to the projects listed in projects.info, if the outputs from make_fastqs included ‘undetermined’ Fastqs then these will be copied to an undetermined “project”.

Once the project directories have been created, the next step is to run the QC pipeline - see Running the QC.

Additional files and directories

For certain types of data there are additional files that should be added to the project directory manually after running setup_analysis_dirs.

Note that for some of these files setup_analysis_dirs will generate partially-populated template versions, with the .template extension.

These can be edited and renamed before use in downstream processing stages (e.g. the QC pipeline).
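
As a rough illustration of the renaming step, the sketch below copies each template in a project directory to a file of the same name without the .template extension, leaving the original template in place. It assumes that the expected final name is simply the template name with the extension removed, which should be checked against the documentation for the relevant data type. The function name instantiate_templates is hypothetical.

```sh
# Hypothetical helper (not part of auto_process): copy each *.template
# file in a project directory to the same name without the extension.
# ASSUMPTION: the final file name is the template name minus ".template".
# Existing non-template files are never overwritten.
instantiate_templates() {
    # $1 = path to a project directory
    for template in "$1"/*.template; do
        [ -e "$template" ] || continue        # no templates present
        target=${template%.template}
        if [ -e "$target" ]; then
            echo "skipping $target (already exists)"
        else
            cp "$template" "$target"
            echo "created $target (edit before use)"
        fi
    done
}
```

The copied files still hold only the partially-populated template content, so they need to be edited before any downstream stage reads them.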

For 10x Genomics Visium data an empty Visium_images subdirectory will also be created (see Processing 10x Genomics Visium spatial transcriptomics data).

Adding an identifier to project directory names

Sometimes it can be useful to create multiple projects in parallel from the same projects.info data (for example, when a run has been processed in several different ways - see Processing a single run multiple times).

In this case the --id option of the setup_analysis_dirs command can be used to create project directories where each name is taken from the projects.info file but has an identifier (e.g. no_trimming) appended.

For example:

auto_process.py setup_analysis_dirs --id=no_trimming

would produce projects named <PROJECT>_no_trimming.
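
To preview the names that a given identifier would produce, a sketch such as the following can be used. It assumes that the project name is the first tab-delimited column of projects.info and that header and commented-out lines begin with #. Both are assumptions about the file layout, and the function name preview_project_dirs is hypothetical.

```sh
# Hypothetical helper (not part of auto_process): list the project
# directory names expected when an identifier is supplied via --id.
# ASSUMPTIONS: project name is the first tab-delimited column; header
# and commented-out lines start with '#'.
preview_project_dirs() {
    # $1 = path to projects.info, $2 = identifier (e.g. no_trimming)
    awk -F'\t' -v id="$2" '
        /^#/ || NF == 0 { next }              # skip header, comments, blanks
        { print $1 "_" id }                   # <PROJECT>_<identifier>
    ' "$1"
}
```

For example, preview_project_dirs projects.info no_trimming prints one <PROJECT>_no_trimming name per listed project.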

When multiple bcl2fastq output directories exist in the same analysis directory, the --id option can be paired with the --unaligned-dir option to produce sets of projects derived from specific bcl2fastq outputs.

For example:

auto_process.py setup_analysis_dirs \
   --unaligned-dir=bcl2fastq_no_trimming --id=no_trimming
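
When there are several bcl2fastq output directories, the same pairing of options can be generated in a loop. The sketch below only prints the commands rather than running them, so they can be reviewed first. It assumes that each output directory is named bcl2fastq_<identifier>; the directory name bcl2fastq_trimmed and the function name print_setup_commands are hypothetical and used purely for illustration.

```sh
# Hypothetical helper (not part of auto_process): print one
# setup_analysis_dirs command per bcl2fastq output directory, deriving
# the identifier from the directory name.
# ASSUMPTION: output directories are named bcl2fastq_<identifier>.
print_setup_commands() {
    for unaligned_dir in "$@"; do
        id=${unaligned_dir#bcl2fastq_}        # strip the assumed prefix
        printf 'auto_process.py setup_analysis_dirs --unaligned-dir=%s --id=%s\n' \
            "$unaligned_dir" "$id"
    done
}

# Example with one directory from the text and one hypothetical directory
print_setup_commands bcl2fastq_no_trimming bcl2fastq_trimmed
```

Printing the commands instead of executing them keeps the sketch safe to run in any directory; the output can be pasted into a shell once it has been checked.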