Setting up project directories using auto_process setup_analysis_dirs
Following Fastq generation using the make_fastqs command, typically
the next step is to create and populate the project directories ready for
subsequent QC and analysis.
This is done using the setup_analysis_dirs command, for example:
auto_process.py setup_analysis_dirs
This reads the projects.info metadata file (initially created by
make_fastqs) and creates a new subdirectory for each listed project.
Before running setup_analysis_dirs, the projects.info file should
be edited to fill in the following information for each project:
User: the name(s) of the user(s) associated with the project
PI: the name(s) of the principal investigator(s) (PIs) associated with the project
Library: the library or application type (for example “RNA-seq”, “ChIP-seq” etc)
Organism: the organism(s) that the samples in the project originally came from (for example “Human”, “Mouse”, “D. Melanogaster” etc)
SC_Platform: the single-cell platform used to prepare the samples (if appropriate).
See Projects metadata file: projects.info for more information on the
format of the projects.info file and the allowed values for each
field.
Note
setup_analysis_dirs checks that, at minimum, each project has
non-null values for the user, PI, library and organism fields. To
skip this check, specify the --ignore-missing-metadata option.
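It can be convenient to spot missing values before running the command. The following is a minimal sketch of such a pre-flight check, not the tool's own implementation. It assumes that projects.info is tab-delimited with a header line starting with "#", that the column names match those in the example data, and that "." marks an unset value; the function name find_missing_metadata and the sample projects are hypothetical. Check these assumptions against your own file and the format documentation.

```python
import csv

# Assumptions (not confirmed by this page): tab-delimited columns, a header
# line starting with '#', and '.' standing in for an unset value.
REQUIRED_FIELDS = ("User", "PI", "Library", "Organism")
NULL_VALUES = {"", "."}


def find_missing_metadata(text):
    """Return {project_name: [missing required fields]} for the file content."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].lstrip("#").split("\t")
    missing = {}
    for row in csv.reader(lines[1:], delimiter="\t"):
        record = dict(zip(header, row))
        unset = [field for field in REQUIRED_FIELDS
                 if record.get(field, "").strip() in NULL_VALUES]
        if unset:
            # The first column is taken to hold the project name
            missing[record[header[0]]] = unset
    return missing


# Hypothetical file content for illustration only
example = (
    "#Project\tSamples\tUser\tLibrary\tSC_Platform\tOrganism\tPI\tComments\n"
    "AB\tAB1,AB2\tAlice Bee\tRNA-seq\t.\tHuman\tCarol Dee\t.\n"
    "CD\tCD1\t.\tChIP-seq\t.\tMouse\t.\t.\n"
)
print(find_missing_metadata(example))  # {'CD': ['User', 'PI']}
```

An empty result means every project has the four required fields filled in under these assumptions.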
Note
In addition to the projects listed in projects.info, if the
outputs from make_fastqs included ‘undetermined’ Fastqs then
these will be copied to an undetermined “project”.
Once the project directories have been created, the next step is to run the QC pipeline - see Running the QC.
Additional files and directories
For certain types of data there are additional files that should
be added to the project directory manually after running
setup_analysis_dirs.
Note that for some of these files setup_analysis_dirs will
generate partially-populated template versions, with the
.template extension. These can be edited and renamed before use
in downstream processing stages (e.g. the QC pipeline).
For 10x Genomics Visium data an empty Visium_images subdirectory
will also be created (see Processing 10x Genomics Visium spatial transcriptomics data).
Adding an identifier to project directory names
Sometimes it can be useful to create multiple projects in
parallel from the same projects.info data (for example,
when a run has been processed in several different ways -
see Processing a single run multiple times).
In this case the --id option of the setup_analysis_dirs
command can be used to create project directories where
each name is taken from the projects.info file but has
an identifier (e.g. no_trimming) appended.
For example:
auto_process.py setup_analysis_dirs --id=no_trimming
would produce projects named <PROJECT>_no_trimming.
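The naming rule can be illustrated with a short sketch. This only mirrors the <PROJECT>_<ID> pattern described above; the helper project_dir_name and the project names AB and CD are hypothetical and are not part of auto_process.py.

```python
def project_dir_name(project, identifier=None):
    """Return the project directory name, with the identifier appended if given."""
    return f"{project}_{identifier}" if identifier else project


# Hypothetical project names as they might appear in projects.info
for name in ("AB", "CD"):
    print(project_dir_name(name, "no_trimming"))
# AB_no_trimming
# CD_no_trimming
```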
When multiple bcl2fastq output directories exist in the
same analysis directory, the --id option can be paired with
the --unaligned-dir option to produce sets of projects
derived from specific bcl2fastq outputs.
For example:
auto_process.py setup_analysis_dirs \
--unaligned-dir=bcl2fastq_no_trimming --id=no_trimming
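When there are several output directories, the same command has to be repeated once per directory. The sketch below shows one way to generate those command lines from Python, using only the --unaligned-dir and --id options described above. The function setup_commands and the bcl2fastq_trimmed/trimmed pair are hypothetical examples. The script only prints the commands; it does not run them.

```python
import shlex


def setup_commands(variants):
    """Build one setup_analysis_dirs command per (unaligned_dir, identifier) pair."""
    commands = []
    for unaligned_dir, identifier in variants:
        commands.append([
            "auto_process.py",
            "setup_analysis_dirs",
            f"--unaligned-dir={unaligned_dir}",
            f"--id={identifier}",
        ])
    return commands


# Hypothetical output directories and the identifiers to pair with them
variants = [
    ("bcl2fastq_no_trimming", "no_trimming"),
    ("bcl2fastq_trimmed", "trimmed"),
]
for cmd in setup_commands(variants):
    # shlex.join produces a shell-safe string for review or copy-paste
    print(shlex.join(cmd))
```

Each argument list could also be passed to subprocess.run(cmd, check=True) from within the analysis directory, assuming auto_process.py is on the PATH.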