Starting an analysis using `auto_process setup`

The setup command is used to start a new analysis: it initialises a new analysis directory for processing a sequencing run. Most subsequent auto_process commands (for example make_fastqs) are normally issued from within the analysis directory.

The simplest invocation of the command is:

auto_process.py setup DATA_DIR

where DATA_DIR specifies the location of the data source (i.e. the top-level directory containing the output from the sequencer run which is to be processed). For example:

auto_process.py setup /mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

DATA_DIR can be a local or a remote directory; see Specifying a remote sequencing run directory.

By default, the name of the new analysis directory is the basename of DATA_DIR with the suffix _analysis appended. For example, the command above will produce an analysis directory called:

180817_M00123_0001_000000000-BV1X2_analysis
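
When scripting around the setup command it can be useful to predict this default name in advance. The following is a minimal sketch of the naming rule described above; it does not call auto_process.py, and the variable names are illustrative only.

```shell
# Illustrative only: reproduce the default naming rule
# (basename of DATA_DIR with the "_analysis" suffix appended)
DATA_DIR=/mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2
ANALYSIS_DIR="$(basename "$DATA_DIR")_analysis"
echo "$ANALYSIS_DIR"
```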

To create an analysis directory with a different name, specify it using the --analysis-dir option:

auto_process.py setup DATA_DIR --analysis-dir ANALYSIS_DIR

See Analysis Directories for details of the analysis directory structure.

The setup command will also report the expected outputs based on the sample sheet associated with the sequencing run, for example:

Predicted projects:
===================
- JohnBleakley
- LauraBridges
- MarcusDreng
- StevenYound

JohnBleakley (24 samples)
-------------------------
JB1  S11     TGCGGCGT-TACCGAGG       L3,4
JB2  S12     CATAATAC-CGTTAGAA       L3,4
JB3  S13     GATCTATC-AGCCTCAT       L3,4
JB4  S14     AGCTCGCT-GATTCTGC       L3,4
JB5  S15     CGGAACTG-TCGTAGTG       L3,4
JB6  S16     TAAGGTCA-CTACGACA       L3,4
...

It will also flag any potential issues; for example, two very similar project names might indicate a typo in the sample sheet.

Note

The output prediction can also be generated using the auto_process samplesheet command.

Specifying the sample sheet file

The setup command will try to locate the sample sheet within the source data and will use this by default.

However, if the sequencing run doesn’t include a sample sheet file (as is the case for NextSeq runs, for example), or if you want to use an alternative sample sheet, then the -s/--sample-sheet option can be used to specify the sample sheet file explicitly.

For example:

auto_process.py setup \
   --sample-sheet /mnt/data/samplesheets/SampleSheet_180817.csv \
   /mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

The sample sheet can also be on a remote system, for example:

auto_process.py setup \
   --sample-sheet pjb@kellerman.man.ac.uk:samplesheets/SampleSheet_180817.csv \
   /mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

or it can be a URL:

auto_process.py setup \
   --sample-sheet https://example.com/samplesheets/SampleSheet_180817.csv \
   /mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

Specifying a remote sequencing run directory

For data on a remote system which is accessible via ssh, the DATA_DIR can be specified using the general syntax

[[USER@]HOST:]DATA_DIR

For example:

pjb@kellerman.man.ac.uk:/mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

(The --sample-sheet option accepts the same syntax.)
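
To make the syntax concrete, the sketch below splits a location into its optional user, optional host and directory parts. It is an illustration of the [[USER@]HOST:]DATA_DIR form only, not the parser used by auto_process; the function name split_location and the loc_* variables are hypothetical. Absolute local paths and URLs are deliberately left unsplit.

```shell
# Illustrative helper (not part of auto_process): split a location of the
# form [[USER@]HOST:]DATA_DIR into loc_user, loc_host and loc_dir
split_location() {
    loc_user=""
    loc_host=""
    loc_dir=$1
    case $1 in
        /*|*://*)
            # Absolute local path or URL: nothing to split
            ;;
        *:*)
            loc_host=${1%%:*}    # everything before the first colon
            loc_dir=${1#*:}      # everything after the first colon
            case $loc_host in
                *@*)
                    loc_user=${loc_host%%@*}
                    loc_host=${loc_host#*@}
                    ;;
            esac
            ;;
    esac
}

split_location pjb@kellerman.man.ac.uk:/mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2
echo "user=$loc_user host=$loc_host dir=$loc_dir"
```

If USER@ is omitted then loc_user is left empty, and a location with no HOST: prefix is treated as a local directory.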

Note

It is recommended either to configure passwordless ssh access or to use ssh-agent for the current session; otherwise a password prompt may appear each time the remote system is accessed.

Specifying the facility run number

The facility run number can be explicitly specified using the -r/--run-number option of the setup command.

Specifying the analysis number

An arbitrary number can be assigned to the analysis using the -n/--analysis-number option of the setup command.

Note

If an analysis number is assigned at setup then it will be appended to the analysis directory name, unless this is overridden by the --analysis-dir option.
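
As a sketch of how the run number and analysis number options might be supplied, the snippet below assembles and prints a setup command without running it. The values 42 and 2 are hypothetical, and it is an assumption here that each option takes a single integer argument and that the two options can be combined in one invocation.

```shell
# Illustrative dry run: print the command rather than executing it
DATA_DIR=/mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2
RUN_NUMBER=42        # hypothetical facility run number
ANALYSIS_NUMBER=2    # hypothetical analysis number
CMD="auto_process.py setup --run-number $RUN_NUMBER --analysis-number $ANALYSIS_NUMBER $DATA_DIR"
echo "$CMD"
```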

Specifying additional files

If additional files are required for processing or downstream analysis (e.g. well list files) then the -f/--file option of the setup command can be used to specify one or more additional files which will be copied into the analysis directory.

For example:

auto_process.py setup \
   --file WTA_probe_allocation.xlsx \
   /mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

Files can either be local or on a remote system, or can be specified as URLs. Multiple --file options can be specified to import more than one file.
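
When several files need to be imported, the repeated --file options can be built up in a loop. The following is a minimal sketch which prints the resulting command instead of running it; the second file name (well_list.txt) is hypothetical.

```shell
# Illustrative dry run: build one --file option per additional file
DATA_DIR=/mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2
set --    # start with an empty argument list
for f in WTA_probe_allocation.xlsx well_list.txt; do
    set -- "$@" --file "$f"
done
# Print the command; drop the leading "echo" to run it for real
echo auto_process.py setup "$@" "$DATA_DIR"
```

Keeping the options in the positional parameters (rather than in a single string) means file names containing spaces are passed through intact.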

Setup from existing bcl2fastq output

A new analysis directory can be created from an existing bcl2fastq output directory using the --fastq-dir option, which specifies the subdirectory of DATA_DIR that contains the output Fastq files.

For example:

auto_process.py setup \
   --fastq-dir bcl2fastq2 \
   /mnt/data/seqruns/180817_M00123_0001_000000000-BV1X2

where bcl2fastq2 is the output directory from the BCL-to-Fastq conversion software, within the run data directory 180817_M00123_0001_000000000-BV1X2.
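
Because --fastq-dir is given relative to DATA_DIR, it can be worth confirming that the subdirectory exists before running setup. The check below is a hypothetical pre-flight helper for local run directories, not something auto_process provides; it is demonstrated against a throwaway directory standing in for a real run.

```shell
# Illustrative pre-flight check (not part of auto_process):
# $1 is the run data directory, $2 the Fastq subdirectory name
check_fastq_dir() {
    if [ -d "$1/$2" ]; then
        return 0
    fi
    echo "Fastq directory not found: $1/$2" >&2
    return 1
}

# Demonstration using a temporary directory in place of a real run
DEMO_RUN=$(mktemp -d)
mkdir "$DEMO_RUN/bcl2fastq2"
check_fastq_dir "$DEMO_RUN" bcl2fastq2 && echo "found bcl2fastq2"
```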
