Running the QC standalone with run_qc.py

The utility run_qc.py allows the QC pipeline to be run on an arbitrary set of Fastqs outside of the auto_process pipeline.

The general invocation is:

run_qc.py DIR | FASTQ [ FASTQ ... ]

If DIR is supplied then it should be a directory containing the Fastq files to run the QC on:

run_qc.py /mnt/data/project/fastqs.trimmed/

Alternatively a list of Fastq files can be supplied directly, for example:

run_qc.py /mnt/data/project/fastqs.trimmed/*.fastq

Note

You can use more elaborate shell wildcard patterns to select the Fastqs to pass to the QC, for example:

run_qc.py /mnt/data/project/fastqs.trimmed/{SC1_*,SC2_*}.trimmed.fastq

would only select Fastq files matching SC1_*.trimmed.fastq and SC2_*.trimmed.fastq.

See https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm for more detail on using shell-style wildcards for pattern matching.
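
To make the selection concrete, the following Python sketch mimics what the shell does with the brace pattern above: it expands the braces into two separate glob patterns and keeps only the names that match one of them. The file names are hypothetical and exist only for illustration; this is not part of run_qc.py.

```python
from fnmatch import fnmatchcase

# Hypothetical file names in a Fastq directory (illustration only)
names = [
    "SC1_S1_R1.trimmed.fastq",
    "SC2_S2_R1.trimmed.fastq",
    "SC3_S3_R1.trimmed.fastq",
    "SC1_S1_R1.fastq",
]

# The shell expands {SC1_*,SC2_*}.trimmed.fastq into these two globs
patterns = ["SC1_*.trimmed.fastq", "SC2_*.trimmed.fastq"]

# Keep names matching at least one of the expanded patterns
selected = [n for n in names if any(fnmatchcase(n, p) for p in patterns)]
print(selected)
```

Only the SC1 and SC2 files carrying the .trimmed.fastq suffix are selected; the SC3 file and the untrimmed SC1 file are left out.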

Various options are available to control the QC; the following sections outline the most useful - see the documentation in the run_qc.py section for the full set of options.

Specifying the QC metadata

The following options specify metadata for the QC which will determine which metrics are run:

  • --organism: specify the organism(s) (e.g. human, mouse etc; use the --info option to list the organisms available in the current configuration)

  • --library-type: specify the library type (e.g. RNA-seq, ChIP-seq etc)

  • --single-cell-platform: specify the name for the single cell platform (e.g. 10xGenomics Chromium 3'v3; see Projects metadata file: projects.info for a complete list, or use the --info option)

  • --biological-samples: explicitly specify the subset of samples with biological data (as opposed to, for example, feature barcode information); by default all samples are treated as if they contain biological data

Note

By default run_qc.py attempts to locate a project metadata file associated with the input files, and will use the metadata defined there unless overridden by the options above; use the --ignore-metadata option to ignore the project metadata even if it is found.
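
The precedence described in this note can be sketched in Python as follows. This is only an illustration of the rule (command line options win over project metadata, and --ignore-metadata discards the project metadata entirely); the function name and dictionary keys are hypothetical and do not reflect how run_qc.py stores metadata internally.

```python
def effective_metadata(project_metadata, cli_options, ignore_metadata=False):
    """Combine project metadata with command line overrides.

    Hypothetical helper: keys such as 'organism' are illustrative only.
    """
    # Start from the project metadata unless it is being ignored
    metadata = {} if ignore_metadata else dict(project_metadata)
    # Options given on the command line (i.e. not None) take precedence
    metadata.update({k: v for k, v in cli_options.items() if v is not None})
    return metadata

project = {"organism": "mouse", "library_type": "RNA-seq"}
options = {"organism": "human", "library_type": None}

print(effective_metadata(project, options))
print(effective_metadata(project, options, ignore_metadata=True))
```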

Specifying the outputs

By default the QC reports are written to the current directory, with the QC outputs written to a qc subdirectory.

The following options can be used to override the defaults:

  • -o/--out_dir: sets the top-level directory where the QC reports and outputs are written

  • --qc_dir: sets the location where the QC outputs are written; if this is a relative path then it will be a subdirectory of the top-level output directory

  • --filename: name for the HTML report from the QC

  • --name: sets the name for the project (used in the QC report title)
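
As an illustration of how a relative --qc_dir is interpreted, the Python sketch below resolves the QC directory against the top-level output directory. The helper name and the example paths are assumptions made for this example; it is not the code used by run_qc.py.

```python
import os

def resolve_qc_dir(out_dir, qc_dir="qc"):
    """Return the QC directory implied by the two options (hypothetical helper)."""
    if os.path.isabs(qc_dir):
        # An absolute path is used as given
        return qc_dir
    # A relative path becomes a subdirectory of the top-level output directory
    return os.path.join(out_dir, qc_dir)

print(resolve_qc_dir("/data/results"))
print(resolve_qc_dir("/data/results", "qc.trimmed"))
print(resolve_qc_dir("/data/results", "/scratch/qc"))
```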

Updating existing QC outputs

If the output directory for the QC already exists then by default run_qc.py will stop without running the pipeline; to override this behaviour and update an existing QC output directory, specify the -u or --update option.

Specifying reference data

By default the reference data used in the QC pipeline (for example the STAR index or the genome annotation) are determined from the configuration file according to the organism associated with the dataset.

If no matching reference data are found then the following options are available:

  • --star-index: sets the path to the STAR index to use

  • --gtf: sets the path to the genome annotation in GTF format

Note that these options will also override the reference data which are set in the configuration.

Note

The --info option can be used to see what default reference data has been assigned to each configured organism name.

Running 10xGenomics single library analyses

When running the QC pipeline on 10x Genomics data using an appropriate protocol, additional options are available for controlling the single library analyses that are run using the count command of the appropriate 10x Genomics software package.

The following options can be used to override the defaults defined in the configuration:

  • --cellranger: explicitly sets the path to the cellranger (or other appropriate 10xGenomics package)

  • --cellranger-reference: sets the path to the reference dataset to use for single library analysis

For single cell RNA-seq additional options are available:

  • --10x_force_cells: explicitly specify the number of cells, overriding automatic cell detection algorithms (i.e. set the --force-cells option for Cellranger)

  • --10x_chemistry: explicitly set the assay configuration, overriding the automatic assay detection (i.e. set the --chemistry option for Cellranger)

Note

The full outputs from the single library analyses are copied to the cellranger_count subdirectory of the top-level output directory, in addition to the metrics and HTML summary files written to the QC directory.

Running 10x Genomics CellRanger multi analysis

When running the QC pipeline on 10x Genomics data using an appropriate protocol, additional options are available for controlling the CellRanger multi analyses (i.e. the cellranger multi command). (These options supplement those available for the single library analyses in the previous section.)

In order to enable the multi analyses, information on each of the 10x Genomics libraries associated with each physical sample must be specified via the --10x_library option, using the syntax:

--10x_library [SAMPLE:]FASTQ_ID:FEATURE_TYPE

where SAMPLE is an optional physical sample name (required if there is more than one physical sample), FASTQ_ID is essentially the “sample name” part of the Fastq name, and FEATURE_TYPE is the associated feature type (e.g. Gene Expression).

For example:

--10x_library "PB01:PB01_GEX:Gene Expression" --10x_library PB01:PB01_TCR:VDJ-T ...

There should be one --10x_library option supplied for each Fastq ID.

run_qc.py then uses the supplied information to build cellranger multi configuration files for each physical sample, with the feature types assigned within the libraries section.
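
The --10x_library syntax and the per-sample grouping can be illustrated with a short Python sketch. It splits each specification into its parts and collects the libraries under their physical sample name, which is the information needed for one libraries section per sample. The function names are hypothetical, and the sketch assumes that none of the fields contains a colon; it is not the parser used by run_qc.py.

```python
from collections import defaultdict

def parse_10x_library(spec):
    """Split '[SAMPLE:]FASTQ_ID:FEATURE_TYPE' into (sample, fastq_id, feature_type)."""
    parts = spec.split(":")
    if len(parts) == 2:
        # No physical sample name supplied
        return (None, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError("Bad library specification: %r" % spec)

def group_libraries(specs):
    """Group (fastq_id, feature_type) pairs by physical sample."""
    grouped = defaultdict(list)
    for spec in specs:
        sample, fastq_id, feature_type = parse_10x_library(spec)
        grouped[sample].append((fastq_id, feature_type))
    return dict(grouped)

libraries = group_libraries([
    "PB01:PB01_GEX:Gene Expression",
    "PB01:PB01_TCR:VDJ-T",
])
print(libraries)
```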

If the dataset has multiplexed samples (for example CellPlex or Flex data) then their names can be assigned to the appropriate CMO or probeset ID using the --10x_multiplexed_samples option:

--10x_multiplexed_samples [SAMPLE:]MULTIPLEXED_SAMPLE=ID,...

where SAMPLE is an optional physical sample name (required if there is more than one physical sample).

For example:

--10x_multiplexed_samples PJB1:PBA=CMO301,PBB=CMO302
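
A minimal Python sketch of how this syntax breaks down is shown below: the optional physical sample name is separated first, then each NAME=ID pair is read from the comma-separated list. The function name is hypothetical and the sketch assumes that names and IDs contain no colons, commas or equals signs; it is not run_qc.py's own parser.

```python
def parse_multiplexed_samples(spec):
    """Split '[SAMPLE:]NAME=ID,...' into (sample, {name: id})."""
    sample = None
    if ":" in spec:
        # Leading physical sample name is present
        sample, spec = spec.split(":", 1)
    mapping = {}
    for item in spec.split(","):
        name, ident = item.split("=", 1)
        mapping[name] = ident
    return sample, mapping

print(parse_multiplexed_samples("PJB1:PBA=CMO301,PBB=CMO302"))
```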

Two additional command options are provided specifically for V(D)J and Flex data:

  • --cellranger-vdj-reference: if V(D)J data is present then this option should be used to specify the location of the V(D)J reference dataset;

  • --cellranger-probest: for Flex data this option can be used to specify the location of the probe set reference data (otherwise the probe set will be taken from the configuration, if possible).

Running on different platforms: --local

By default the QC pipeline will run using the settings from the auto_process configuration file; however it is recommended to use the --local option if running the QC on a local workstation, or within a job submitted to a compute cluster (for example, if running inside another script).

For example, submitting a QC run as a single job on a Grid Engine compute cluster might look like:

qsub -b y -V -pe smp.pe 16 'run_qc.py --local /data/Fastqs'

In this mode the pipeline overrides the central configuration and attempts to adjust parameters for running the QC to suit the local setup.

It should make reasonable guesses for the number of available CPUs and memory. However the following options can be used with --local to override the guesses:

  • --maxcores: sets the maximum number of CPUs available; the QC will not exceed this number when running jobs. If this isn’t set explicitly then the pipeline will attempt to determine the number of CPUs automatically;

  • --maxmem: sets the maximum amount of memory available (in GB); currently this is only used if cellranger is being run. If this isn’t set explicitly then the pipeline will attempt to determine the available memory automatically.

Explicitly specifying these parameters for a QC run submitted as a single job on a Grid Engine compute cluster might look like:

qsub -b y -V -pe smp.pe 16 'run_qc.py --local --maxcores=16 --maxmem=64 /data/Fastqs'
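
The following Python sketch shows one way that such guesses could be made, and why explicit values can be preferable inside a cluster job. It is an assumption-laden illustration (the function name is hypothetical and the memory lookup relies on sysconf values that are available on Linux); it is not the method used by run_qc.py.

```python
import os

def guess_local_resources(maxcores=None, maxmem=None):
    """Return (cores, memory in GB), preferring explicit values over guesses."""
    # os.cpu_count() reports the CPUs on the machine, which may be more
    # than the slots actually granted to a cluster job
    cores = maxcores if maxcores is not None else (os.cpu_count() or 1)
    if maxmem is None:
        try:
            # Total physical memory in bytes (may be unavailable on some systems)
            total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
            maxmem = total // (1024 ** 3)
        except (AttributeError, ValueError, OSError):
            maxmem = None  # could not be determined
    return cores, maxmem

print(guess_local_resources())
print(guess_local_resources(maxcores=16, maxmem=64))
```

Because an automatic guess of this kind sees the whole machine, matching --maxcores to the number of slots requested from the scheduler (16 in the qsub example above) avoids oversubscribing the job.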

Listing organisms and other information: --info

The --info option of run_qc.py displays various items from the current configuration, including a list of the available organisms and the indexes and other reference data assigned to each.

Other information includes the available QC protocols, single cell platforms and FastqScreen .conf files.

Once the information is displayed, run_qc.py will exit without performing any further action.

Per-lane QC: --split-fastqs-by-lane

The --split-fastqs-by-lane option forces the pipeline to generate copies of the input Fastqs for each lane that appears in the read headers of each Fastq, and then run the QC on those per-lane Fastqs (rather than the originals, which are not changed).

This results in per-lane QC metrics, which can be useful for diagnostic purposes when handling Fastqs which have been merged across multiple lanes (for example, to determine whether contamination is confined to a single lane).
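
The idea of splitting by lane can be sketched in Python as below. The sketch assumes standard Illumina read headers, where the lane number is the fourth colon-separated field (@instrument:run:flowcell:lane:tile:x:y), and four-line Fastq records; the function name and the example reads are invented for illustration and this is not the pipeline's own implementation.

```python
from collections import defaultdict

def split_by_lane(fastq_text):
    """Group four-line Fastq records by the lane number in the read header."""
    lanes = defaultdict(list)
    lines = fastq_text.splitlines()
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        # Header fields: @instrument:run:flowcell:lane:tile:x:y ...
        lane = int(record[0].split(":")[3])
        lanes[lane].append(record)
    return dict(lanes)

# Invented example reads from a Fastq merged across two lanes
example = "\n".join([
    "@NB500968:10:H57NTAFXX:1:11101:4455:1050 1:N:0:ATCACG",
    "ACGT", "+", "FFFF",
    "@NB500968:10:H57NTAFXX:2:11101:5012:1061 1:N:0:ATCACG",
    "TTGA", "+", "FFFF",
])
per_lane = split_by_lane(example)
print(sorted(per_lane))
```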

Specifying and customising the QC protocols

If a QC protocol is not specified explicitly then by default run_qc.py will determine which protocol to use based on the metadata; see Generating QC reports using auto_process run_qc for a complete list of the built-in protocols (or use run_qc.py’s --info option).

This behaviour can be overridden in a number of ways:

  • The --protocol option can be used to explicitly specify one of the built-in QC protocols to use, or

  • The --protocol-spec option can be used to fully specify a custom QC protocol using a specification string (see qc_protocol_specification for details of protocol specification strings), or

  • A combination of the --sequence-reads, --index-reads and/or --qc-modules options can be used to customise or specify the QC protocol.

The --sequence-reads and --index-reads options specify which reads contain biological data and which contain only “index” data; the --qc-modules option specifies which QC metrics will be generated and reported.

Note

Details of the available QC modules can be found in QC modules.

Note

Using --sequence-reads, --index-reads and --qc-modules will override the relevant parts of the specified or automatically determined QC protocol.

Handling non-standard Fastq file names

By default, the pipeline expects Fastq names to be some variant of the Illumina-style naming structure. Where a collection of Fastq files have names that match one of these variants, the standalone QC is able to pair R1 and R2 Fastqs and run the appropriate QC protocol.

Where the Fastqs don’t use an Illumina or pseudo-Illumina naming format, the pipeline may fail to match R1/R2 read pairs correctly. This may result in an inappropriate QC protocol being selected, or the correct protocol failing to run correctly.

To handle these “non-standard” Fastq file names, it is possible to supply a basic naming pattern to run_qc.py which will tell the utility how to extract sample names and read numbers from the names, via the --fastq-pattern option.

The pattern should be a string consisting of the following elements:

  • {SAMPLE} and {READ}, indicating where the sample name and read numbers are expected;

  • Fixed parts of the names which are present in all names;

  • Wildcards (*) at positions where there may be variations between names which are not relevant to the sample or read numbers.

Some examples of patterns for different file name structures are given in the table below.

Pattern                      Example matching Fastqs

{SAMPLE}_{READ}              SMP1-1_1.fastq, SMP1-1_2.fastq

{SAMPLE}_R{READ}             SMP1-1_R1.fastq, SMP1-1_R2.fastq

{SAMPLE}_S*_L*_R{READ}       SMP1-1_S1_L002_R2_001.fq, …

{SAMPLE}_ELK*_L*_{READ}      SMP1_ELK250000280-1B_24EWNTLT3_L2_2.fq.gz, …

Using these patterns will enable the pipeline to correctly pair Fastqs for the QC, and to use the appropriate sample names in the reporting.
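
To show how a pattern of this kind relates to a file name, the Python sketch below converts a pattern into a regular expression and extracts the sample name and read number. It rests on several assumptions that are not taken from run_qc.py: Fastq extensions (.fastq, .fq, optionally .gz) are removed before matching, the pattern is matched from the start of the name with any trailing text allowed, and read numbers are digits. The function names are hypothetical.

```python
import os
import re

def strip_extensions(path):
    """Remove the directory and any .gz, .fastq or .fq extensions."""
    base = os.path.basename(path)
    for ext in (".gz", ".fastq", ".fq"):
        if base.endswith(ext):
            base = base[:-len(ext)]
    return base

def pattern_to_regex(pattern):
    """Translate {SAMPLE}, {READ} and * into a regular expression."""
    regex = ""
    for token in re.split(r"(\{SAMPLE\}|\{READ\}|\*)", pattern):
        if token == "{SAMPLE}":
            regex += r"(?P<sample>.+?)"
        elif token == "{READ}":
            regex += r"(?P<read>\d+)"
        elif token == "*":
            regex += r".*?"
        else:
            regex += re.escape(token)  # fixed part of the name
    return re.compile(regex)

def extract_sample_and_read(pattern, fastq):
    """Return (sample, read) for a Fastq name, or None if it does not match."""
    match = pattern_to_regex(pattern).match(strip_extensions(fastq))
    if match is None:
        return None
    return match.group("sample"), int(match.group("read"))

print(extract_sample_and_read("{SAMPLE}_S*_L*_R{READ}",
                              "SMP1-1_S1_L002_R2_001.fq"))
print(extract_sample_and_read("{SAMPLE}_ELK*_L*_{READ}",
                              "SMP1_ELK250000280-1B_24EWNTLT3_L2_2.fq.gz"))
```

With the sample name and read number extracted in this way, Fastqs that share a sample name can be paired by read number.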