QC Reports

Overview

The auto_process run_qc command outputs an HTML report for the QC for each of the projects in the analysis directory, which enables an assessment of the quality of the Fastq files for each sample in the project and tries to highlight aspects which might pose problems with the data in subsequent analyses.

An example of the top of a QC report index page is shown below:

Report structure

Each report consists of a number of different elements which are shown schematically below:

The key element is the top-level summary section which itself consists of a number of subsections:

Metadata, reference data and comments
QC summary table
Single library analyses
Dataset-wide metrics and reports

Below this top-level summary there are more detailed per-sample and per-Fastq reports (see Full QC outputs per Fastq).

Metadata, reference data and comments

This section contains a number of tables and subsections which summarise information associated with the project and QC:

General information including the user, PI, library type, organism(s) and QC protocol
Processing software and versions (if available)
QC software and versions (if available)
Reference data such as Cellranger and STAR indexes
Comments associated with the project

QC summary table

The QC summary table summarises the key QC results for each sample and Fastq or Fastq pair (for paired end data).

For example:

The summary includes the following basic information for each sample and Fastq:

Sample name
Fastq names associated with each sample
Number of reads or read pairs for each Fastq
Mean and range of sequence lengths for each Fastq

Additionally the following metrics are reported (typically in the form of small summary plots, which are described in the appropriate sections below):

Quality boxplots
Fastqc summary plots
Fastq_screen summary plots
Strandedness
Read count plots
Sequence length distribution plots
Sequence duplication summary plots
Adapter content summary plots
Insert size distribution plots
Qualimap coverage plots
Qualimap origin of genomic reads plots

One purpose of this table is to help pick up on trends and identify any outliers within the dataset as a whole; hence the main function of these plots are to convey a general sense of the data.

Note that not all outputs might appear, depending on the QC protocol that was used.

The sample and Fastq names in the table link through to the full QC outputs for the sample or Fastqs in question; other items (e.g. the quality boxplots) link to the relevant parts of the full QC outputs section (see Full QC outputs per Fastq).

Quality boxplots

The summary table includes a small version of the sequence quality boxplot from fastqc, for example:

A larger version of the plot is presented in the Full QC outputs per Fastq section.

Fastqc summary plots

The output from fastqc includes a summary table with a set of metrics and an indication of whether the Fastq has passed, failed or triggered a warning for each.

The summary table includes a small plot which gives an impression of the overall state of the metrics for each Fastq file, for example:

Each bar in the plot represents one of the fastqc metrics, (for example “Basic statistics”, “Per base sequence quality”, and so on); the colour (red, amber, green) and position (left, centre, right) indicate the status of the metric as determined by fastqc.

The data are presented in more detail in a table in the Full QC outputs per Fastq section.

Fastq_screen summary plots

The summary table includes a small plot which represents the outputs from fastq_screen, for example:

The three boxes represent (from left to right) the model organisms, other organisms and rRNA plots produced by fastq_screen. The full plots and links to the raw data for each screen can be found in the Full QC outputs per Fastq section.

Strandedness

fastq_strand.py runs STAR to get the number of reads which map to the forward and reverse strands; it then calculates a pseudo-percentage (“pseudo” because it can exceed 100%) for foward and reverse.

The summary table reports the pseudo-percentages as a barplot with a pair of barplots, where the top bar represents the forward pseudo-percentage and the bottom bar the reverse value.

Some examples:

Example strandedness plots
Example	Interpretation
	Likely forward stranded
	Likely reverse stranded
	Likely unstranded

More detailed information about the strandedness statistics is given in the Full QC outputs per Fastq section.

Read count plots

The read count plots indicate the relative number of reads for each Fastq, and the proportion of those reads which are masked and/or padded.

The solid portion of the bar represents the number of reads in the Fastq file, scaled to the highest number of reads present across all Fastqs in the project (so the largest Fastqs will have a bar consisting entirely of solid colours).
Within the solid portion of each bar, different colours represent the proportion of reads which are either masked (red), padded (orange), or neither masked or padded (green).

Note

“Masked” reads have sequences which consist entirely of N’s (e.g. NNNNNNNNNNNNN), whilst “padded” reads have sequences which have one or more trailing N’s (e.g. ATTAGGGCCNNNN).

Examples:

Example read counts plots
Example	Interpretation
	Good data: no masked or padded reads present in Fastq (bar is green) & high number of reads compared to largest Fastq in report (solid portion occupies most of plot)
	Good data: no masked or padded reads but small number of reads compared to largest Fastq in report (solid portion occupies small part of plot)
	Reasonable data: only small proportions of masked (red portion of bar) and padded reads (orange portion of bar) & highest number of reads across all Fastqs in report (plot is entirely solid colour)
	Poor data: high proportions of masked (red portion of bar) and padded reads (orange portion of bar)

Sequence length distribution plots

The sequence length distribution plots are histograms showing the relative number of reads with different sequence lengths. The data is analogous to that shown in the Sequence Length Distribution module of fastqc.

Typically for trimmed data the plots will look like e.g.:

An example with a range of sequence lengths from an adapter-trimmed miRNA-seq dataset which shows peaks for shorter sequence lengths followed by a long tail:

For untrimmed data or other datasets where all sequences are the same length, plots will look like e.g.

Sequence duplication summary plots

The sequence duplication summary plots indicate the level of sequence duplication in the data, according to the Sequence Duplication Levels module of fastqc.

The duplication level is the percentage of reads that are would be removed when reads with duplicated sequences (i.e. sequences that appear in multiple reads) are counted as a single read. It is an indication of the number of reads with distinct sequences within the data (as lower duplication indicates fewer distinct sequences).

(See the Biostars thread Revisiting the FastQC read duplication report for more explanation of the deduplication in fastqc.)

In the plots the solid portion of the bar represents the fraction of reads removed by deduplication, and the colour of the bar indicates which category the data fall into depending on the level of reads remaining:

Red indicates less than 20% reads remain after deduplication (i.e. more than 80% reads were duplicates)
Orange indicates 20-30% of reads remain (i.e. between 70-80% reads were duplicates)
Blue indicates more than 30% of reads remain (i.e. less than 70% reads were duplicates)

Note

The thresholds used in this plot differs from those used by fastqc.

The background of the plot also uses lighter versions of these colours to indicate the thresholds.

For example:

Example sequence duplication plots
Example	Interpretation
	Fail: more than 80% of reads are duplicated
	Warn: between than 70-80% of reads are duplicated
	Pass: less than 70% of reads are duplicated
	Plot background with no data (to show thresholds for pass, warn and fail)

Adapter content summary plots

The adapter content summary plots condense the data from the Adapter Sequences module of fastqc into a single metric, to indicate the proportion of adapter sequences in a Fastq file.

A single adapter fraction is obtained for each adapter class detected by fastqc by calculating the fraction of plot area which lies under the curves for each adapter in the “Adapter Content” plots. This is then represented as a bar where the coloured portion corresponds to the fraction for each adapter.

Note

The colours of the bar match the colours used by fastqc for different adapter classes.

For example:

Example read counts plots
Example	Interpretation
	No adapter content detected (bar is grey)
	Small amount of adapter content detected (bar is partially solid, with green colour indicating presence of Nextera Transposase sequences)
	Significant adapter content detected (more than 50% of the bar is solid, with red colour indicating presence of Illumina Universal Adapter sequences)

Insert size distribution plots

These plots are small versions of the insert size distribution histograms output by Picard’s CollectInsertSizeMetrics utility.

For example:

The insert size metrics are also collated across all samples into a single TSV file (see Collated insert sizes).

Qualimap coverage plots

Plot summarising the mean coverage profile of all transcripts with non-zero coverage as produced by Qualimap’s RNA-seq analysis; essentially these are the data from the coverage_profile_along_genes_(total).txt output file.

For example:

../_images/qualimap_gene_body_coverage_uplot.png

Qualimap origin of genomic reads plots

Bar chart summarising the genomic origin of reads data from Qualimap’s RNA-seq analysis; specifically this indicates the fraction of the read alignments which fall into exonic, intronic and intergenic regions.

For example:

../_images/qualimap_genomic_origin_reads.png

Single library analyses

For 10x Genomics datasets single library analyses may also have been performed for each sample using the count command of the appropriate 10xGenomics pipeline (e.g. cellranger for scRNA-seq data, cellranger-atac for scATAC-seq data etc).

In these cases an additional summary table will appear in the report with appropriate metrics for each sample along with links to the HTML reports from the count command. For example, for an scRNA-seq dataset:

../_images/qc_report_single_library_summary.png

The reported metrics will depend on the pipeline and type of data.

Details of the contents of the linked web_summary.html report can be found in the appropriate documentation for the 10xGenomics pipeline:

cellranger: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/summary

cellranger-atac: https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/summary

cellranger-arc: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/summary

Note

The full set of outputs can be found under the cellranger_count subdirectory of the project directory, when single library analysis has been performed.

Note

For single cell multiome datasets there may also be a summary table for CellRanger ARC’s single library analyses; similarly for CellPlex datasets a summary table for the multiplexing analysis may be present (in both cases depending on the contents of the QC directory).

Dataset-wide metrics and reports

This section contains any dataset-wide metrics and additional reports, including:

RSeQC gene body coverage plot
Collated insert sizes
MultiQC report

RSeQC gene body coverage plot

This is the gene body coverage plot generated by RSeQC’s genebody_coverage.py utility, as a PNG.

For example:

../_images/qc_report_rseqc_gene_body_coverage.png

Collated insert sizes

This is a TSV (tab-delimited values) file which contains the following data for each aligned Fastq, from the output of Picard’s CollectInsertSizeMetrics command:

BAM file name
Mean insert size
Standard deviation
Median insert size
Median absolute deviation

MultiQC report

The HTML report generated by MultiQC when run on the QC directory.

Full QC outputs per Fastq

After the summary table, the full QC outputs for each Fastq or Fastq pair are grouped by sample, for example:

For each Fastq the subsections consist of:

fastqc outputs including the sequence quality boxplot and a table of the quality metrics with links to the full report:
fastq_screen outputs for each screen, for example:
fastq_strand data: