QC Reports

Overview

The auto_process run_qc command outputs an HTML report for the QC for each of the projects in the analysis directory, which enables an assessment of the quality of the Fastq files for each sample in the project and tries to highlight aspects which might pose problems with the data in subsequent analyses.

An example of the top of a QC report index page is shown below:

../_images/qc_report_full.png

Report structure

Each report consists of a number of different elements which are shown schematically below:

../_images/qc_report_schematic.png

The key element is the top-level summary section which itself consists of a number of subsections:

Below this top-level summary there are more detailed per-sample and per-Fastq reports (see Full QC outputs per Fastq).

Metadata, reference data and comments

This section contains a number of tables and subsections which summarise information associated with the project and QC:

  • General information including the user, PI, library type, organism(s) and QC protocol

  • Processing software and versions (if available)

  • QC software and versions (if available)

  • Reference data such as Cellranger and STAR indexes

  • Comments associated with the project

QC summary table

The QC summary table summarises the key QC results for each sample and Fastq or Fastq pair (for paired end data).

For example:

../_images/qc_report_summary.png

The summary includes the following basic information for each sample and Fastq:

  • Sample name

  • Fastq names associated with each sample

  • Number of reads or read pairs for each Fastq

  • Mean and range of sequence lengths for each Fastq

Additionally the following metrics are reported (typically in the form of small summary plots, which are described in the appropriate sections below):

One purpose of this table is to help pick up on trends and identify any outliers within the dataset as a whole; hence the main function of these plots are to convey a general sense of the data.

Note that not all outputs might appear, depending on the QC protocol that was used.

The sample and Fastq names in the table link through to the full QC outputs for the sample or Fastqs in question; other items (e.g. the quality boxplots) link to the relevant parts of the full QC outputs section (see Full QC outputs per Fastq).

Quality boxplots

The summary table includes a small version of the sequence quality boxplot from fastqc, for example:

../_images/uboxplot.png

A larger version of the plot is presented in the Full QC outputs per Fastq section.

Fastqc summary plots

The output from fastqc includes a summary table with a set of metrics and an indication of whether the Fastq has passed, failed or triggered a warning for each.

The summary table includes a small plot which gives an impression of the overall state of the metrics for each Fastq file, for example:

../_images/fastqc_uplot.png

Each bar in the plot represents one of the fastqc metrics, (for example “Basic statistics”, “Per base sequence quality”, and so on); the colour (red, amber, green) and position (left, centre, right) indicate the status of the metric as determined by fastqc.

The data are presented in more detail in a table in the Full QC outputs per Fastq section.

Fastq_screen summary plots

The summary table includes a small plot which represents the outputs from fastq_screen, for example:

../_images/fastq_screen_uplot.png

The three boxes represent (from left to right) the model organisms, other organisms and rRNA plots produced by fastq_screen. The full plots and links to the raw data for each screen can be found in the Full QC outputs per Fastq section.

Strandedness

fastq_strand.py runs STAR to get the number of reads which map to the forward and reverse strands; it then calculates a pseudo-percentage (“pseudo” because it can exceed 100%) for foward and reverse.

The summary table reports the pseudo-percentages as a barplot with a pair of barplots, where the top bar represents the forward pseudo-percentage and the bottom bar the reverse value.

Some examples:

Example strandedness plots

Example

Interpretation

strandedness_forward

Likely forward stranded

strandedness_reverse

Likely reverse stranded

strandedness_no_strand

Likely unstranded

More detailed information about the strandedness statistics is given in the Full QC outputs per Fastq section.

Read count plots

The read count plots indicate the relative number of reads for each Fastq, and the proportion of those reads which are masked and/or padded.

  • The solid portion of the bar represents the number of reads in the Fastq file, scaled to the highest number of reads present across all Fastqs in the project (so the largest Fastqs will have a bar consisting entirely of solid colours).

  • Within the solid portion of each bar, different colours represent the proportion of reads which are either masked (red), padded (orange), or neither masked or padded (green).

Note

“Masked” reads have sequences which consist entirely of N’s (e.g. NNNNNNNNNNNNN), whilst “padded” reads have sequences which have one or more trailing N’s (e.g. ATTAGGGCCNNNN).

Examples:

Example read counts plots

Example

Interpretation

read_count_uplot

Good data: no masked or padded reads present in Fastq (bar is green) & high number of reads compared to largest Fastq in report (solid portion occupies most of plot)

read_count_uplot_small

Good data: no masked or padded reads but small number of reads compared to largest Fastq in report (solid portion occupies small part of plot)

read_count_uplot_mask_pad1

Reasonable data: only small proportions of masked (red portion of bar) and padded reads (orange portion of bar) & highest number of reads across all Fastqs in report (plot is entirely solid colour)

read_count_uplot_mask_pad2

Poor data: high proportions of masked (red portion of bar) and padded reads (orange portion of bar)

Sequence length distribution plots

The sequence length distribution plots are histograms showing the relative number of reads with different sequence lengths. The data is analogous to that shown in the Sequence Length Distribution module of fastqc.

Typically for trimmed data the plots will look like e.g.:

../_images/seq_dist_uplot.png

An example with a range of sequence lengths from an adapter-trimmed miRNA-seq dataset which shows peaks for shorter sequence lengths followed by a long tail:

../_images/seq_dist_uplot_slewed.png

For untrimmed data or other datasets where all sequences are the same length, plots will look like e.g.

../_images/seq_dist_uplot_untrimmed.png

Sequence duplication summary plots

The sequence duplication summary plots indicate the level of sequence duplication in the data, according to the Sequence Duplication Levels module of fastqc.

The duplication level is the percentage of reads that are would be removed when reads with duplicated sequences (i.e. sequences that appear in multiple reads) are counted as a single read. It is an indication of the number of reads with distinct sequences within the data (as lower duplication indicates fewer distinct sequences).

(See the Biostars thread Revisiting the FastQC read duplication report for more explanation of the deduplication in fastqc.)

In the plots the solid portion of the bar represents the fraction of reads removed by deduplication, and the colour of the bar indicates which category the data fall into depending on the level of reads remaining:

  • Red indicates less than 20% reads remain after deduplication (i.e. more than 80% reads were duplicates)

  • Orange indicates 20-30% of reads remain (i.e. between 70-80% reads were duplicates)

  • Blue indicates more than 30% of reads remain (i.e. less than 70% reads were duplicates)

Note

The thresholds used in this plot differs from those used by fastqc.

The background of the plot also uses lighter versions of these colours to indicate the thresholds.

For example:

Example sequence duplication plots

Example

Interpretation

dup_uplot_fail

Fail: more than 80% of reads are duplicated

dup_uplot_warn

Warn: between than 70-80% of reads are duplicated

dup_uplot_pass

Pass: less than 70% of reads are duplicated

dup_uplot_bg

Plot background with no data (to show thresholds for pass, warn and fail)

Adapter content summary plots

The adapter content summary plots condense the data from the Adapter Sequences module of fastqc into a single metric, to indicate the proportion of adapter sequences in a Fastq file.

A single adapter fraction is obtained for each adapter class detected by fastqc by calculating the fraction of plot area which lies under the curves for each adapter in the “Adapter Content” plots. This is then represented as a bar where the coloured portion corresponds to the fraction for each adapter.

Note

The colours of the bar match the colours used by fastqc for different adapter classes.

For example:

Example read counts plots

Example

Interpretation

adapter_uplot_no_adptrs

No adapter content detected (bar is grey)

adapter_uplot_adptrs_sml

Small amount of adapter content detected (bar is partially solid, with green colour indicating presence of Nextera Transposase sequences)

adapter_uplot_adptrs_lrg

Significant adapter content detected (more than 50% of the bar is solid, with red colour indicating presence of Illumina Universal Adapter sequences)

Insert size distribution plots

These plots are small versions of the insert size distribution histograms output by Picard’s CollectInsertSizeMetrics utility.

For example:

../_images/picard_insert_size_uplot.png

The insert size metrics are also collated across all samples into a single TSV file (see Collated insert sizes).

Qualimap coverage plots

Plot summarising the mean coverage profile of all transcripts with non-zero coverage as produced by Qualimap’s RNA-seq analysis; essentially these are the data from the coverage_profile_along_genes_(total).txt output file.

For example:

../_images/qualimap_gene_body_coverage_uplot.png

Qualimap origin of genomic reads plots

Bar chart summarising the genomic origin of reads data from Qualimap’s RNA-seq analysis; specifically this indicates the fraction of the read alignments which fall into exonic, intronic and intergenic regions.

For example:

../_images/qualimap_genomic_origin_reads.png

Single library analyses

For 10x Genomics datasets single library analyses may also have been performed for each sample using the count command of the appropriate 10xGenomics pipeline (e.g. cellranger for scRNA-seq data, cellranger-atac for scATAC-seq data etc).

In these cases an additional summary table will appear in the report with appropriate metrics for each sample along with links to the HTML reports from the count command. For example, for an scRNA-seq dataset:

../_images/qc_report_single_library_summary.png

The reported metrics will depend on the pipeline and type of data.

Details of the contents of the linked web_summary.html report can be found in the appropriate documentation for the 10xGenomics pipeline:

Note

The full set of outputs can be found under the cellranger_count subdirectory of the project directory, when single library analysis has been performed.

Note

For single cell multiome datasets there may also be a summary table for CellRanger ARC’s single library analyses; similarly for CellPlex datasets a summary table for the multiplexing analysis may be present (in both cases depending on the contents of the QC directory).

Dataset-wide metrics and reports

This section contains any dataset-wide metrics and additional reports, including:

RSeQC gene body coverage plot

This is the gene body coverage plot generated by RSeQC’s genebody_coverage.py utility, as a PNG.

For example:

../_images/qc_report_rseqc_gene_body_coverage.png

Collated insert sizes

This is a TSV (tab-delimited values) file which contains the following data for each aligned Fastq, from the output of Picard’s CollectInsertSizeMetrics command:

  • BAM file name

  • Mean insert size

  • Standard deviation

  • Median insert size

  • Median absolute deviation

MultiQC report

The HTML report generated by MultiQC when run on the QC directory.

Full QC outputs per Fastq

After the summary table, the full QC outputs for each Fastq or Fastq pair are grouped by sample, for example:

../_images/qc_outputs_per_fastq.png

For each Fastq the subsections consist of:

  • fastqc outputs including the sequence quality boxplot and a table of the quality metrics with links to the full report:

    ../_images/fastqc_full.png
  • fastq_screen outputs for each screen, for example:

    ../_images/fastq_screen_full.png
  • fastq_strand data:

    ../_images/strandedness_full.png