QC Reports
==========

********
Overview
********

The :doc:`auto_process run_qc <../using/run_qc>` command outputs an
HTML QC report for each of the projects in the analysis directory.
Each report enables an assessment of the quality of the Fastq files
for each sample in the project, and tries to highlight aspects which
might pose problems with the data in subsequent analyses.

An example of the top of a QC report index page is shown below:

.. image:: ../images/qc/qc_report_full.png
   :align: center

****************
Report structure
****************

Each report consists of a number of different elements which are
shown schematically below:

.. image:: ../images/qc/qc_report_schematic.png
   :align: center

The key element is the top-level summary section which itself
consists of a number of subsections:

* :ref:`qc_report_project_metadata`
* :ref:`qc_report_qc_summary_table`
* :ref:`qc_report_single_library_analyses`
* :ref:`qc_report_dataset_wide_metrics`

Below this top-level summary there are more detailed per-sample
and per-Fastq reports (see :ref:`qc_report_qc_outputs_per_fastq`).

.. _qc_report_project_metadata:

*************************************
Metadata, reference data and comments
*************************************

This section contains a number of tables and subsections which
summarise information associated with the project and QC:

* **General information** including the user, PI, library type,
  organism(s) and QC protocol
* **Processing software** and versions (if available)
* **QC software** and versions (if available)
* **Reference data** such as Cellranger and STAR indexes
* **Comments** associated with the project

.. _qc_report_qc_summary_table:

****************
QC summary table
****************

The QC summary table summarises the key QC results for each sample
and Fastq or Fastq pair (for paired end data). For example:

.. image:: ../images/qc/qc_report_summary.png
   :align: center

The summary includes the following basic information for each
sample and Fastq:

* Sample name
* Fastq names associated with each sample
* Number of reads or read pairs for each Fastq
* Mean and range of sequence lengths for each Fastq

Additionally the following metrics are reported (typically in
the form of small summary plots, which are described in the
appropriate sections below):

* :ref:`qc_report_quality_boxplots`
* :ref:`qc_report_fastqc_summary_plots`
* :ref:`qc_report_fastq_screen_summary_plots`
* :ref:`qc_report_strandedness`
* :ref:`qc_report_read_counts`
* :ref:`qc_report_sequence_length_dist_plots`
* :ref:`qc_report_sequence_duplication_plots`
* :ref:`qc_report_adapter_summary_plots`
* :ref:`qc_report_insert_size_dist_plots`
* :ref:`qc_report_qualimap_rnaseq_coverage`
* :ref:`qc_report_qualimap_rnaseq_origin_genomic_reads`

One purpose of this table is to help pick up on trends and identify
any outliers within the dataset as a whole; hence the main function
of these plots is to convey a general sense of the data.

Note that not all outputs might appear, depending on the
:doc:`QC protocol <../using/run_qc>` that was used.

The sample and Fastq names in the table link through to the full
QC outputs for the sample or Fastqs in question; other items (e.g.
the quality boxplots) link to the relevant parts of the full QC
outputs section (see :ref:`qc_report_qc_outputs_per_fastq`).

.. _qc_report_quality_boxplots:

Quality boxplots
----------------

The summary table includes a small version of the sequence quality
boxplot from ``fastqc``, for example:

.. image:: ../images/qc/uboxplot.png
   :align: center

A larger version of the plot is presented in the
:ref:`qc_report_qc_outputs_per_fastq` section.

.. _qc_report_fastqc_summary_plots:

Fastqc summary plots
--------------------

The output from ``fastqc`` includes a summary table with a set of
metrics and an indication of whether the Fastq has passed, failed or
triggered a warning for each.

The QC summary table includes a small plot which gives an impression
of the overall state of these metrics for each Fastq file, for
example:

.. image:: ../images/qc/fastqc_uplot.png
   :align: center

Each bar in the plot represents one of the ``fastqc`` metrics (for
example "Basic statistics", "Per base sequence quality", and so on);
the colour (red, amber, green) and position (left, centre, right)
indicate the status of the metric as determined by ``fastqc``.

The data are presented in more detail in a table in the
:ref:`qc_report_qc_outputs_per_fastq` section.

.. _qc_report_fastq_screen_summary_plots:

Fastq_screen summary plots
--------------------------

The summary table includes a small plot which represents the
outputs from ``fastq_screen``, for example:

.. image:: ../images/qc/fastq_screen_uplot.png
   :align: center

The three boxes represent (from left to right) the model organisms,
other organisms and rRNA plots produced by ``fastq_screen``.

The full plots and links to the raw data for each screen can be
found in the :ref:`qc_report_qc_outputs_per_fastq` section.

.. _qc_report_strandedness:

Strandedness
------------

``fastq_strand.py`` runs ``STAR`` to get the number of reads which
map to the forward and reverse strands; it then calculates a
pseudo-percentage ("pseudo" because it can exceed 100%) for
forward and reverse.

The summary table reports the pseudo-percentages as a plot with a
pair of bars, where the top bar represents the forward
pseudo-percentage and the bottom bar the reverse value.

Some examples:

.. table:: Example strandedness plots
   :widths: auto

   ========================= =======================
   Example                   Interpretation
   ========================= =======================
   |strandedness_forward|    Likely forward stranded
   |strandedness_reverse|    Likely reverse stranded
   |strandedness_no_strand|  Likely unstranded
   ========================= =======================

More detailed information about the strandedness statistics is
given in the :ref:`qc_report_qc_outputs_per_fastq` section.

.. |strandedness_forward| image:: ../images/qc/strandedness_forward.png
.. |strandedness_reverse| image:: ../images/qc/strandedness_reverse.png
.. |strandedness_no_strand| image:: ../images/qc/strandedness_no_strand.png

.. _qc_report_read_counts:

Read count plots
----------------

The read count plots indicate the relative number of reads for
each Fastq, and the proportion of those reads which are masked
and/or padded.

* The solid portion of the bar represents the number of reads
  in the Fastq file, scaled to the highest number of reads
  present across all Fastqs in the project (so the largest
  Fastqs will have a bar consisting entirely of solid colours).

* Within the solid portion of each bar, different colours
  represent the proportion of reads which are either masked
  (red), padded (orange), or neither masked nor padded (green).

.. note::

   "Masked" reads have sequences which consist entirely of N's
   (e.g. ``NNNNNNNNNNNNN``), whilst "padded" reads have sequences
   which have one or more trailing N's (e.g. ``ATTAGGGCCNNNN``).

Examples:

.. table:: Example read counts plots
   :widths: auto

   ============================ ===================================
   Example                      Interpretation
   ============================ ===================================
   |read_count_uplot|           Good data: no masked or padded
                                reads present in Fastq (bar is
                                green) and high number of reads
                                compared to largest Fastq in
                                report (solid portion occupies
                                most of plot)
   |read_count_uplot_small|     Good data: no masked or padded
                                reads but small number of reads
                                compared to largest Fastq in
                                report (solid portion occupies
                                small part of plot)
   |read_count_uplot_mask_pad1| Reasonable data: only small
                                proportions of masked (red portion
                                of bar) and padded reads (orange
                                portion of bar) and highest number
                                of reads across all Fastqs in
                                report (plot is entirely solid
                                colour)
   |read_count_uplot_mask_pad2| Poor data: high proportions of
                                masked (red portion of bar) and
                                padded reads (orange portion of
                                bar)
   ============================ ===================================

.. |read_count_uplot| image:: ../images/auto/qc/read_count_uplot.png
.. |read_count_uplot_small| image:: ../images/auto/qc/read_count_uplot_small.png
.. |read_count_uplot_mask_pad1| image:: ../images/auto/qc/read_count_uplot_masking_and_padding1.png
.. |read_count_uplot_mask_pad2| image:: ../images/auto/qc/read_count_uplot_masking_and_padding2.png

.. _qc_report_sequence_length_dist_plots:

Sequence length distribution plots
----------------------------------

The sequence length distribution plots are histograms showing the
relative number of reads with different sequence lengths. The data
are analogous to those shown in the "Sequence Length Distribution"
module of ``fastqc``.

Typically for trimmed data the plots will look like:

.. image:: ../images/auto/qc/seq_dist_uplot.png
   :align: center

An example with a range of sequence lengths from an adapter-trimmed
miRNA-seq dataset, which shows peaks for shorter sequence lengths
followed by a long tail:

.. image:: ../images/auto/qc/seq_dist_uplot_slewed.png
   :align: center

For untrimmed data or other datasets where all sequences are the
same length, plots will look like:

.. image:: ../images/auto/qc/seq_dist_uplot_untrimmed.png
   :align: center

.. _qc_report_sequence_duplication_plots:

Sequence duplication summary plots
----------------------------------

The sequence duplication summary plots indicate the level of
sequence duplication in the data, according to the
"Sequence Duplication Levels" module of ``fastqc``.

The duplication level is the percentage of reads that would be
removed when reads with duplicated sequences (i.e. sequences that
appear in multiple reads) are counted as a single read. It is an
indication of the number of reads with distinct sequences within
the data (lower duplication indicates more distinct sequences).

(See the Biostars thread "Revisiting the FastQC read duplication
report" for more explanation of the deduplication in ``fastqc``.)

In the plots the solid portion of the bar represents the fraction
of reads removed by deduplication, and the colour of the bar
indicates which category the data fall into depending on the
level of reads remaining:

* Red indicates less than 20% of reads remain after deduplication
  (i.e. more than 80% of reads were duplicates)
* Orange indicates 20-30% of reads remain (i.e. between 70% and
  80% of reads were duplicates)
* Blue indicates more than 30% of reads remain (i.e. less than
  70% of reads were duplicates)

.. note::

   The thresholds used in this plot differ from those used by
   ``fastqc``.

The background of the plot also uses lighter versions of these
colours to indicate the thresholds.

For example:

.. table:: Example sequence duplication plots
   :widths: auto

   ============================ ===================================
   Example                      Interpretation
   ============================ ===================================
   |dup_uplot_fail|             Fail: more than 80% of reads are
                                duplicated
   |dup_uplot_warn|             Warn: between 70% and 80% of reads
                                are duplicated
   |dup_uplot_pass|             Pass: less than 70% of reads are
                                duplicated
   |dup_uplot_bg|               Plot background with no data (to
                                show thresholds for pass, warn and
                                fail)
   ============================ ===================================

.. |dup_uplot_pass| image:: ../images/auto/qc/duplication_uplot_pass.png
.. |dup_uplot_warn| image:: ../images/auto/qc/duplication_uplot_warn.png
.. |dup_uplot_fail| image:: ../images/auto/qc/duplication_uplot_fail.png
.. |dup_uplot_bg| image:: ../images/auto/qc/duplication_uplot_bg.png

.. _qc_report_adapter_summary_plots:

Adapter content summary plots
-----------------------------

The adapter content summary plots condense the data from the
"Adapter Sequences" module of ``fastqc`` into a single metric, to
indicate the proportion of adapter sequences in a Fastq file.

A single adapter fraction is obtained for each adapter class
detected by ``fastqc`` by calculating the fraction of the plot area
which lies under the curve for that adapter in the "Adapter
Content" plots. This is then represented as a bar where the
coloured portion corresponds to the fraction for each adapter.

.. note::

   The colours of the bar match the colours used by ``fastqc``
   for different adapter classes.

For example:

.. table:: Example adapter content plots
   :widths: auto

   ============================ ===================================
   Example                      Interpretation
   ============================ ===================================
   |adapter_uplot_no_adptrs|    No adapter content detected (bar is
                                grey)
   |adapter_uplot_adptrs_sml|   Small amount of adapter content
                                detected (bar is partially solid,
                                with green colour indicating
                                presence of Nextera Transposase
                                sequences)
   |adapter_uplot_adptrs_lrg|   Significant adapter content
                                detected (more than 50% of the bar
                                is solid, with red colour
                                indicating presence of Illumina
                                Universal Adapter sequences)
   ============================ ===================================

.. |adapter_uplot_no_adptrs| image:: ../images/auto/qc/adapter_uplot_no_adptrs.png
.. |adapter_uplot_adptrs_sml| image:: ../images/auto/qc/adapter_uplot_adptrs_sml.png
.. |adapter_uplot_adptrs_lrg| image:: ../images/auto/qc/adapter_uplot_adptrs_lrg.png

.. _qc_report_insert_size_dist_plots:

Insert size distribution plots
------------------------------

These plots are small versions of the insert size distribution
histograms output by Picard's ``CollectInsertSizeMetrics`` utility.

For example:

.. image:: ../images/auto/qc/picard_insert_size_uplot.png
   :align: center

The insert size metrics are also collated across all samples into
a single TSV file (see :ref:`qc_collated_picard_insert_sizes`).

.. _qc_report_qualimap_rnaseq_coverage:

Qualimap coverage plots
-----------------------

These plots summarise the mean coverage profile of all transcripts
with non-zero coverage, as produced by Qualimap's RNA-seq analysis;
essentially these are the data from the
*coverage_profile_along_genes_(total).txt* output file.

For example:

.. image:: ../images/auto/qc/qualimap_gene_body_coverage_uplot.png
   :align: center

.. _qc_report_qualimap_rnaseq_origin_genomic_reads:

Qualimap origin of genomic reads plots
--------------------------------------

These bar charts summarise the genomic origin of reads data from
Qualimap's RNA-seq analysis; specifically they indicate the fraction
of the read alignments which fall into exonic, intronic and
intergenic regions.

For example:

.. image:: ../images/auto/qc/qualimap_genomic_origin_reads.png
   :align: center

.. _qc_report_single_library_analyses:

***********************
Single library analyses
***********************

For 10x Genomics datasets single library analyses may also have
been performed for each sample using the ``count`` command of the
appropriate 10x Genomics pipeline (e.g. ``cellranger`` for
scRNA-seq data, ``cellranger-atac`` for scATAC-seq data etc).

In these cases an additional summary table will appear in the
report with appropriate metrics for each sample along with links
to the HTML reports from the ``count`` command. For example, for
an scRNA-seq dataset:

.. image:: ../images/qc/qc_report_single_library_summary.png
   :align: center

The reported metrics will depend on the pipeline and type of data.

Details of the contents of the linked ``web_summary.html`` report
can be found in the appropriate documentation for the 10x Genomics
pipeline:

* ``cellranger``: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/summary
* ``cellranger-atac``: https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/summary
* ``cellranger-arc``: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/summary

.. note::

   When single library analysis has been performed, the full set of
   outputs can be found under the ``cellranger_count`` subdirectory
   of the project directory.

.. note::

   For single cell multiome datasets there may also be a summary
   table for CellRanger ARC's single library analyses; similarly
   for CellPlex datasets a summary table for the multiplexing
   analysis may be present (in both cases depending on the
   contents of the QC directory).

.. _qc_report_dataset_wide_metrics:

********************************
Dataset-wide metrics and reports
********************************

This section contains any dataset-wide metrics and additional
reports, including:

* :ref:`qc_rseqc_gene_body_coverage`
* :ref:`qc_collated_picard_insert_sizes`
* :ref:`qc_multiqc_report`

.. _qc_rseqc_gene_body_coverage:

RSeQC gene body coverage plot
------------------------------

This is the gene body coverage plot generated by RSeQC's
``genebody_coverage.py`` utility, as a PNG. For example:

.. image:: ../images/qc/qc_report_rseqc_gene_body_coverage.png
   :align: center

.. _qc_collated_picard_insert_sizes:

Collated insert sizes
---------------------

This is a TSV (tab-separated values) file which contains the
following data for each aligned Fastq, from the output of Picard's
``CollectInsertSizeMetrics`` command:

* BAM file name
* Mean insert size
* Standard deviation
* Median insert size
* Median absolute deviation

.. _qc_multiqc_report:

MultiQC report
--------------

The HTML report generated by MultiQC when run on the QC directory.

.. _qc_report_qc_outputs_per_fastq:

*************************
Full QC outputs per Fastq
*************************

After the summary table, the full QC outputs for each Fastq or
Fastq pair are grouped by sample, for example:

.. image:: ../images/qc/qc_outputs_per_fastq.png
   :align: center

For each Fastq the subsections consist of:

* ``fastqc`` outputs including the sequence quality boxplot and a
  table of the quality metrics with links to the full report:

  .. image:: ../images/qc/fastqc_full.png

* ``fastq_screen`` outputs for each screen, for example:

  .. image:: ../images/qc/fastq_screen_full.png

* ``fastq_strand`` data:

  .. image:: ../images/qc/strandedness_full.png