Barcode Analysis Reports
========================

The demultiplexing stage of processing assigns reads to different
samples by matching the index sequence (aka "barcode index") of each
read against the reference index sequence for the samples, as defined
in the input ``SampleSheet.csv`` file. Reads which cannot be matched
against any sample are assigned to the "undetermined" category.

Typically the percentage of these undetermined reads should be low
compared to the total number of reads. However in cases where
demultiplexing hasn't worked well, the number of undetermined reads
may be unusually high (typically corresponding with an unexpectedly
low number of reads assigned to one or more samples). This can happen
artificially because of errors in the sample sheet, or because of
errors in the sequencing.

In these cases the barcode analysis report can be useful for
diagnosing and understanding problems with the demultiplexing, by
identifying the following issues:

* **Underrepresented samples**: samples which have extremely low or
  zero numbers of reads assigned to them;
* **Overrepresented unassigned index sequences**: barcodes which have
  high numbers of associated reads but which aren't assigned to a
  sample in the sample sheet.

.. note::

   By default the analysis discounts *individual* barcode sequences
   with extremely low associated reads (less than 0.0001% of the total
   number of reads) before grouping similar sequences according to
   allowed mismatches.

   This is done to remove large numbers of spurious sequences which
   would otherwise cause the analysis to take much longer to complete.

   However it also means that the reported read counts are no longer
   exact, so they should be treated as approximations and are likely
   to underreport the number of reads compared with those reported in
   the per-sample statistics.

Barcode analysis is normally run automatically as part of the
:doc:`auto_process make_fastqs <../using/make_fastqs>` command (for
standard sequencing runs).
It can also be run as a separate operation using the
:ref:`auto_process.py analyse_barcodes ` command.

The analysis counts the reads associated with each barcode sequence in
each lane across all the FASTQs generated for those lanes. It then
reports the "top" (i.e. most frequent) barcode sequences across the
run, for example:

::

   Barcodes which cover less than 0.1% of reads have been excluded
   Reported barcodes cover 93.9% of the data

   #Rank  Index  Sample   N_seqs  N_reads   %reads  (%Total_reads)
   1      AAGAT  KL05_L1  1       17623152  9.5%    (9.5%)
   2      CAATG  KL08_L1  1       12784251  6.9%    (16.4%)
   3      TAGTA  KL06_L1  1       12361719  6.7%    (23.1%)
   4      GTATT  KL15_L1  1       10011071  5.4%    (28.5%)
   ...

If an index sequence corresponds to a sample defined in the sample
sheet, then the name of the sample is also shown.

When demultiplexing has been successful, the top ranked barcodes
should all correspond to sample names. Any gaps or samples missing
from the top barcodes indicate that there may be a problem. For
example:

::

   #Rank  Index              Sample  N_seqs  N_reads  %reads  (%Total_reads)
   1      GCTACGCT+CTAAGCCT  SM9     1       2894178  13.3%   (13.3%)
   2      CAGAGAGG+AGAGTAGA  SM8     1       2454006  11.3%   (24.6%)
   3      TCCTGAGC+CTCTCTAT  SM4     1       2241825  10.3%   (34.9%)
   4      AGGCAGAA+CTCTCTAT  SM3     1       2206475  10.1%   (45.0%)
   5      CGAGGCTG+CTAAGCCT          1       2141506  9.8%    (54.9%)
   6      CTCTCTAC+AGAGTAGA  SM7     1       2094131  9.6%    (64.5%)
   7      TAGGCATG+TATCCTCT  SM6     1       1943974  8.9%    (73.4%)
   8      GGACTCCT+TATCCTCT  SM5     1       1868127  8.6%    (82.0%)
   9      CGTACTAG+TAGATCGC  SM2     1       1659786  7.6%    (89.7%)
   10     TAAGGCGA+TAGATCGC          1       942645   4.3%    (94.0%)
   11     CGGCAGAA+CTCTCTAT          1       61005    0.3%    (94.3%)
   12     CAGAGAGG+CGAGTAGA          1       38768    0.2%    (94.5%)
   13     CTCTCTAC+CGAGTAGA          1       26927    0.1%    (94.6%)

Here the 5th ranked barcode sequence doesn't have an associated sample
name.
This overrepresented sequence is highlighted after the list:

::

   The following unassigned barcodes are overrepresented compared to the assigned barcodes:
   #Index             N_reads  %reads
   CGAGGCTG+CTAAGCCT  2141506  9.84%

In addition the analysis reports any underrepresented and missing
samples, for example:

::

   The following samples are underrepresented:
   #Sample  Index              N_reads  %reads
   SM1      CAGAGAGG+AGAGTAGT  9971     0.05%
   SM10     CGAGGCTG+AAGCCTCT           <0.05%

.. note::

   If weeding of the initial data has been performed (which is the
   default) then the counts are only approximate, so it's not possible
   to say whether underrepresented samples with zero counts are really
   missing. They are therefore reported as having read fractions below
   the cutoff threshold, as in the example above.

The analysis is performed in the ``barcode_analysis`` subdirectory by
default, where the report is written as text, XLS and HTML files.

The raw counts are also cached in files in the
``barcode_analysis/counts/`` subdirectory. This means that the
analysis can be rerun quickly with different parameters (for example a
different sample sheet or cut-off) if desired.

Tuning the reporting
--------------------

The ``analyse_barcodes`` command supports a number of options to allow
"tuning" of the analysis and reporting:

* ``--cutoff``: by default barcodes are excluded from the analysis if
  they are associated with fewer than 0.1% of all the reads counted.
  Use this option to adjust the read limit to report more or fewer
  barcodes.
* ``--mismatches``: by default only exact matches to barcode sequences
  are considered (i.e. zero mismatches). If the number of mismatches
  specified by this option is greater than zero then similar barcodes
  will be grouped together.
* ``--lanes``: by default the barcodes and counts from all lanes are
  combined and analysed together. Use this option to restrict analysis
  to a subset of lanes.
* ``--sample-sheet``: by default the sample names and corresponding
  reference barcodes are taken from the sample sheet used in the
  ``make_fastqs`` stage; this option can be used to specify a
  different sample sheet file to use.

.. note::

   When ``--cutoff`` is used with ``--mismatches`` then the read
   cutoff is applied **after** the grouping.

Running analyse_barcodes.py directly
------------------------------------

The ``analyse_barcodes.py`` utility can be run directly, either using
the cached counts produced by the ``analyse_barcodes`` command, or
from scratch, to perform more nuanced analyses of the barcode
sequences if required.

``analyse_barcodes.py`` operates in two stages:

1. Counting the frequency of each index sequence in the target FASTQs
2. Analysis of these raw counts to determine groups of similar barcode
   sequences (optionally), and report the most numerous barcodes (or
   groups) matched against reference sequences from a sample sheet.

The first stage can be very time consuming so the counts can be output
to an intermediate ``counts`` file using the ``-o`` option::

   analyse_barcodes.py ... -o SAMPLE.counts SAMPLE.fq

.. note::

   To suppress analysis and reporting when generating counts use the
   ``--no-report`` option.

Multiple analyses can be performed using the cached counts, which are
reloaded into the program using the ``-c`` option::

   analyse_barcodes.py ... -c SAMPLE.counts

Multiple counts files can be combined via the ``-c`` option::

   analyse_barcodes.py ... -c SAMPLE_1.counts SAMPLE_2.counts ...

.. note::

   The ``analyse_barcodes`` command generates counts files for each
   FASTQ file, in the ``barcode_analysis/counts/`` directory, using
   the naming convention of ``FASTQ.counts``.

By default the results of the analysis are written to stdout; use the
``-r`` option to specify an output file instead.

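
The two stages described above can be illustrated with a minimal Python sketch. This is not the code used by ``analyse_barcodes.py``: the function names ``count_barcodes`` and ``report`` are hypothetical, the sketch assumes Illumina-style read headers that end in ``:<index sequence>``, and it omits both the grouping of similar barcodes by mismatches and the counts file format.

```python
from collections import Counter

def count_barcodes(fastq_lines):
    """Stage 1 (illustrative): count the index sequence in each read header."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # first line of every 4-line FASTQ record
            # Assumption: header ends with ":<index sequence>"
            counts[line.strip().rsplit(":", 1)[-1]] += 1
    return counts

def report(counts, samples, cutoff=0.001):
    """Stage 2 (illustrative): rank barcodes and attach sample names."""
    total = sum(counts.values())
    rows = []
    for barcode, n_reads in counts.most_common():
        if n_reads / total < cutoff:
            continue  # drop barcodes below the read fraction cutoff
        rows.append((barcode, samples.get(barcode, ""), n_reads,
                     100.0 * n_reads / total))
    return rows

# Build a tiny in-memory FASTQ: 3 reads for AAGAT, 2 for CAATG, 1 for TTTTT
fastq_lines = []
for barcode, n in (("AAGAT", 3), ("CAATG", 2), ("TTTTT", 1)):
    for _ in range(n):
        fastq_lines += ["@INST:1:FC:1:1:1:1 1:N:0:%s" % barcode,
                        "ACGT", "+", "IIII"]

counts = count_barcodes(fastq_lines)
# Hypothetical sample-to-barcode mapping, as would come from a sample sheet
samples = {"AAGAT": "KL05_L1", "CAATG": "KL08_L1"}
rows = report(counts, samples, cutoff=0.2)
for rank, (barcode, sample, n_reads, pct) in enumerate(rows, start=1):
    print("%d\t%s\t%s\t%d\t%.1f%%" % (rank, barcode, sample, n_reads, pct))
```

Keeping the counting separate from the reporting mirrors why the cached counts are useful: the expensive pass over the FASTQs happens once, and the cheap ranking step can be repeated with a different cutoff or sample mapping.
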

Analysing undetermined barcodes only
------------------------------------

Currently this can be done by running the ``analyse_barcodes.py``
utility directly on the cached counts for just the "undetermined"
FASTQ files, for example:

::

   analyse_barcodes.py -c barcode_analysis/counts/Undetermined*.counts

Advanced: analysing sequences instead of barcodes
-------------------------------------------------

The ``analyse_barcodes.py`` utility can be used to count sequences (or
sequence fragments) as well as barcodes, by specifying the
``--sequences`` command line option.

Analysis of partial sequences can be performed by combining this
option with the ``--seq-start`` and ``--seq-end`` options to specify
the limits of the sequence fragments to be extracted and counted.

For example:

::

   analyse_barcodes.py SAMPLE.fq \
       --sequences --seq-start=5 --seq-end=20 \
       -o sequences.counts

will count the sequence fragments from ``SAMPLE.fq``, taken from the
5th to the 20th base in each read.
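
To make the fragment limits concrete, the following Python sketch counts sub-sequences in the same way the example above describes. It is an illustration only: ``count_fragments`` is a hypothetical helper, and it assumes the limits are 1-based and inclusive, which is inferred from the "5th to the 20th base" wording rather than from the utility's source.

```python
from collections import Counter

def count_fragments(reads, seq_start, seq_end):
    """Count fragments spanning bases seq_start..seq_end of each read.

    Assumption: limits are 1-based and inclusive.
    """
    counts = Counter()
    for read in reads:
        # Convert the 1-based inclusive limits into a Python slice
        counts[read[seq_start - 1:seq_end]] += 1
    return counts

reads = [
    "AAAACCCCGGGGTTTTACGTACGT",
    "TTTTCCCCGGGGTTTTACGTTTTT",
    "AAAAGGGGCCCCAAAATTTTACGT",
]
fragments = count_fragments(reads, 5, 20)
for fragment, n_reads in fragments.most_common():
    print(fragment, n_reads)
```

The first two reads differ only outside bases 5 to 20, so they contribute to the same 16-base fragment, while the third read yields a different one.
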