auto_process_ngs.qc.fastqc

class auto_process_ngs.qc.fastqc.Fastqc(fastqc_dir)

Wrapper class for handling outputs from FastQC

The Fastqc object gives access to various aspects of the outputs of the FastQC program.

adapter_content_plot(inline=False)
property data

Return a FastqcData instance

property dir

Path to the directory with the FastQC outputs

property html_report

Path to the associated HTML report file

plot(module, inline=False)
quality_boxplot(inline=False)
property summary

Return a FastqcSummary instance

property version

Version of FastQC that was used

property zip

Path to the associated ZIP archive

class auto_process_ngs.qc.fastqc.FastqcData(data_file)

Class representing data from a Fastqc data file

Reads in the data from a fastqc_data.txt file and makes it available programmatically.

To create a new FastqcData instance:

>>> fqc = FastqcData('fastqc_data.txt')

To access a field in the ‘Basic Statistics’ module:

>>> nreads = fqc.basic_statistics('Total Sequences')
adapter_content_summary()

Return summary data for adapter content

Summarises the amount of adapter present in a Fastq file based on data in the Adapter Content section, assigning a decimal fraction for each adapter class.

The fraction is calculated by summing the fraction of adapter across all bases, and then normalising by the number of bases.

Returns:

mapping adapter names to the fraction representing the amount of adapter present in the Fastq.

Return type:

OrderedDict
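The calculation described above can be sketched in isolation. This is a minimal illustration, not the library's implementation. It assumes the Adapter Content lines are tab-delimited, that the first column is the base position, that the values are percentages, and that every row carries equal weight (position ranges are not weighted). The function name `adapter_fractions` and the adapter names are hypothetical.

```python
from collections import OrderedDict


def adapter_fractions(lines):
    """Summarise adapter content as one fraction per adapter (sketch).

    `lines` is a list of tab-delimited strings whose first item is the
    header line; this layout is an assumption for the example.
    """
    # Header: '#Position<TAB>Adapter A<TAB>Adapter B'
    adapters = lines[0].lstrip('#').split('\t')[1:]
    totals = OrderedDict((name, 0.0) for name in adapters)
    nrows = 0
    for line in lines[1:]:
        fields = line.split('\t')
        for name, value in zip(adapters, fields[1:]):
            # Convert each percentage to a fraction and accumulate it
            totals[name] += float(value) / 100.0
        nrows += 1
    if nrows == 0:
        return totals
    # Normalise the summed fractions by the number of rows (bases)
    return OrderedDict((name, total / nrows)
                       for name, total in totals.items())


# Illustrative data only (not taken from a real FastQC run)
example = [
    "#Position\tAdapter A\tAdapter B",
    "1\t0.0\t10.0",
    "2\t50.0\t10.0",
]
summary = adapter_fractions(example)
```

With this example data, `summary` maps 'Adapter A' to 0.25 and 'Adapter B' to approximately 0.1.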

basic_statistics(measure)

Access a data item in the Basic Statistics section

Possible values include:

  • Filename

  • File type

  • Encoding

  • Total Sequences

  • Sequences flagged as poor quality

  • Sequence length

  • %GC

Parameters:

measure (str) – key corresponding to a ‘measure’ in the Basic Statistics section.

Returns:

value of the requested ‘measure’

Return type:

String

Raises:

KeyError – if ‘measure’ is not found.
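To illustrate where these measures come from, the following sketch reads the Basic Statistics section directly from text in the layout of a fastqc_data.txt file. It is not the library's code. The helper name `parse_basic_statistics` is hypothetical, and the sketch assumes tab-delimited lines with modules delimited by `>>` and `>>END_MODULE` markers.

```python
def parse_basic_statistics(text):
    """Return the Basic Statistics measures as a dict of strings (sketch)."""
    stats = {}
    in_module = False
    for line in text.splitlines():
        if line.startswith('>>Basic Statistics'):
            in_module = True
        elif line.startswith('>>END_MODULE'):
            if in_module:
                # Reached the end of the Basic Statistics module
                break
        elif in_module and not line.startswith('#') and '\t' in line:
            # 'Measure<TAB>Value'; the '#Measure' header line is skipped
            key, value = line.split('\t', 1)
            stats[key] = value
    return stats


sample = "\n".join([
    "##FastQC\t0.11.3",
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    "Filename\tES1_GTCCGC_L008_R1_001.fastq.gz",
    "Total Sequences\t12317096",
    "Sequence length\t101",
    "%GC\t50",
    ">>END_MODULE",
])
stats = parse_basic_statistics(sample)
nreads = stats['Total Sequences']
```

As with basic_statistics, the values are strings, so a caller that needs a number has to convert it explicitly, for example with `int(nreads)`. Looking up a measure that is absent from the dictionary raises KeyError.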

data(module)

Return the raw data for a module

Returns the data for the specified module as a list of lines.

The first list item/line is the header line; data items within each line are tab-delimited.

For example:

>>> Fastqc('myfastq_fastq').data.data('Sequence Length Distribution')
['#Length       Count',
 '35    8826.0',
 '36    2848.0',
 '37    4666.0',
 '38    4524.0']
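Because the returned lines are plain tab-delimited strings, a caller typically has to split them before use. The sketch below shows one way to do that, using a literal list in the format shown above rather than a real Fastqc object. The variable names are illustrative.

```python
# Lines in the format returned for 'Sequence Length Distribution'
lines = [
    '#Length\tCount',
    '35\t8826.0',
    '36\t2848.0',
    '37\t4666.0',
    '38\t4524.0',
]

# First line is the header; drop the leading '#' and split on tabs
header = lines[0].lstrip('#').split('\t')

# Turn each remaining line into a dict keyed by the header names
rows = [dict(zip(header, line.split('\t'))) for line in lines[1:]]

# Values are strings, so convert before doing arithmetic
total_count = sum(float(row['Count']) for row in rows)
```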
property modules

List of the modules in the raw data

property path

Path to the fastqc_data.txt file

sequence_deduplication_percentage()

Return sequence deduplication percentage

Returns the percentage of sequences remaining after deduplication according to FastQC.

Returns:

percentage sequence deduplication from FastQC.

Return type:

Float
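As a rough illustration of where such a value can be found, the sketch below scans the lines of a Sequence Duplication Levels module for the `#Total Deduplicated Percentage` line that FastQC writes in that section. This is an assumption about the data layout and not a description of how the method is implemented. The function name and the numbers are hypothetical.

```python
def deduplication_percentage(lines):
    """Return the deduplicated percentage as a float, or None (sketch)."""
    for line in lines:
        if line.startswith('#Total Deduplicated Percentage'):
            # Line has the form '#Total Deduplicated Percentage<TAB>value'
            return float(line.split('\t')[1])
    return None


# Illustrative values only (not taken from a real FastQC run)
module_lines = [
    '#Total Deduplicated Percentage\t27.5',
    '#Duplication Level\tPercentage of deduplicated\tPercentage of total',
    '1\t80.0\t22.0',
    '2\t20.0\t78.0',
]
dedup = deduplication_percentage(module_lines)
```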

property version

FastQC version number

class auto_process_ngs.qc.fastqc.FastqcSummary(summary_file=None)

Class representing data from a Fastqc summary file

property failures

Return modules with failures

Returns a list with the names of the modules that have status ‘FAIL’.

html_report()

Return the path of the HTML report from FastQC

html_table(relpath=None)

Generate HTML table for FastQC summary

Parameters:

relpath (str) – optional, if supplied then links in the table will be relative to this path

Return link to the result of a specified FastQC module

Parameters:
  • name (str) – name of the module (e.g. ‘Basic Statistics’)

  • full_path (boolean) – optional, if True then return the full path; otherwise return just the anchor (e.g. ‘#M1’)

  • relpath (str) – optional, if supplied then specifies the path that full paths will be made relative to (implies full_path is True)

property passes

Return modules with passes

Returns a list with the names of the modules that have status ‘PASS’.

property warnings

Return modules with warnings

Returns a list with the names of the modules that have status ‘WARN’.
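The passes, warnings and failures properties correspond to the status column of the summary file. The sketch below shows how lines in the layout of a FastQC summary.txt file could be grouped by status. It assumes three tab-delimited columns (status, module name, Fastq file name) and is not the library's implementation. The helper name `group_by_status` is hypothetical.

```python
def group_by_status(text):
    """Group module names by their PASS/WARN/FAIL status (sketch)."""
    statuses = {'PASS': [], 'WARN': [], 'FAIL': []}
    for line in text.splitlines():
        if not line.strip():
            continue  # Ignore blank lines
        status, module, _filename = line.split('\t')
        statuses.setdefault(status, []).append(module)
    return statuses


fq = "ES1_GTCCGC_L008_R1_001.fastq.gz"
summary_text = "\n".join([
    "PASS\tBasic Statistics\t" + fq,
    "FAIL\tPer base sequence content\t" + fq,
    "WARN\tPer sequence GC content\t" + fq,
    "FAIL\tSequence Duplication Levels\t" + fq,
    "PASS\tAdapter Content\t" + fq,
])
grouped = group_by_status(summary_text)
failures = grouped['FAIL']
```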

auto_process_ngs.qc.fastqc.logger = <Logger auto_process_ngs.qc.fastqc (WARNING)>

Example Fastqc summary text file (FASTQ_fastqc/summary.txt):

PASS	Basic Statistics	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Per base sequence quality	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Per tile sequence quality	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Per sequence quality scores	ES1_GTCCGC_L008_R1_001.fastq.gz
FAIL	Per base sequence content	ES1_GTCCGC_L008_R1_001.fastq.gz
WARN	Per sequence GC content	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Per base N content	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Sequence Length Distribution	ES1_GTCCGC_L008_R1_001.fastq.gz
FAIL	Sequence Duplication Levels	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Overrepresented sequences	ES1_GTCCGC_L008_R1_001.fastq.gz
PASS	Adapter Content	ES1_GTCCGC_L008_R1_001.fastq.gz
FAIL	Kmer Content	ES1_GTCCGC_L008_R1_001.fastq.gz

Head of the FastQC data file (FASTQ_fastqc/fastqc_data.txt), which contains the raw numbers for the plots etc.:

##FastQC	0.11.3
>>Basic Statistics	pass
#Measure	Value
Filename	ES1_GTCCGC_L008_R1_001.fastq.gz
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	12317096
Sequences flagged as poor quality	0
Sequence length	101
%GC	50
>>END_MODULE
>>Per base sequence quality	pass
#Base	Mean	Median	Lower Quartile	Upper Quartile	10th Percentile	90th Percentile
1	32.80553403172306	33.0	33.0	33.0	33.0	33.0
…