Accurate Identification of Sequencing Errors using Probabilistic Graphical Models

01 October 2018 → 30 September 2022
Research Foundation - Flanders (FWO)
Research disciplines
  • Natural sciences
    • Genetics
    • Systems biology
  • Medical and health sciences
    • Molecular and cell biology
Sequencing Errors
Project description

Many applications in molecular biology rely on the analysis of next generation sequencing data.
However, the presence of sequencing errors in raw sequencing data challenges these applications
to properly discriminate between true biological signal and sequencing noise. We believe that
current methodology can be improved. The research question of this proposal is thus how to
maximally exploit all information present in raw sequencing data to detect and correct sequencing
errors.
We propose a methodology to identify sequencing errors by not only looking at each individual
position (e.g. using read coverage support, quality scores) but also at the context in which a
putative sequencing error occurs. Raw sequencing data is often represented in a graph structure
called a de Bruijn graph. We will exploit a graph-theoretical property of these de Bruijn graphs and
integrate multiple de Bruijn graph representations in a single framework to make full use of the
contextual information.
This additional contextual information will result in a high-dimensional dataset, but we posit that
probabilistic graphical models are ideally suited to deal with this in a statistically sound manner.
We believe our methodology will improve various bioinformatics applications such as read
correction, genome assembly and variant calling.