- Natural sciences
  - Genetics
  - Systems biology
- Medical and health sciences
  - Molecular and cell biology
Many applications in molecular biology rely on the analysis of next-generation sequencing data. However, sequencing errors in the raw data make it difficult for these applications to discriminate between true biological signal and sequencing noise. We believe that current methodology can be improved. The research question of this proposal is therefore how to maximally exploit all information present in raw sequencing data to detect and correct sequencing errors. We propose a methodology that identifies sequencing errors by looking not only at each individual position (e.g. using read coverage support and quality scores) but also at the context in which a putative sequencing error occurs. Raw sequencing data is often represented in a graph structure called a de Bruijn graph. We will exploit a graph-theoretical property of de Bruijn graphs and integrate multiple de Bruijn graph representations in a single framework to make full use of this contextual information. The additional contextual information will result in a high-dimensional dataset, but we posit that probabilistic graphical models are ideally suited to handle it in a statistically sound manner. We believe our methodology will improve various bioinformatics applications such as read correction, genome assembly and variant calling.
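To make the terminology concrete, the following minimal Python sketch builds a de Bruijn graph from a handful of toy reads and flags k-mers with low read coverage. It illustrates only the per-position coverage signal that the proposal aims to go beyond. It is not the proposed methodology and contains no probabilistic graphical model. The function names, the toy reads and the `min_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each observed
    k-mer is an edge from its prefix to its suffix, weighted by coverage."""
    graph = defaultdict(dict)
    for kmer, count in kmer_counts(reads, k).items():
        graph[kmer[:-1]][kmer[1:]] = count
    return graph


def low_coverage_kmers(reads, k, min_count=2):
    """Return k-mers seen fewer than min_count times (hypothetical threshold);
    these are candidate sequencing errors under a coverage-only criterion."""
    return {kmer for kmer, c in kmer_counts(reads, k).items() if c < min_count}


# Toy data (assumed): three identical reads and one with a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]

# Graphs for several k values can be built side by side, e.g. k = 3 and k = 4.
graphs = {k: de_bruijn_graph(reads, k) for k in (3, 4)}
suspects = low_coverage_kmers(reads, 3)
```

In this toy example the single substitution creates a low-coverage side branch at node `CG` in the k = 3 graph. A coverage threshold alone flags the three k-mers on that branch, whereas the surrounding graph context is the extra information the proposal intends to exploit.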