Project

Compressed Data Structures for Pan-genome Graphs

Code

bof/baf/4y/2024/01/287

Duration

01 January 2024 → 31 December 2025

Funding

Regional and community funding: Special Research Fund

Promotor

Jan Fostier

Research disciplines

Natural sciences
- Development of bioinformatics software, tools and databases

Keywords

Algorithm development, r-index, Bioinformatics

Project description

At the core of numerous bioinformatics tools lies the FM-index, a full-text index enabling highly efficient sequence search. Nevertheless, tools based on the FM-index encounter limitations when confronted with large collections of genomes, as their memory demands increase proportionally with sequence volume. This is particularly cumbersome in the era of pan-genome analysis, where the use of a single reference genome is abandoned in favor of large collections of genomes from multiple individuals and/or related species.
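To make the idea of FM-index search concrete, the following is a minimal sketch of backward search over a Burrows-Wheeler Transform. It is illustrative only and is not the project's implementation. The function names (`bwt`, `build_c_table`, `count_occurrences`) are hypothetical, the BWT is built by naively sorting rotations, and occurrence counts are computed by scanning the string, whereas a real FM-index would use sampled rank structures.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before every
    other character (true for "$" versus ASCII letters).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_c_table(bwt_str):
    """C[ch] = number of characters in the text strictly smaller than ch."""
    counts = Counter(bwt_str)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern using backward search.

    The half-open interval [lo, hi) tracks the sorted rotations that are
    prefixed by the pattern suffix processed so far.
    """
    c_table = build_c_table(bwt_str)
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Occ(ch, i) is computed naively here by scanning bwt_str[:i].
        lo = c_table[ch] + bwt_str[:lo].count(ch)
        hi = c_table[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index)                            # annb$aa
    print(count_occurrences(index, "ana"))  # 2
```

Each pattern character costs two rank queries regardless of text length, which is why search is fast. The index itself, however, stores information for every text position, which is the memory cost described above.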

The r-index has recently emerged as a significantly more memory-efficient substitute for the FM-index. It is particularly suited for pan-genomes. By employing compressed versions of the Burrows-Wheeler Transform and the suffix array, it achieves sub-linear scaling of memory usage relative to sequence volume. Practical reductions in memory requirements, exceeding a factor of 10, have been demonstrated.
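The memory behaviour described above can be illustrated with a small sketch that counts the number of runs r (maximal blocks of equal characters) in the Burrows-Wheeler Transform, since run-length compressed structures scale with r instead of with the text length. This is a toy illustration under stated assumptions and not the r-index itself: the helper names (`bwt`, `count_runs`) and the example sequence are hypothetical, the BWT is built naively, and exact copies of one sequence stand in for a collection of highly similar genomes.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations (suitable for toy inputs only)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(bwt_str):
    """Number of maximal runs of identical characters in the BWT."""
    return sum(1 for _ in groupby(bwt_str))


if __name__ == "__main__":
    genome = "ACGTTGCAAGCT"  # hypothetical toy sequence
    for copies in (1, 2, 4, 8):
        collection = genome * copies
        transformed = bwt(collection)
        # Text length grows linearly with the number of copies, while the
        # run count r is expected to grow far more slowly.
        print(copies, len(transformed), count_runs(transformed))
```

Real pan-genomes contain mutations between individuals, so r grows with the amount of variation and not with the number of genomes; the sketch only shows the limiting case of identical copies.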

However, the r-index comes with a performance penalty. Various techniques have been suggested, including by our research group, to bridge this performance gap. To radically expand the utility of the r-index, we aim to leverage our research group's expertise in lossless approximate pattern matching using search schemes and compacted pan-genome de Bruijn graphs. Our overarching objective is to develop highly memory-efficient pan-genome graph representations based on the r-index and apply them to practical bioinformatics applications in microbiology and human genetics.