Project

Compressed Data Structures for Pan-genome Graphs

Code
bof/baf/4y/2024/01/287
Duration
01 January 2024 → 31 December 2025
Funding
Regional and community funding: Special Research Fund
Promotor
Research disciplines
  • Natural sciences
    • Development of bioinformatics software, tools and databases
Keywords
Algorithm development, r-index, Bioinformatics
 
Project description

At the core of numerous bioinformatics tools lies the FM-index, a full-text index enabling highly efficient sequence search. Nevertheless, tools based on the FM-index encounter limitations when confronted with large collections of genomes, as their memory demands increase proportionally with sequence volume. This is particularly cumbersome in the era of pan-genome analysis, where the use of a single reference genome is abandoned in favor of large collections of genomes from multiple individuals and/or related species.
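As a rough illustration of how an FM-index answers sequence queries, the sketch below builds a toy index and counts pattern occurrences by backward search. It is a minimal teaching example, not the project's implementation: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so its size grows linearly with the text.

```python
def build_fm_index(text):
    """Build a toy FM-index (hypothetical helper; naive construction for small inputs)."""
    s = text + "$"  # unique terminator, lexicographically smaller than A/C/G/T
    n = len(s)
    # Naive suffix array: sort the suffix start positions by the suffix itself.
    sa = sorted(range(n), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps around for position 0).
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # C[c] = number of characters in s that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed, linear in n).
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, n


def count_occurrences(pattern, C, occ, n):
    """Count exact occurrences of pattern using backward search."""
    lo, hi = 0, n  # half-open interval of suffix array rows matching the current suffix
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    C, occ, n = build_fm_index("banana")
    print(count_occurrences("ana", C, occ, n))
```

Each pattern character costs a constant number of table lookups, which is why search is fast; the uncompressed tables are also where the linear memory growth comes from.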

The r-index has recently emerged as a significantly more memory-efficient substitute for the FM-index. It is particularly suited for pan-genomes. By employing compressed versions of the Burrows-Wheeler Transform and the suffix array, it achieves sub-linear scaling of memory usage relative to sequence volume. Practical reductions in memory requirements, exceeding a factor of 10, have been demonstrated.
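The intuition behind the memory savings can be sketched as follows: the Burrows-Wheeler Transform of a collection of near-identical genomes consists of long runs of equal characters, so it can be stored run-length encoded, using space that depends on the number of runs r rather than on the total length n. The code below is a toy sketch under stated assumptions: the helper names are hypothetical, genomes are concatenated without separators, the BWT is built by naive suffix sorting, and the suffix array sampling that a real r-index also needs is left out.

```python
import random


def bwt(text):
    """Naive Burrows-Wheeler Transform; suitable for toy inputs only."""
    s = text + "$"  # unique terminator, smaller than A/C/G/T in ASCII order
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in order)


def run_length_encode(s):
    """Collapse s into a list of (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def count_runs(s):
    """Number of maximal runs of equal characters (the 'r' in r-index)."""
    return len(run_length_encode(s))


def mutate(genome, num_mutations, rng):
    """Return a copy of genome with num_mutations random substitutions."""
    bases = list(genome)
    for _ in range(num_mutations):
        pos = rng.randrange(len(bases))
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def toy_pan_genome(base, copies, mutations_per_copy, rng):
    """Concatenate base with (copies - 1) lightly mutated variants of it."""
    variants = [mutate(base, mutations_per_copy, rng) for _ in range(copies - 1)]
    return "".join([base] + variants)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy experiment is repeatable
    base = "".join(rng.choice("ACGT") for _ in range(300))
    for copies in (1, 2, 4, 8):
        transformed = bwt(toy_pan_genome(base, copies, 1, rng))
        print(copies, len(transformed), count_runs(transformed))
```

In this toy setting, adding more near-identical genomes increases n roughly linearly while r grows far more slowly, which is the effect the r-index exploits.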

However, the r-index comes with a performance penalty. Various techniques have been suggested, including by our research group, to bridge this performance gap. To radically expand the utility of the r-index, we aim to leverage our research group's expertise in lossless approximate pattern matching using search schemes and compacted pan-genome de Bruijn graphs. Our overarching objective is to develop highly memory-efficient pan-genome graph representations based on the r-index and apply them to practical bioinformatics applications in microbiology and human genetics.