- Natural sciences
- Development of bioinformatics software, tools and databases
At the core of numerous bioinformatics tools lies the FM-index, a full-text index enabling highly efficient sequence search. Nevertheless, tools based on the FM-index encounter limitations when confronted with large collections of genomes, as their memory demands increase proportionally with sequence volume. This is particularly cumbersome in the era of pan-genome analysis, where the use of a single reference genome is abandoned in favor of large collections of genomes from multiple individuals and/or related species.
The r-index has recently emerged as a significantly more memory-efficient substitute for the FM-index and is particularly suited for pan-genomes. By employing compressed versions of the Burrows-Wheeler Transform and the suffix array, it achieves sub-linear scaling of memory usage relative to sequence volume. In practice, reductions in memory requirements exceeding a factor of 10 have been demonstrated.
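The intuition behind this scaling can be illustrated with a small sketch. The space used by the r-index is governed by the number of runs r of equal characters in the Burrows-Wheeler Transform rather than by the text length n, and r tends to grow slowly when near-identical genomes are added. The Python code below is not the r-index itself: it uses a naive BWT construction that is only suitable for toy inputs, and the helper names (`bwt`, `count_runs`, `toy_pangenome`) as well as the mutation model (one random substitution per genome copy) are assumptions made purely for illustration.

```python
import random


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations (toy inputs only)."""
    assert "$" not in text
    s = text + "$"  # sentinel; '$' sorts before A, C, G and T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def toy_pangenome(base: str, copies: int, rng: random.Random) -> str:
    """Concatenate `copies` genomes, each equal to `base` up to one substitution."""
    genomes = []
    for _ in range(copies):
        g = list(base)
        pos = rng.randrange(len(g))
        g[pos] = rng.choice([c for c in "ACGT" if c != g[pos]])
        genomes.append("".join(g))
    return "".join(genomes)


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choice("ACGT") for _ in range(200))
    for copies in (1, 4, 16):
        text = toy_pangenome(base, copies, rng)
        r = count_runs(bwt(text))
        # n grows linearly with the number of copies; r is expected to grow far more slowly
        print(f"copies={copies:2d}  n={len(text):5d}  r={r:4d}  n/r={len(text) / r:.1f}")
```

Running the script prints n, r and the ratio n/r for collections of increasing size. A run-length compressed representation benefits exactly when n/r is large, which is the regime that highly similar genomes in a pan-genome are expected to produce.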
However, the r-index comes with a performance penalty. Various techniques have been suggested, including by our research group, to bridge this performance gap. To radically expand the utility of the r-index, we aim to leverage our research group's expertise in lossless approximate pattern matching using search schemes and compacted pan-genome de Bruijn graphs. Our overarching objective is to develop highly memory-efficient pan-genome graph representations based on the r-index and apply them to practical bioinformatics applications in microbiology and human genetics.