Project

Search schemes for sequence alignment to pan-genome graphs.

Code
3F012921
Duration
01 November 2021 → 31 October 2025
Funding
Research Foundation - Flanders (FWO)
Promotor
Research disciplines
  • Natural sciences
    • Analysis of next-generation sequence data
    • Development of bioinformatics software, tools and databases
  • Engineering and technology
    • Bio-informatics
    • High performance computing
Keywords
bio-informatics, Approximate string matching, Sequence-to-graph alignment, Pan-genomics
 
Project description

Pan-genomics is a rapidly evolving field, driven by the fast-growing number of sequenced genomes of individuals. Because pan-genome data structures and their functionality are widely applicable, we will develop scalable, graph-based pan-genome representations together with algorithms that enable efficient search functionality.

The main innovation in the search functionality is the detection of non-contiguous occurrences of reads in the pan-genome. By allowing jumps within the pan-genome graph when aligning a read to it, our algorithms will be able to infer the origin of a newly sequenced species as a mosaic composition of multiple, related species. A second goal is compatibility with long, high-error reads (Pacific Biosciences or Oxford Nanopore Technologies, with error rates of up to 15%) in addition to short, low-error reads (Illumina). We aim to accomplish this by developing novel seed identification algorithms that improve the seed-and-extend paradigm.

Specifically, we will study pan-genome graph representations based on the Burrows-Wheeler transform (BWT), since they require little memory and, thanks to recent algorithmic developments on bidirectional BWT-based indexes and search schemes, support lossless approximate pattern matching. Search schemes will be used for seed identification.
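
As a rough illustration of the BWT-based matching mentioned above, the following Python sketch counts exact occurrences of a pattern in a single linear sequence using backward search. It is a toy under stated assumptions, not the project's method: the function names `bwt` and `count_occurrences` are hypothetical, the index is built by sorting rotations (workable only for tiny inputs), there is no graph, and matching is exact rather than approximate. Search schemes build on this same step by extending a pattern in both directions on a bidirectional index while bounding the number of errors allowed in each part of the pattern.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    if "$" in text:
        raise ValueError("text must not contain the sentinel '$'")
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G and T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of a non-empty pattern by backward search."""
    # Character frequencies (identical in the text and in its BWT).
    freq = {}
    for ch in bwt_text:
        freq[ch] = freq.get(ch, 0) + 1

    # first[c] = number of characters strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(freq):
        first[ch] = total
        total += freq[ch]

    def occ(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_text[:i]. A real index would answer this
        # in constant time with rank structures; here it is a linear scan.
        return bwt_text.count(ch, 0, i)

    lo, hi = 0, len(bwt_text)  # half-open range of sorted-suffix rows
    for ch in reversed(pattern):  # process the pattern right to left
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGA")  # hypothetical example sequence
print(count_occurrences(index, "ACG"))
```

The range `[lo, hi)` narrows with every character, so the cost of a query depends on the pattern length rather than on the size of the indexed sequence, which is why this kind of index suits large pan-genomes.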