Project

Search schemes for sequence alignment to pan-genome graphs.

Code

3F012921

Duration

01 November 2021 → 31 October 2025

Funding

Research Foundation - Flanders (FWO)

Promotor

Jan Fostier

Research disciplines

Natural sciences
- Analysis of next-generation sequence data
- Development of bioinformatics software, tools and databases
Engineering and technology
- Bio-informatics
- High performance computing

Keywords

bio-informatics, Approximate string matching, Sequence-to-graph alignment, Pan-genomics

Project description

Pan-genomics is a rapidly evolving field, driven by the fast-growing number of sequenced individual genomes. Because pan-genome data structures and functionality are widely applicable, we will develop scalable, graph-based pan-genome representations together with algorithms that search them efficiently. The main innovation in the search functionality is the detection of non-contiguous occurrences of reads in the pan-genome. By allowing jumps within the pan-genome graph while aligning a read, our algorithms will be able to infer the origin of a newly sequenced species as a mosaic of multiple related species. A second goal is compatibility with long, high-error reads (Pacific Biosciences or Oxford Nanopore Technologies, with error rates of up to 15%) in addition to short, low-error reads (Illumina). We aim to achieve this by developing novel seed identification algorithms that improve the seed-and-extend paradigm. Specifically, we will study pan-genome graph representations based on the Burrows-Wheeler transform (BWT), since they require little memory and, thanks to recent algorithmic developments on bidirectional BWT-based indexes and search schemes, support lossless approximate pattern matching. Search schemes will be used for seed identification.
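
To make the BWT-based seeding idea concrete, the sketch below shows exact backward search on a BWT index of a plain string, followed by pigeonhole seeding: a read with at most k errors, split into k + 1 parts, must contain at least one part that matches exactly. This is only an illustration under simplifying assumptions and not the project's method. It indexes a linear text rather than a pan-genome graph, builds the suffix array naively, uses a unidirectional index, and uses the plain pigeonhole partition, which search schemes generalize. All function names (`build_bwt`, `build_fm_index`, `backward_search`, `seed_candidates`) are hypothetical.

```python
def build_bwt(text):
    """Return the suffix array and BWT of text + '$' (naive construction)."""
    text += "$"  # sentinel, assumed absent from the text and smaller than all symbols
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to '$'
    return sa, bwt


def build_fm_index(bwt):
    """Return the C table and occurrence table used by backward search."""
    alphabet = sorted(set(bwt))
    # C[c]: number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix array interval [lo, hi) matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_exact(text, pattern):
    """Return the sorted start positions of all exact occurrences."""
    sa, bwt = build_bwt(text)
    C, occ = build_fm_index(bwt)
    lo, hi = backward_search(pattern, C, occ, len(bwt))
    return sorted(sa[lo:hi])


def seed_candidates(text, read, k):
    """Pigeonhole seeding: split read into k + 1 parts and match each exactly.

    Returns candidate start positions of the whole read. The offset
    arithmetic assumes substitutions only; with insertions or deletions
    the true start may be shifted by up to k positions.
    """
    sa, bwt = build_bwt(text)
    C, occ = build_fm_index(bwt)
    parts = k + 1
    bounds = [i * len(read) // parts for i in range(parts + 1)]
    candidates = set()
    for start, end in zip(bounds, bounds[1:]):
        lo, hi = backward_search(read[start:end], C, occ, len(bwt))
        candidates.update(sa[i] - start for i in range(lo, hi))
    return sorted(candidates)
```

For example, `find_exact("banana", "ana")` locates both overlapping occurrences, and `seed_candidates("TTACGTGCAATT", "ACGAGCAA", 1)` recovers the read's location even though it contains one substitution, because its second half still matches exactly. In a seed-and-extend aligner, each candidate would then be verified by an extension step, which is omitted here.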