- Natural sciences
  - Analysis of next-generation sequence data
  - Development of bioinformatics software, tools and databases
- Engineering and technology
  - Bio-informatics
  - High performance computing
Pan-genomics is a rapidly evolving field, driven by the fast-growing number of sequenced individual genomes. Because pan-genome data structures and their functionality are widely applicable, we will develop scalable, graph-based pan-genome representations together with algorithms that enable efficient search. The main innovation in the search functionality is the detection of non-contiguous occurrences of reads in the pan-genome. By allowing jumps within the pan-genome graph while aligning a read, our algorithms will be able to infer the origin of a newly sequenced species as a mosaic of multiple related species. A second goal is compatibility with long, high-error reads (Pacific Biosciences or Oxford Nanopore Technologies, with error rates of up to 15%) in addition to short, low-error reads (Illumina). We aim to achieve this by developing novel seed identification algorithms that improve the seed-and-extend paradigm. Specifically, we will study pan-genome graph representations based on the Burrows-Wheeler transform (BWT), because they require little memory and, thanks to recent algorithmic developments in bidirectional BWT-based indexes and search schemes, support lossless approximate pattern matching. These search schemes will be used for seed identification.
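
To make the BWT-based seeding idea concrete, the following is a minimal sketch, not the project's method. It assumes a single linear reference string rather than a pan-genome graph, and it uses plain pigeonhole seeding (a read with at most k errors, split into k + 1 parts, contains at least one part that matches exactly), which is the simple baseline that search schemes generalize. All function names (`build_fm_index`, `backward_search`, `pigeonhole_seeds`) are hypothetical and chosen for illustration only.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (illustrative, not memory-efficient).

    Assumes the sentinel '$' does not occur in `text` and sorts before
    every other character (true for ASCII letters).
    """
    text += "$"
    # Suffix array by naive sorting of all suffixes; fine for small examples only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of exact occurrences of `pattern`."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def pigeonhole_seeds(read, max_errors, index):
    """Split `read` into max_errors + 1 parts and look each part up exactly.

    Returns (offset_in_read, position_in_text) pairs to be extended later.
    Assumes len(read) >= max_errors + 1.
    """
    parts = max_errors + 1
    size = len(read) // parts
    hits = []
    for p in range(parts):
        start = p * size
        end = len(read) if p == parts - 1 else start + size
        for pos in backward_search(read[start:end], *index):
            hits.append((start, pos))
    return hits


index = build_fm_index("ACGTACGTTACG")
exact_hits = backward_search("ACG", *index)
# The read differs from the reference substring at its last character.
seeds = pigeonhole_seeds("ACGTTACC", 1, index)
```

In a seed-and-extend pipeline, each seed would then be extended and verified by alignment around the reported position. Search schemes replace the exact lookup of each part with a bounded-error bidirectional search, which typically yields fewer and more specific candidates.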