Complete annotation of the genome is imperative for understanding development, health, and disease. Nevertheless, the annotation of protein-coding genes is far from complete. Micropeptides in particular, small proteins of fewer than 100 amino acids, have historically been underrepresented in gene annotation databases.
In this project, I will develop a machine-learning-based algorithm to discover novel micropeptides encoded in annotated long non-coding RNAs and circular RNAs. I will then apply this algorithm to large RNA sequencing transcriptomes from human and to the reference annotations of mouse, Arabidopsis, and yeast to generate an in silico predicted micropeptidome. Subsequently, I will validate the existence of these micropeptides at scale using large volumes of public tandem mass spectrometry data. To perform these analyses, I will rely on Ionbot, a state-of-the-art sequence database search algorithm developed in-house that can perform open modification and open mutation searches. In parallel, I will create proteome-wide in silico spectral libraries and use them for spectral library searching on the same data. Finally, I will report all findings in a custom public micropeptide database.
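To make the first step concrete, the sketch below shows one way candidate micropeptides could be enumerated from a transcript before any machine learning model scores them. It is an illustration under stated assumptions, not the proposed algorithm: the function name `find_small_orfs`, the ATG-only start codon rule, the forward-strand-only scan, and the 10 amino acid lower bound are all hypothetical choices; only the upper bound of fewer than 100 amino acids comes from the text above. Open reading frames that span a circular RNA back-splice junction are not handled.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_small_orfs(transcript, min_aa=10, max_aa=99):
    """Return (start, end, peptide) for each ATG-to-stop ORF on the forward strand.

    Hypothetical helper: min_aa is an illustrative threshold, max_aa=99 reflects
    the "fewer than 100 amino acids" definition of a micropeptide.
    Coordinates are 0-based; end is exclusive and includes the stop codon.
    """
    seq = transcript.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        peptide = []
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            aa = CODON_TABLE.get(codon, "X")  # "X" for codons with ambiguous bases
            if start is None:
                # Open an ORF at the first in-frame ATG after the previous stop.
                if codon == "ATG":
                    start = pos
                    peptide = ["M"]
            elif aa == "*":
                # Stop codon closes the ORF; keep it only if it is micropeptide-sized.
                if min_aa <= len(peptide) <= max_aa:
                    orfs.append((start, pos + 3, "".join(peptide)))
                start = None
            else:
                peptide.append(aa)
    return orfs
```

Calling `find_small_orfs(sequence)` on each long non-coding RNA sequence would yield a list of candidate peptides that a classifier could then rank, and that could be appended to a protein sequence database for the mass spectrometry searches. ORFs that run off the end of the transcript without a stop codon are discarded in this sketch.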