Natural sciences
- Machine learning and decision making
- Computational biomodelling and machine learning
- Structural bioinformatics and computational proteomics
Deep learning is rapidly delivering breakthroughs in a range of proteomics prediction tasks. However, these models generally do not yet exploit the long-range interactions between distant regions of a protein sequence that are imposed by its 3D structure. Context-aware embedding models, for which long-range dependency modelling is a key feature, have recently revolutionized natural language processing, and similar methods applied to protein representation learning have produced promising results. We aim to enrich such self-supervised embedding models by additionally training them for structural prediction, using publicly available protein structures. We will study how this can improve prediction performance on two structure-dependent downstream tasks. The first is protein secretability by yeast, a property of high impact in fundamental biology and biotechnology. The second is kinase substrate site prediction, which is important in pharmacology. For both tasks, labeled experimental datasets are available at the host lab, obtained through novel experimental methods and novel data extraction methods. Furthermore, for proteins with known 3D structures, we will use attribution methods to map the sequence and structural input features that the models learn to be decisive. In collaboration with the lab’s experimentalists, we will verify the fragments of the human proteome that our model predicts to be secretable, and evaluate predicted secretability-enhancing modifications.