Protein phosphorylation is the most ubiquitous of the post-translational modifications, regulating a wide variety of essential functions from cell-cycle progression to DNA damage repair. Phosphorylation is regulated by the kinases, a superfamily of proteins that forms the third largest protein family in the human genome. While advances in high-throughput mass spectrometry have led to the identification of hundreds of thousands of phosphorylation sites, the kinases that regulate these phosphorylation events have largely remained unidentified. Knowing which kinase is responsible for a phosphorylation event is often crucial for understanding the function of the modification; however, the transient nature of kinase binding means that identifying genuine kinase-binding events in vivo is both difficult and expensive.
The vast majority of methods for computationally predicting kinase binding targets rely primarily on sequence features. A lack of specificity in many kinase-binding motifs means that valid binding patterns can be found by chance throughout the proteome, leaving such methods susceptible to high false-positive rates. However, the determinants of phosphorylation are not limited to sequence; kinases are regulated through various cellular processes, including mediating or activating proteins, localisation and cell cycle-specific expression. While such information has become increasingly accessible through proteomic databases, incomplete coverage, variable certainty and the heterogeneous nature of context and sequence information mean that integrating the relevant features into a computational model is non-trivial.
In this thesis I present a method for the probabilistic integration of these two aspects of kinase regulation, context and sequence, into a Bayesian network model that can accurately predict kinase substrates. In the first part of the thesis I demonstrate how a model that incorporates knowledge of kinase-substrate phosphorylation, protein interactions and protein abundance across the cell cycle can be used to classify kinase substrates. The model achieves a high level of prediction accuracy as determined by cross-validation, obtaining an average AUC of 0.86 across all kinases tested. When the model is applied to complement sequence-based kinase-specific phosphorylation site prediction using previously published methods, it improves prediction performance for most of the comparisons made. As a validation of these ideas, I also show how protein interaction networks can be coupled with gene expression data to predict changes in phosphorylation status in response to varying cell treatment conditions.
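As an illustration of the evaluation metric quoted above, the following minimal sketch computes an AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, then averages it across kinases. The function names, kinase labels and scores are hypothetical toy values chosen for illustration and are not taken from the thesis; the cross-validation procedure itself is not reproduced here.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example scores higher; ties count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def mean_auc(scores_by_kinase):
    """Average the per-kinase AUCs (each entry: (positives, negatives))."""
    aucs = [auc(pos, neg) for pos, neg in scores_by_kinase.values()]
    return sum(aucs) / len(aucs)


# Hypothetical prediction scores for two made-up kinases.
toy_scores = {
    "kinase_A": ([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]),
    "kinase_B": ([0.6, 0.5], [0.1, 0.2]),
}
print(mean_auc(toy_scores))
```

The quadratic pairwise loop is used only because it mirrors the definition directly; a rank-based computation would be preferable for large score sets.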
To integrate kinase-binding affinity into the modelling framework, I present a method for classifying kinase-binding sites from sequence, which captures features from the linear motifs surrounding known kinase-specific phosphorylation sites. This method incorporates observed position-specific amino acid frequencies and counts of co-occurring neighbouring amino acids into a Bayesian network model. The model is trained to discriminate between a kinase's binding profile, that of its family members, and a phosphorylation background. I show how this sequence model can be integrated as a module into the larger context model, allowing for a comprehensive description of the factors that influence kinase binding. Compared with the first context model, this integration of context and sequence increases kinase-substrate prediction accuracy by over 50% at low false-positive levels. I find that this system of predicting kinase substrates, coupled with predicting kinase binding sites from sequence, outperforms existing kinase-specific phosphorylation site classifiers: at a strict specificity level of 99.9%, my method predicts kinase-specific phosphorylation sites with an average of 9–22% greater sensitivity than the alternatives. The method, named PhosphoPICK, has been made freely available as a web service.
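To make the idea of position-specific amino acid frequencies concrete, the sketch below builds a smoothed frequency table from aligned peptides and scores a candidate site by its log-odds against a background. This is a deliberately simplified position-weight model, not the Bayesian network described above: it ignores co-occurring neighbouring residues and the kinase family profile. The peptides, window width and pseudocount are assumptions made purely for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def position_frequencies(peptides, pseudocount=1.0):
    """Per-position amino acid frequencies with additive smoothing.

    `peptides` are equal-length windows centred on the modified residue.
    """
    width = len(peptides[0])
    if any(len(p) != width for p in peptides):
        raise ValueError("all peptides must have the same length")
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    table = []
    for i in range(width):
        counts = Counter(p[i] for p in peptides)
        table.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return table


def log_odds(peptide, kinase_freqs, background_freqs):
    """Sum of per-position log ratios; positive values favour the kinase."""
    return sum(
        math.log(k[aa] / b[aa])
        for aa, k, b in zip(peptide, kinase_freqs, background_freqs)
    )


# Hypothetical 5-residue windows with the phosphorylated serine at the centre+1.
kinase_sites = ["RRASV", "RKASL", "KRLSI"]
background_sites = ["AGPSE", "LDESP", "GTQSA"]

kinase_freqs = position_frequencies(kinase_sites)
background_freqs = position_frequencies(background_sites)

print(log_odds("RRGSV", kinase_freqs, background_freqs))  # basic residues upstream
print(log_odds("AGESP", kinase_freqs, background_freqs))  # background-like window
```

The pseudocount keeps unseen residues from producing zero probabilities, which matters when only a handful of sites are known for a kinase. Peptides containing non-standard residue codes would need extra handling that this sketch omits.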
A predictor that integrates the context and sequence conditions regulating phosphorylation makes it possible to approach problems in phosphorylation that were previously infeasible. Non-synonymous single nucleotide polymorphisms (nsSNPs) have the potential to disrupt (or introduce) kinase binding sites through the modification of key amino acids that mediate kinase activity. To validate that PhosphoPICK accurately represents the biological characteristics determining whether phosphorylation occurs, I developed a method that applies PhosphoPICK to predict phosphorylation loss and gain caused by variants. The method quantifies the expected effect of an nsSNP on phosphorylation based on predictions from the sequence model and the probability that a query kinase will target the variant protein. Using distributions of predicted variant effects across the proteome, the method can provide a measure of the significance of novel variants. Evaluating the method on known examples from the literature of variants causing phosphorylation loss or gain, I show that PhosphoPICK can detect the positive examples at strict specificity levels.
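One way to read the significance measure described above is as an empirical p-value: the change in predicted score caused by a variant is compared against a background distribution of such changes. The sketch below illustrates that reading only. The scoring difference, the two-sided comparison by magnitude, the add-one correction and all numeric values are assumptions for illustration, and the sketch does not include the kinase-targeting probability that the method also uses.

```python
def variant_effect(reference_score, variant_score):
    """Change in predicted phosphorylation score caused by a variant.

    Negative values suggest predicted loss, positive values predicted gain.
    """
    return variant_score - reference_score


def empirical_p_value(effect, background_effects):
    """Fraction of background effects at least as large in magnitude.

    The add-one correction prevents a p-value of exactly zero when the
    observed effect exceeds every background value.
    """
    extreme = sum(1 for b in background_effects if abs(b) >= abs(effect))
    return (extreme + 1) / (len(background_effects) + 1)


# Hypothetical background of score changes for variants across a proteome.
background = [0.01, -0.02, 0.05, -0.3, 0.1, 0.0, -0.04, 0.02, 0.6]

# Hypothetical variant that removes a strongly predicted site.
effect = variant_effect(reference_score=0.9, variant_score=0.2)
print(effect, empirical_p_value(effect, background))
```

With a realistic background of many thousands of variants, the resolution of the empirical p-value improves accordingly; with the nine toy values here it cannot fall below 0.1.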
While the methodology presented in this work was developed for phosphorylation, it should be considered a framework that could be applied to other biological processes. Sequence motifs and protein interactions are necessary elements of a broad range of biological processes, including post-translational modifications other than phosphorylation. The small ubiquitin-like modifier (SUMO), for example, acts on defined sequence motifs, but is also highly dependent on the cellular context in which SUMO substrates operate. The methods I describe offer an approach to other protein prediction problems, such as SUMOylation, where the integration of context and sequence characteristics can provide a comprehensive description of the relevant regulatory features.