Eukaryotic cells are composed of two large compartments, the nucleus and the cytoplasm. The nucleus is of primary importance: it is considered the control center that supervises the metabolic functioning of the cell and ultimately determines the cell’s characteristics. Different macromolecules, including RNAs, which are transcribed in the nucleus, and proteins, which are translated in the cytoplasm, cross the nuclear envelope and work in a dynamic fashion. Multiple cellular functions, e.g., DNA replication, DNA damage repair, and gene expression control, are also performed inside the nucleus.
Nuclear localization signals (NLSs) provide binding sites for transport proteins (known as importins or karyopherins) during regulated nuclear import. To date, hundreds of NLS motifs have been identified, but some remain poorly defined because of the low over-representation and low sequence conservation in the NLS data.
Computational models of nucleo-cytoplasmic trafficking are required to explain (1) how proteins are accurately targeted to the nucleus via different import pathways, (2) what localization signals are employed, (3) what functions the nuclear proteins perform, and (4) why import sometimes goes awry in biological terms. Surprisingly, relatively little attention has been directed toward understanding the dynamics of nucleo-cytoplasmic transport. To answer the above questions, we first need to evaluate the available data resources and techniques.
Advancements in high-throughput technologies in biological research and progress in proteomic projects have led to an extensive increase in proteomic data. This increase has produced diverse large-scale proteomic data that represent the positions and locations of molecules involved in nucleo-cytoplasmic trafficking. Bayesian networks can integrate diverse and large-scale data to develop models of molecular systems, such as the “nuclear import of proteins”. Such models can naturally link data features and improve our ability to predict the determinants of the nuclear import of proteins. Moreover, these methods can help us to explain the functions of proteins in the nucleus.
This thesis addresses four questions: (1) How can we best use the available machine learning algorithms to integrate large-scale heterogeneous data and to develop models of molecular systems? (2) How and when is a protein accurately targeted to the nucleus? (3) How can we distinguish between real and spurious NLSs? (4) How is the efficiency of nuclear proteins modulated by their abundance?
First, we provide a review of the available data resources that can help us to build models of the “nuclear import of proteins”. We then develop models of the nuclear import of proteins by integrating heterogeneous data. Specifically, by using Bayesian networks, we propose models that combine nuclear localization signals (NLSs), protein interaction data and protein sequence data. These models accurately predict the nuclear import of proteins, NLSs and protein-karyopherin interactions, surpassing the classification accuracy of standard nuclear import predictors.
Second, we develop a novel method, discriminative local motif (DLocalMotif), which uses positional information of localization signals together with negative data to distinguish between real and spurious NLSs. We show that, because of the low over-representation but high spatial confinement of the NLS data, the available methods fail to operate on such data. We discover several novel functional motifs within proline-tyrosine (PY)-NLSs that overlap with C2H2 zinc finger domains. We also apply this new method to several other localization signals, such as the peroxisomal targeting signal-1 (PTS1) and the endoplasmic reticulum (ER) retention signal, and we discover novel biologically meaningful motifs.
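The distinction between over-representation and spatial confinement can be made concrete with a small sketch. This is not the DLocalMotif scoring scheme itself; it is an illustrative toy under stated assumptions: the function names, the five-bin positional histogram, the pseudocount and the example sequences are all hypothetical. It compares how often a candidate motif occurs in positive versus negative sequences and how tightly its occurrences cluster at one relative position.

```python
import math
from collections import Counter

def motif_positions(seqs, motif):
    """Relative start (0.0 = N-terminus, 1.0 = C-terminus) of the first
    motif occurrence in each sequence that contains the motif."""
    positions = []
    for s in seqs:
        i = s.find(motif)
        if i >= 0:
            positions.append(i / max(len(s) - len(motif), 1))
    return positions

def overrepresentation(pos_seqs, neg_seqs, motif, pseudo=1.0):
    """Log-ratio of the (pseudocount-smoothed) fraction of positive vs.
    negative sequences that contain the motif. Hypothetical score."""
    p = (sum(motif in s for s in pos_seqs) + pseudo) / (len(pos_seqs) + 2 * pseudo)
    n = (sum(motif in s for s in neg_seqs) + pseudo) / (len(neg_seqs) + 2 * pseudo)
    return math.log(p / n)

def spatial_confinement(positions, bins=5):
    """One minus the normalised entropy of binned relative positions.
    1.0 means every occurrence falls in the same bin. Hypothetical score."""
    if not positions:
        return 0.0
    counts = Counter(min(int(x * bins), bins - 1) for x in positions)
    total = len(positions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(bins)

# Made-up toy sequences: the motif "PY" sits at the C-terminus of every
# positive sequence but is scattered (or absent) in the negatives.
positives = ["AAAAAAAAPY", "GGGGGGGGPY", "LLLLLLLLPY"]
negatives = ["PYAAAAAAAA", "AAAAPYAAAA", "AAAAAAAAAA"]

print(overrepresentation(positives, negatives, "PY"))
print(spatial_confinement(motif_positions(positives, "PY")))
print(spatial_confinement(motif_positions(negatives, "PY")))
```

In this toy data the motif is only weakly over-represented, since it also occurs in two of the three negative sequences, yet it is perfectly confined in the positives. A method that scores over-representation alone would find little signal here, which is the situation described above for NLS data.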
Last, we focus on developing a protein abundance model to better understand how the efficiency of transcription factors is modulated by their abundance. Specifically, we develop a Bayesian network model that uses genomic and proteomic data and predicts protein abundance and the folding energy of a transcript. This model improves our understanding of how to link the transcriptional environment with proteomic data, and we investigate how its predictions can be used to understand cellular functions more accurately. Because it naturally links genomic and proteomic data features, we show that our model is highly accurate compared to other available methods. We use our model to illustrate how it improves the analysis of protein regulation during the cell cycle.
The methods proposed in this thesis will help researchers to predict the regulated nuclear import of proteins across different species. By predicting NLS locations, NLS-karyopherin interactions and protein abundance, the models can collectively demonstrate whether the accumulation of cargo protein is sufficient to perform specific functions in the nucleus.