Dürnberger, G. (2012). Classification of nucleic acid binding proteins by non parametric statistical methods and machine learning [Dissertation, Technische Universität Wien]. reposiTUm. http://hdl.handle.net/20.500.12708/160357
nucleic acid binding proteins; non parametric statistical methods; machine learning; SVM; Mass spectrometry
en
Abstract:
Interactions between proteins and nucleic acids are fundamental to many biological processes beyond nuclear gene transcription, including RNA homeostasis, protein translation, and pathogen sensing for innate immunity. Transcriptional regulation is often facilitated by nucleic acid binding proteins (NABPs) that bind to specific nucleic acid sequence motifs. So far, research has focused mainly on these sequence-specific proteins, although they constitute only a fraction of all NABPs. The aim of this study is to obtain an unbiased classification of both groups of NABPs and thereby provide a resource covering both sequence-specific and non-sequence-specific NABPs. Affinity purification, combined with the resolution of current mass spectrometry approaches, makes it possible to obtain a near proteome-wide picture of these interactions. Here, 25 systematically designed oligonucleotides were used to probe three human cell lines for NABPs. Overall, more than 10,000 interactions were detected between the nucleic acid baits and almost one thousand unique proteins. This work evaluates statistical methods for deriving a global classification from this experimental proteomics data set. Non-parametric statistical methods allowed classification of proteins with high sensitivity. Applying these methods made it possible to determine the specificity of 174 NABPs for different classes of nucleic acids. The knowledge extracted by the analysis is evaluated against available annotation and validated by additional experiments. Among these, we could show cytosine-methylation-specific binding of Y-box-binding protein 1 (YB-1), which was previously unknown and has potential implications for cancer research.
Sequence analysis of the detected proteins revealed candidate nucleic acid binding protein domains.
Furthermore, based on the extracted classification, a Support Vector Machine was trained that predicts NABPs with high specificity based solely on amino acid sequence.