Already a few years ago, reports appeared in popular computer science magazines that molecular biology data is growing exponentially [1]. More recently, there have been concerns that NGS/sequencing data is growing faster than computer storage capacities, despite the exponential growth of this storage [2, 3]. This thesis deals with these large amounts of data and is therefore clearly located in the field of bioinformatics. Additionally, the classic paradigm of 'one gene/protein - one function or phenotype' has shifted from being the main approach to just one of several options, most of which combine large amounts of information to arrive at a conclusion [4-7]. Several terms exist for this: systems biology, the -omics field, integrative analysis, and a few more. The present work makes a broad sweep of the field, from whole-genome microarrays, through metabolomics, to sequencing data, with sidetracks into the complexities of a combinatorial problem, dimension reduction, and transposons. The steady goal is to gain general insights into the full data collection and/or to point out other promising procedures. The work on differentially expressed genes in tumors started with the applicant's diploma thesis and was continued in a later article. The main result is that, although the overlaps of differentially expressed gene lists from tumors of the same type seem random, these lists share elements at the level of the protein interaction network. The measurement of metabolite levels from tissues is still an immature field; currently, several hundred different metabolites can be distinguished, and at the time of the dataset used for my analysis, about 100 metabolites had been reliably identified. In comparison to the p ≫ n problems in bioinformatics (i.e., many more variables than data records), this is rather a standard problem, and a classification can be made with known machine learning methods.
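To illustrate the kind of standard classification such metabolite profiles permit, the following is a minimal sketch with synthetic data and a simple nearest-centroid classifier; the data, group sizes, and classifier choice are invented for illustration and are not the models actually used in the thesis.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid (mean profile) per class from training samples."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(X, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# Synthetic "metabolite" profiles: 40 tissue samples x 100 metabolites,
# with two groups shifted in a handful of key metabolites.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0          # group difference confined to 5 metabolites

classes, centroids = nearest_centroid_fit(X, y)
pred = nearest_centroid_predict(X, classes, centroids)
accuracy = (pred == y).mean()
```

Inspecting which metabolites drive the centroid differences is one simple way such a model can point to candidate key metabolites.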
We were thus able to create classification models with which we could identify key metabolites in renal cell carcinoma. NGS data is basically a paragon of p ≫ n data, and this situation will not change for a while: several million variants can be found in a population with feasible effort, but far fewer than a million individuals are usually sequenced. In some cases, more variants may be found than individuals of the sequenced species even exist. These datasets present certain issues, which can be summarized as the curse of dimensionality and potential population structure. Since I have been working for the last few years on the 1001 Genomes Project [8], my main data source was the largest collection of sequenced Arabidopsis thaliana accessions. As a model species, A. thaliana offers several advantages: it is fast growing; recombinant inbred lines are possible; the genome is quite small; and there are no ethical concerns. On the other hand, it is a 'mere weed'. For such p ≫ n data, a subfield of machine learning, dimension reduction, is very helpful. We combined these fields for visualization and added a new measure of the 'quality' of the visualizations. For the transposons hidden in the 1001 genomes data, we developed a new transposon caller tool, which leverages our data in a better way. Additional challenges in a project of this scale are data collection, organization, the development of other calling pipelines, a final consistency check, and of course publishing it in suitably high-ranking papers. Apart from the last point, where I was just one of a group of people involved, these tasks were led by me for a long phase of the project. Another result that arose from the above-mentioned datasets is the solution of the combinatorial problem of obtaining an exact p-value when putative regulations are inferred and the unbiased validation is a set of proven transcription factors (TRANSFAC database [9]).
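The idea of scoring a visualization of p ≫ n data can be sketched as follows: embed the data with a dimension reduction method and then check how many high-dimensional nearest neighbours survive in the low-dimensional picture. This is only an illustrative sketch; the PCA-via-SVD embedding, the toy data, and the neighbourhood-preservation score below are generic textbook choices, not the quality measure introduced in the thesis.

```python
import numpy as np

def pca_embed(X, k=2):
    """Project centered data onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def knn_sets(X, k):
    """Indices of the k nearest neighbours of every point (excluding itself)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_preservation(X_high, X_low, k=10):
    """Fraction of high-dimensional k-NN that are preserved in the embedding."""
    nn_h, nn_l = knn_sets(X_high, k), knn_sets(X_low, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_h, nn_l)]
    return float(np.mean(overlap)) / k

rng = np.random.default_rng(1)
# p >> n toy data: 60 samples, 500 variables, signal living in a 2-D subspace
Z = rng.normal(size=(60, 2))
X = Z @ rng.normal(size=(2, 500)) + 0.1 * rng.normal(size=(60, 500))
Y = pca_embed(X, k=2)
score = neighborhood_preservation(X, Y, k=10)  # close to 1 = faithful picture
```

Because the toy data is essentially two-dimensional, a 2-D PCA embedding preserves most neighbourhoods; on real genotype data with population structure, such scores are typically much lower and make competing visualizations comparable.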
The outcome is that an exact solution is possible with a computational complexity of O(n³). This work resulted in some publications and in several useful insights which were unfortunately not enough for full papers; the latter are also described here.
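For the simplest variant of such a validation question, where predictions and the reference set overlap freely, an exact p-value is the classic hypergeometric tail, computable with exact integer arithmetic. This sketch shows that baseline only; the thesis's O(n³) algorithm addresses a more involved combinatorial setting, and the numbers below are invented for illustration.

```python
from math import comb

def exact_overlap_pvalue(N, K, n, x):
    """Exact P(overlap >= x) when n predictions are drawn from N candidates,
    of which K are in the validated reference set (hypergeometric tail)."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(x, min(K, n) + 1))
    return tail / total

# e.g. 1000 candidate regulators, 50 validated in the reference set,
# a method predicts 40 regulators, of which 8 turn out to be validated
p = exact_overlap_pvalue(N=1000, K=50, n=40, x=8)
```

Exact integer arithmetic avoids the numerical issues that approximations can cause in the extreme tails, which matters when many such p-values are compared.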