Probabilistic modeling of high-dimensional and high-throughput biological data

Godsey, Brian

doi:10.34726/hss.2013.21386

Record citation link:

https://doi.org/10.34726/hss.2013.21386
http://hdl.handle.net/20.500.12708/6883

Title:

Probabilistic modeling of high-dimensional and high-throughput biological data

Citation:

Godsey, B. (2013). Probabilistic modeling of high-dimensional and high-throughput biological data [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2013.21386

reposiTUm-DOI:

10.34726/hss.2013.21386

CatalogPlus:

AC10775506

Publication type:

Thesis - Dissertation

Language:

English

Authors:

Godsey, Brian

Supervisor:

Filzmoser, Peter

Co-supervisors:

Frommlet, Florian

Organizational unit:

E105 - Institut für Statistik und Wahrscheinlichkeitstheorie

Date (published):

2013

Extent:

Keywords:

probabilistic models; high-dimensional; high-throughput; gene; bayesian

probabilistic modeling; gene; bayesian; clustering

Abstract:

Probabilistic modeling is a type of statistical analysis that focuses on the inherent randomness of natural systems. It avoids reducing any quantity to a summary value, such as an expected value, before the final results, and even then the results are often reported alongside a measure of probability or certainty. All statistical methods involve some measure of certainty, but probabilistic modeling emphasizes the uncertainty of all quantities, including intermediate values and data points. High-dimensional data consist of relatively few measurements of a large number of quantities, and tools that measure thousands of quantities simultaneously, and so produce high-dimensional data, are called "high-throughput". A popular example in bioinformatics is a microarray experiment, which may comprise fewer than twenty measurements of each of thousands of genes. Depending on the statistical model chosen, such data often lead to under-determined systems, in which many possible parameter solutions produce the same result. There are several ways to address this under-determinedness, among which probabilistic modeling has clear theoretical and practical advantages. Analysis of high-dimensional data benefits from probabilistic modeling primarily because accounting for inherent randomness lets a probabilistic model consider not only whether one or many parameter solutions exist, but also the probability or likelihood of each particular solution. For instance, if a data set contains replicates, the reproducibility (a type of inherent randomness) can be exploited to prefer the most reproducible good solution. The biological sciences contain many such situations, in which we have relatively few measurements of many quantities, and thus provide the opportunity to demonstrate the usefulness of probabilistic modeling while solving some important biological problems.
This work presents probabilistic modeling as an approach to data-oriented scientific problems, covering both knowledge-based model design and parameter inference via two popular inference algorithms: variational Bayesian learning and Markov chain Monte Carlo (MCMC) sampling. The probabilistic framework is then applied to three problem classes in the biological sciences: gene-gene interaction, microRNA-gene targeting, and the prediction and comparison of athletic performances. In all three cases, the inference and/or prediction proved valuable both for understanding the underlying system and for indicating likely candidates for further study, perhaps via more sensitive individual experiments. Overall, this work presents the methods and illustrative applications that detail the concept and process of probabilistic modeling of high-dimensional data, allowing interested researchers to follow similar steps in their own work to create and derive insight from probabilistic models of biological systems.

Probabilistic modeling is a form of statistical analysis that focuses on the inherent randomness of natural systems and avoids quantitative summaries, such as averaging, until the last step of the analysis, and even then an uncertainty or probability remains. Probability naturally plays a role in every statistical analysis, but probabilistic modeling emphasizes that every value always carries a degree of uncertainty. High-dimensional data consist of relatively few data points for a large number of variables. Measuring instruments that can measure thousands of values simultaneously, and thereby produce high-dimensional data, came to be called high-throughput. In bioinformatics, an example would be a microarray experiment, in which one typically performs fewer than twenty measurements, but each measurement covers thousands of genes. Depending on the methods used, the subsequent analysis is therefore often under-determined, in that there are many possible solutions for the system that can all yield the same results or data. Probabilistic modeling has various theoretical and practical advantages that help to overcome these problems. The analysis of high-dimensional data is a good application for probabilistic modeling because it takes randomness into account, allowing a model to consider all possible solutions and then rank them by probability in order to say more precisely which solutions fit better. This work presents probabilistic modeling as an approach to data-oriented scientific problems; it comprises the development of models based on domain knowledge and parameter inference by well-known methods: variational Bayesian learning and Markov chain Monte Carlo (MCMC) sampling.
These methods are then applied to three classes of biological problems: gene-gene interactions, microRNA-gene targeting, and the prediction and comparison of performances in track and field athletics. In all three cases, the probabilistic inference or prediction proves valuable for understanding the natural systems from which the data originate and for identifying candidates for further study. In general, this work describes probabilistic methods and demonstrates applications that illustrate the concept and process of probabilistic modeling. High-dimensional data and systems are a scientific area for which this approach is particularly well suited. The goal is to make such methods accessible to all interested scientists.

Further information:

Includes a summary in German

License:

Protected by copyright

Appears in collections:

Thesis