Linear discriminant analysis for high dimensional data : a comparison of sparse classification methods

Hoffmann, Irene

doi:10.34726/hss.2014.22581

Record link:

https://doi.org/10.34726/hss.2014.22581
http://hdl.handle.net/20.500.12708/5807

Title:

Linear discriminant analysis for high dimensional data : a comparison of sparse classification methods

Citation:

Hoffmann, I. (2014). Linear discriminant analysis for high dimensional data : a comparison of sparse classification methods [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2014.22581

reposiTUm DOI:

10.34726/hss.2014.22581

CatalogPlus:

AC11581330

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Hoffmann, Irene

Advisor:

Filzmoser, Peter

Organisational Unit:

E105 - Institut für Statistik und Wahrscheinlichkeitstheorie

Date (published):

2014

Number of Pages:

Keywords:

discriminant analysis; sparsity

Abstract:

Linear discriminant analysis is a popular and widely used method for supervised classification, which performs well under various circumstances. This thesis discusses in detail two linear classification methods that lead to equivalent models, namely Fisher's LDA and the optimal scoring approach, each of which is defined by its own optimization problem. Both methods are limited to data in which the number of observations exceeds the number of predictor variables. The optimization problems are therefore modified for the high-dimensional setting by adding a penalty term and estimating the within-class covariance as a diagonal matrix. The resulting methods are penalized LDA and sparse discriminant analysis, respectively. With an L1 penalty term, both methods yield sparse models. Model quality is judged by the accuracy of the predicted class memberships and by sparsity, which is evaluated through the reliable identification of the influential variables and the reduction of noise. A simulation study is conducted to investigate the strengths and weaknesses of the methods when applied to data with a high percentage of noise. Further, the quality of the models is evaluated on two data sets from gene expression experiments.
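The restriction to data with more observations than variables can be illustrated with a small numerical sketch. The example below is not taken from the thesis: the simulated data, the dimensions (20 observations, 50 variables, 2 classes) and the helper name `within_class_covariance` are assumptions made purely for illustration. It shows that the pooled within-class covariance matrix is singular when the variables outnumber the observations, so it cannot be inverted as classical LDA requires, whereas a diagonal estimate of the same matrix remains invertible.

```python
import numpy as np


def within_class_covariance(X, y):
    """Pooled within-class covariance estimate (p x p matrix)."""
    n, p = X.shape
    classes = np.unique(y)
    S = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        centered = Xk - Xk.mean(axis=0)  # remove the class mean
        S += centered.T @ centered
    return S / (n - len(classes))


rng = np.random.default_rng(0)
n, p = 20, 50                      # fewer observations than variables
y = np.repeat([0, 1], n // 2)      # two classes of equal size
X = rng.normal(size=(n, p))
X[y == 1, :5] += 2.0               # illustrative: only 5 variables carry signal

S_w = within_class_covariance(X, y)
rank_full = np.linalg.matrix_rank(S_w)    # at most n - 2, hence S_w is singular

S_diag = np.diag(np.diag(S_w))            # diagonal within-class estimate
rank_diag = np.linalg.matrix_rank(S_diag)

print(f"rank of full estimate:     {rank_full} (p = {p})")
print(f"rank of diagonal estimate: {rank_diag} (p = {p})")
```

The diagonal estimate ignores correlations between variables, which is the price paid for invertibility. The L1 penalty mentioned in the abstract is a separate ingredient and is not shown in this sketch.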

Additional information:

Abstract in German. Bibliography: pp. 61-62

License:

In Copyright

Appears in Collections:

Thesis