Binary text classification using positive and unlabeled data in application to Twitter

Hinterndorfer, David

doi:10.34726/hss.2020.74421

Record link:

https://doi.org/10.34726/hss.2020.74421
http://hdl.handle.net/20.500.12708/15248

Title:

Binary text classification using positive and unlabeled data in application to Twitter

Citation:

Hinterndorfer, D. (2020). Binary text classification using positive and unlabeled data in application to Twitter [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2020.74421

reposiTUm DOI:

10.34726/hss.2020.74421

CatalogPlus:

AC15713507

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Hinterndorfer, David

Advisor:

Hanbury, Allan

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2020

Number of Pages:

Keywords:

Text classification

Abstract:

Binary text classification using only positive and unlabeled examples, often referred to as positive-unlabeled (PU) learning, is a well-studied area of machine learning (ML). However, most proposed techniques have been evaluated only in very generic experimental setups, so their suitability for specific domains such as Twitter is unclear. This work therefore addresses PU learning problems in the Twitter domain. The aim is to evaluate existing approaches in this domain and then to develop a PU learning approach optimized for it. Following an explorative strategy, we developed a new domain-optimized PU approach. The optimizations comprise, first, feature engineering in the form of the identification, analysis, incorporation, and evaluation of domain-specific features and, second, the combination of those features with the best-suited algorithms in a new PU approach. The result is a new two-step approach, called TWLR-SVM, which uses a domain-optimized version of Weighted Logistic Regression in the first step and a Support Vector Machine (SVM) in the second step. To evaluate and verify the individual optimizations, we conducted repeated quantitative experiments and analyzed their results. Every experiment considered two use cases in the Twitter domain: topic detection and spam detection. The results indicate that our approach delivers very precise predictions and outperforms all existing PU learning approaches from the literature across different scenarios and use cases in the Twitter domain. Specifically, TWLR-SVM outperforms the existing PU approaches in terms of average F1-score by 6.37 percentage points on topic detection and by 4.38 percentage points on spam detection.
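
The general shape of a two-step PU scheme like the one named in the abstract can be illustrated with a small, dependency-free Python sketch. This is not the thesis's TWLR-SVM: the class-size-based weighting, the 0.5 threshold for picking reliable negatives, the plain gradient-descent trainers, the toy 2-D points, and all function names are assumptions made purely for illustration. Step 1 fits a weighted logistic regression that treats unlabeled examples as negatives and keeps the low-scoring unlabeled examples as reliable negatives; step 2 trains a linear SVM on the labeled positives versus those reliable negatives.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_weighted_logreg(X, y, sample_weight, lr=0.1, epochs=2000):
    """Batch gradient descent on a per-example weighted log loss; y in {0, 1}."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    total = sum(sample_weight)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi, ci in zip(X, y, sample_weight):
            err = ci * (sigmoid(dot(w, xi) + b) - yi)
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / total for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Batch subgradient descent on L2-regularised hinge loss; y in {-1, +1}."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # margin violated
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def two_step_pu(positives, unlabeled, threshold=0.5):
    """Return (reliable_negatives, (w, b)) for the final linear SVM."""
    n_p, n_u = len(positives), len(unlabeled)
    # Step 1: unlabeled examples are provisionally treated as negatives.
    # Assumed weighting: each class weighted by the other class's share,
    # so both classes contribute equally to the loss.
    X = positives + unlabeled
    y = [1.0] * n_p + [0.0] * n_u
    weights = [n_u / (n_p + n_u)] * n_p + [n_p / (n_p + n_u)] * n_u
    w1, b1 = train_weighted_logreg(X, y, weights)
    reliable_negatives = [x for x in unlabeled
                          if sigmoid(dot(w1, x) + b1) < threshold]
    # Step 2: SVM on labeled positives vs. reliable negatives only.
    X2 = positives + reliable_negatives
    y2 = [1] * n_p + [-1] * len(reliable_negatives)
    return reliable_negatives, train_linear_svm(X2, y2)

def predict(model, x):
    w, b = model
    return 1 if dot(w, x) + b >= 0 else -1

# Toy data (hypothetical): labeled positives, then an unlabeled set whose
# first two points are hidden positives and whose rest are negatives.
positives = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0), (2.0, 2.0)]
unlabeled = [(2.0, 1.5), (1.5, 2.5),
             (-1.0, -1.0), (-3.0, -1.0), (-1.0, -3.0), (-3.0, -3.0), (-2.0, -2.0)]

if __name__ == "__main__":
    reliable_negatives, model = two_step_pu(positives, unlabeled)
    print("reliable negatives:", reliable_negatives)
    print("prediction for (2.0, 1.5):", predict(model, (2.0, 1.5)))
```

In a real tweet classifier the 2-D tuples would be replaced by text feature vectors (for example bag-of-words counts plus domain-specific features), and the threshold and weights would need tuning on held-out data.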

License:

In Copyright

Appears in Collections:

Thesis