Honeypot LLM : creation of the scam conversation corpus

Eder, Christoph

doi:10.34726/hss.2025.122387

Record link:

https://doi.org/10.34726/hss.2025.122387
http://hdl.handle.net/20.500.12708/216289

Title:

Honeypot LLM : creation of the scam conversation corpus

Citation:

Eder, C. (2025). Honeypot LLM : creation of the scam conversation corpus [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.122387

reposiTUm DOI:

10.34726/hss.2025.122387

CatalogPlus:

AC17563770

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Eder, Christoph

Advisor:

Recski, Gábor

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2025

Number of Pages:

Keywords:

Dataset Creation; Scam Detection; Conversational Dataset; Online Fraud; LLM; Natural Language Processing (NLP)

Abstract:

This thesis presents the Scam Conversation Corpus (SCC), a unique dataset of conversations between GPT-4o, acting as a potential fraud victim, and genuine fraudsters. The dataset was created using a honeypot approach: a social media presence was established to attract fraudsters. We developed a response application that links the Large Language Model (LLM) to the communication platforms Instagram, Telegram, and email, and that allows a conversation to switch seamlessly from one platform to another. This setup enabled the collection of complete conversations, including all multimedia content received from fraudsters. To evaluate the collected dataset, we conducted a comparative analysis with Logistic Regression and XGBoost classifiers, comparing the SCC with the publicly available Scam-baiting Dataset (ScamBait) and a newly compiled non-scam dataset. The findings indicate that models trained on the SCC generalise effectively to ScamBait, whereas models trained on ScamBait do not generalise effectively to the SCC. Finally, a qualitative analysis was conducted to uncover linguistic and structural patterns in the datasets that may account for this asymmetry in generalisability. All code for the experiments is available in a public repository under the MIT licence; the code for the data collection and the dataset itself will be made available upon request for research purposes. This thesis was submitted to the ACL Workshop on Online Abuse and Harms 2025 as a two-column long paper.

License:

In Copyright

Appears in Collections:

Thesis