Eder, C. (2025). Honeypot LLM: Creation of the Scam Conversation Corpus [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.122387
This paper presents the Scam Conversation Corpus (SCC), a unique dataset comprising conversations between GPT-4o, acting as a potential fraud victim, and genuine fraudsters. The dataset was created using a honeypot approach: a social media presence was established to attract fraudsters. We developed a response application that links the Large Language Model (LLM) to the communication platforms Instagram, Telegram, and email. This application not only connects these platforms but also allows seamless switching between them during a conversation. This setup enabled the collection of conversations, including all multimedia content received from fraudsters. To evaluate the collected dataset, we conducted a comparative analysis using Logistic Regression and XGBoost, alongside the publicly available Scam-baiting Dataset (ScamBait) and a newly compiled non-scam dataset. The findings indicate that models trained on the SCC generalise effectively to ScamBait, whereas the reverse is not true. Finally, a qualitative analysis was conducted to uncover linguistic and structural patterns in the datasets that may account for this asymmetry in generalisability. All code for the experiments is available in a public repository under the MIT licence; the code for the data collection, as well as the dataset itself, will be made available upon request for research purposes. This thesis was submitted to the ACL Workshop on Online Abuse and Harms 2025 as a two-column long paper.
en
Additional information:
Thesis not yet received by the library; data not verified. Title differs from the original, according to the author's translation.