Vonderlind, P. (2024). Automatically Testing the Out-of-Distribution Reasoning Capabilities of LLMs with Generative Formal Games [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.118784
Large Language Models; Reasoning; Benchmark; Out-of-distribution
Abstract:
In recent years, large language models have become prominent tools for a variety of reasoning tasks, ranging from coding to solving puzzles. However, as most of these models are offered as black-box online services, testing their reasoning performance on traditional benchmark datasets may not reflect their true capabilities, since public data may have been memorized during training. To address this problem, the main contribution of this thesis is a framework that automatically generates structured datasets of formal games, which can be used to evaluate the out-of-distribution capabilities of language models offered as a service. The games generated by the framework are based on a novel domain-specific language I call Grid-Games. Furthermore, I introduce a complexity metric that categorizes each generated game by its intrinsic task difficulty. To test the distribution shift between known and generated games, I conduct experiments on three prominent language models and compare their performance. The main finding is a large performance gap between the newly generated Grid-Games, which are not included in any training data, and the well-known game of Tic-Tac-Toe, which serves as an exemplary game that was likely part of the training data of all tested large language models.
Additional information:
Thesis not yet received by the library - data not verified. Deviating title according to the author's translation.