Vonderlind, P. (2024). Automatically Testing the Out-of-Distribution Reasoning Capabilities of LLMs with Generative Formal Games [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.118784
-
dc.identifier.uri
https://doi.org/10.34726/hss.2024.118784
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/202524
-
dc.description
Thesis not yet received by the library - data not verified
-
dc.description
Alternative title as translated by the author
-
dc.description.abstract
In recent years, large language models have become prominent tools for a variety of reasoning tasks, ranging from coding to solving puzzles. However, as most of these models are offered as black-box online services, testing their reasoning performance on traditional benchmark datasets may not reflect their true capabilities, since public data may have been memorized during training. To address this problem, my main contribution is a framework that automatically generates structured datasets of formal games, which can be used to evaluate the out-of-distribution reasoning capabilities of language models offered as a service. The games generated by the framework are based on a novel domain-specific language that I call Grid-Games. Furthermore, I introduce a complexity metric that categorizes each generated game by its intrinsic task difficulty. To measure the distribution shift between known and generated games, I conduct experiments on three prominent language models and compare their performances. The main finding is a large performance gap between the newly generated Grid-Games, which are not included in any training data, and the known game of Tic-Tac-Toe, which I use as an exemplary game that was likely part of the training data of all tested large language models.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Large Language Models
en
dc.subject
Reasoning
en
dc.subject
Benchmark
en
dc.subject
Out-of-distribution
en
dc.title
Automatically Testing the Out-of-Distribution Reasoning Capabilities of LLMs with Generative Formal Games