Vonderlind, P. (2024). Automatically Testing the Out-of-Distribution Reasoning Capabilities of LLMs with Generative Formal Games [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.118784
Large Language Models; Reasoning; Benchmark; Out-of-distribution
Abstract:
In recent years, large language models have become prominent tools for a variety of reasoning tasks, ranging from coding to solving puzzles. However, as most of these models are offered as black-box online services, testing their reasoning performance on traditional benchmark datasets may not reflect their true capabilities, since public data may have been memorized during training. To address this problem, the main contribution of this thesis is a framework that automatically generates structured datasets of formal games, which can be used to evaluate the out-of-distribution capabilities of language models offered as a service. The games generated by the framework are based on a novel domain-specific language I call Grid-Games. Furthermore, I introduce a complexity metric that categorizes each generated game by its intrinsic task difficulty. To test the distribution shift between known and generated games, I conduct experiments on three prominent language models and compare their performance. The main finding is a large performance gap between the newly generated Grid-Games, which are not included in any training data, and the well-known game of Tic-Tac-Toe, which serves as an exemplary game that was likely part of the training data of all tested large language models.
Additional information:
Thesis not yet received by the library - data not verified. Deviating title according to the author's translation.