Mathis, T. G. (2025). Identifying Data Contamination in LLMs for Mathematical Benchmarks [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.129501
Large language models (LLMs) have demonstrated impressive capabilities on mathematical reasoning tasks. However, concerns persist about data contamination, where benchmark problems used for evaluation have appeared in a model's pretraining data. Such contamination can artificially inflate performance metrics, particularly in domains where genuine reasoning must be distinguished from memorization. This thesis introduces MathCONTA, a novel mathematical dataset for contamination detection. It spans multiple domains, including algebra, number theory, combinatorics, and integration, and covers a range of difficulty levels, from simple word problems to advanced math contest problems. MathCONTA consists of two balanced subsets: one containing problems known to have been seen during pretraining (contaminated) and one containing entirely novel problems (uncontaminated). In contrast to previous studies that simulate contamination via fine-tuning, MathCONTA reflects how contamination arises naturally during pretraining, offering a more realistic basis for evaluation. Using this dataset, we evaluate several representative detection methods spanning confidence-based and memorization-based approaches, including minK%, n-gram accuracy, and Contamination Detection via output Distribution (CDD). Our findings show, perhaps surprisingly, that none of the tested techniques reliably distinguishes contaminated from uncontaminated items. Moreover, combining these methods does not significantly improve detection performance. We hypothesize that this is because MathCONTA requires detecting subtle, incidental mathematical contamination arising naturally during large-scale pretraining, rather than the more obvious contamination introduced through fine-tuning. Finally, we analyze the downstream impact of contamination on model accuracy and find that, while it can yield modest gains at specific difficulty levels, it is unlikely to be the primary factor behind recent advances in LLM-based mathematical reasoning. To support transparency and future research, MathCONTA and all accompanying experiment code will be made openly available.
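To give a concrete sense of the confidence-based detectors named in the abstract, the minK% method scores a candidate text by the average log-probability a model assigns to its k% least likely tokens, on the premise that memorized (contaminated) text contains few surprising tokens. The sketch below is a minimal illustration of that general technique, assuming PyTorch and a Hugging Face causal language model; the model name "gpt2", the threshold k=0.2, and the helper name min_k_percent_score are illustrative placeholders, not the thesis's actual implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_score(text, model, tokenizer, k=0.2):
    # Mean log-probability of the k% least likely tokens in `text`.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab)
    # Log-probability each token received given its preceding context
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * token_lp.shape[1]))              # size of the bottom-k% set
    lowest = torch.topk(token_lp.squeeze(0), n, largest=False).values
    return lowest.mean().item()

# Illustrative usage: higher (less negative) scores indicate the text is more
# familiar to the model, which minK%-style detectors take as a contamination signal.
tok = AutoTokenizer.from_pretrained("gpt2")
mdl = AutoModelForCausalLM.from_pretrained("gpt2")
score = min_k_percent_score("If x + 3 = 7, then x = 4.", mdl, tok)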
Additional information:
Thesis not yet received by the library; data not yet verified. The differing title is the author's own translation.