<div class="csl-bib-body">
<div class="csl-entry">Mathis, T. G. (2025). <i>Identifying Data Contamination in LLMs for Mathematical Benchmarks</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.129501</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2025.129501
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/216244
-
dc.description
Thesis not yet received by the library - data not verified
-
dc.description
Title differs from the original, as translated by the author
-
dc.description.abstract
Large language models (LLMs) have demonstrated impressive capabilities in mathematical reasoning tasks. However, concerns persist around data contamination, where benchmark problems used for evaluation have appeared in the model's pretraining data. Such contamination can artificially inflate performance metrics, particularly in domains where genuine reasoning must be distinguished from memorization. This thesis introduces MathCONTA, a novel mathematical dataset for contamination detection. It spans multiple domains, including algebra, number theory, combinatorics, and integration, and covers various difficulty levels, from simple word problems to advanced math contest problems. MathCONTA consists of two balanced subsets: one containing problems known to have been seen during pretraining (contaminated) and another with entirely novel problems (uncontaminated). In contrast to previous studies that simulate contamination via fine-tuning, MathCONTA reflects how contamination arises naturally during pretraining, offering a more realistic basis for evaluation. Using this dataset, we evaluate several representative detection methods, including minK%, n-gram accuracy, and Contamination Detection via output Distribution (CDD), spanning confidence-based and memorization-based approaches. Our findings show, perhaps surprisingly, that none of the tested techniques reliably distinguishes between contaminated and uncontaminated items. Moreover, combining these methods does not significantly improve detection performance. We hypothesize that this is because MathCONTA requires detecting subtle, incidental mathematical contamination arising naturally during large-scale pretraining, rather than the more obvious contamination introduced through fine-tuning.
Finally, we analyze the downstream impact of contamination on model accuracy and find that while it can lead to modest gains at specific difficulty levels, it is unlikely to be the primary factor behind recent advances in LLM-based mathematical reasoning. To support transparency and future research, MathCONTA and all accompanying code for experiments will be made openly available.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Data Contamination
en
dc.subject
LLM
en
dc.subject
LLMs
en
dc.subject
Benchmarks
en
dc.subject
Evaluation
en
dc.subject
Contamination Detection Methods
en
dc.subject
Membership inference
en
dc.subject
Mathematical Benchmarks
en
dc.subject
MathCONTA
en
dc.title
Identifying Data Contamination in LLMs for Mathematical Benchmarks
en
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2025.129501
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Tobias Gallus Mathis
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
tuw.publication.orgunit
E192 - Institut für Logic and Computation
-
dc.type.qualificationlevel
Diploma
-
dc.identifier.libraryid
AC17562735
-
dc.description.numberOfPages
77
-
dc.thesistype
Diplomarbeit
de
dc.thesistype
Diploma Thesis
en
dc.rights.identifier
In Copyright
en
dc.rights.identifier
Urheberrechtsschutz
de
tuw.advisor.staffStatus
staff
-
item.openaccessfulltext
Open Access
-
item.grantfulltext
open
-
item.openairecristype
http://purl.org/coar/resource_type/c_bdcc
-
item.fulltext
with Fulltext
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.openairetype
master thesis
-
crisitem.author.dept
E101 - Institut für Analysis und Scientific Computing