Identifying data contamination in LLMs for mathematical benchmarks

Mathis, Tobias Gallus

doi:10.34726/hss.2025.129501

DC Field

Value

Language

dc.contributor.advisor

Lukasiewicz, Thomas

dc.contributor.author

Mathis, Tobias Gallus

dc.date.accessioned

2025-06-20T05:48:36Z

dc.date.issued

2025

dc.date.submitted

2025-05

dc.identifier.citation

Mathis, T. G. (2025). Identifying data contamination in LLMs for mathematical benchmarks [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.129501

dc.identifier.uri

https://doi.org/10.34726/hss.2025.129501

dc.identifier.uri

http://hdl.handle.net/20.500.12708/216244

dc.description.abstract

Large language models (LLMs) have demonstrated impressive capabilities in mathematical reasoning tasks. However, concerns persist around data contamination, where benchmark problems used for evaluation have appeared in the model's pretraining data. Such contamination can artificially inflate performance metrics, particularly in domains where genuine reasoning must be distinguished from memorization. This thesis introduces MathCONTA, a novel mathematical dataset for contamination detection. It spans multiple domains, including algebra, number theory, combinatorics, and integration, and covers various difficulty levels, from simple word problems to advanced math contest problems. MathCONTA consists of two balanced subsets: one containing problems known to have been seen during pretraining (contaminated) and another with entirely novel problems (uncontaminated). In contrast to previous studies that simulate contamination via fine-tuning, MathCONTA reflects how contamination arises naturally during pretraining, offering a more realistic basis for evaluation. Using this dataset, we evaluate several representative detection methods, including minK%, n-gram accuracy, and Contamination Detection via output Distribution (CDD), spanning confidence-based and memorization-based approaches. Our findings show, perhaps surprisingly, that none of the tested techniques reliably distinguish between contaminated and uncontaminated items. Moreover, combining these methods does not significantly improve detection performance. We hypothesize that this is because MathCONTA requires detecting subtle, incidental mathematical contamination arising naturally during large-scale pretraining, rather than the more obvious contamination introduced through fine-tuning.
Finally, we analyze the downstream impact of contamination on model accuracy and find that while it can lead to modest gains at specific difficulty levels, it is unlikely to be the primary factor behind recent advances in LLM-based mathematical reasoning. To support transparency and future research, MathCONTA and all accompanying code for experiments will be made openly available.
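As a rough illustration of the confidence-based family named in the abstract, the following is a minimal sketch of a minK%-style score (known in the literature as Min-K% Prob): it averages the log-probabilities of the least likely fraction k of tokens in a benchmark item, on the idea that text seen during pretraining tends to contain fewer very unlikely tokens. This is not the thesis's implementation. The function names, the default k, and the decision threshold are hypothetical, and the per-token log-probabilities are assumed to have been obtained from the model under test beforehand.

```python
import math
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the fraction k of least likely tokens.

    token_logprobs: per-token log-probabilities that the model under test
    assigns to the benchmark item (assumed to be precomputed).
    k: fraction of tokens to keep, in (0, 1]; 0.2 is an illustrative default.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so that short items still receive a score.
    n_selected = max(1, math.ceil(k * len(token_logprobs)))
    lowest = sorted(token_logprobs)[:n_selected]
    return sum(lowest) / n_selected


def flag_contaminated(
    token_logprobs: Sequence[float], k: float = 0.2, threshold: float = -2.0
) -> bool:
    """Flag an item when its score exceeds a threshold.

    The threshold is a hypothetical placeholder; in practice it would be
    calibrated on items whose contamination status is known.
    """
    return min_k_percent_score(token_logprobs, k) > threshold


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    confident_item = [-0.1, -0.2, -0.3, -0.4]
    uncertain_item = [-0.1, -3.0, -0.5, -4.0]
    print(min_k_percent_score(confident_item, k=0.5), flag_contaminated(confident_item, k=0.5))
    print(min_k_percent_score(uncertain_item, k=0.5), flag_contaminated(uncertain_item, k=0.5))
```

A higher (less negative) score means the model was confident even on its hardest tokens, which this kind of detector reads as a hint of prior exposure. The abstract's finding is that signals of this kind do not reliably separate contaminated from uncontaminated MathCONTA items.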

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Data Contamination

dc.subject

LLM

dc.subject

LLMs

dc.subject

Benchmarks

dc.subject

Evaluation

dc.subject

Contamination Detection Methods

dc.subject

Membership inference

dc.subject

Mathematical Benchmarks

dc.subject

MathCONTA

dc.title

Identifying data contamination in LLMs for mathematical benchmarks

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2025.129501

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Tobias Gallus Mathis

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

tuw.publication.orgunit

E192 - Institut für Logic and Computation

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC17562735

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

item.cerifentitytype

Publications

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

item.grantfulltext

open

item.openairetype

master thesis

item.fulltext

with Fulltext

item.languageiso639-1

item.mimetype

application/pdf

crisitem.author.dept

E101 - Institut für Analysis und Scientific Computing

crisitem.author.parentorg

E100 - Fakultät für Mathematik und Geoinformation

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(1.45 MB)

In Copyright

Download(s)

206

checked on Jun 20, 2025
