<div class="csl-bib-body">
<div class="csl-entry">Emde, C., Paren, A. J., Arvind, P., Kayser, M. G., Rainforth, T., Lukasiewicz, T., Torr, P., & Bibi, A. (2025). Shh, don’t say that! Domain Certification in LLMs. In <i>The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025</i>. Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/223780
-
dc.description.abstract
Large language models (LLMs) are often deployed to perform constrained tasks with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are susceptible to adversarial inputs and may generate outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification: a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID, that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which tightly bound the probability of out-of-domain samples with minimal penalty to refusal behavior.
en
dc.language.iso
en
-
dc.subject
large language models
en
dc.subject
natural language processing
en
dc.subject
Adversarial Robustness
en
dc.subject
adversary
en
dc.subject
natural text generation
en
dc.subject
certification
en
dc.subject
verification
en
dc.title
Shh, don't say that! Domain Certification in LLMs
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.affiliation
Science Oxford, United Kingdom of Great Britain and Northern Ireland (the)
-
dc.contributor.affiliation
Mineralogical Society of the United Kingdom and Ireland, United Kingdom of Great Britain and Northern Ireland (the)
-
dc.type.category
Full-Paper Contribution
-
tuw.booktitle
The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025