Süss, M. (2026). Faithfulness of Natural Language Explanations for Vision Language Models: An Automated Test Framework [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.132363
Natural-language explanations (NLEs) produced by vision-language models (VLMs) are often treated as indicators of transparency, yet their faithfulness, i.e., whether they truly reflect the factors driving model decisions, remains largely unexamined. This thesis introduces a counterfactual, intervention-based framework for evaluating explanation faithfulness in multimodal settings, extending prior text-only approaches to object-level image interventions. We derive two metrics, the visual counterfactual test (vCT) and the visual correlational counterfactual test (vCCT), and support scalable evaluation through an automated pipeline that generates high-quality counterfactual image pairs via semantic object removal. Applying this framework to seven state-of-the-art VLMs on SNLI-VE, we find that explanations are often, but not always, faithful. While faithfulness scores are generally high, unfaithful cases are consistently observed. Moreover, the choice of prompting strategy matters: predict-then-explain prompting yields more faithful explanations than explain-then-predict. These findings highlight both the promise and the limitations of current VLM explanations and emphasize the need for rigorous faithfulness evaluation before relying on them in high-stakes applications.
Additional information:
Thesis not yet received at the library - data not verified