Faithful Attention Attribution in Vision Transformers for Chest X-Ray Interpretation

Sula, Julius

doi:10.34726/hss.2026.132361

Record link:

https://doi.org/10.34726/hss.2026.132361
http://hdl.handle.net/20.500.12708/226961

Title:

Faithful Attention Attribution in Vision Transformers for Chest X-Ray Interpretation

Citation:

Sula, J. (2026). Faithful Attention Attribution in Vision Transformers for Chest X-Ray Interpretation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.132361

reposiTUm DOI:

10.34726/hss.2026.132361

CatalogPlus:

AC17802980

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Sula, Julius

Advisor:

Lukasiewicz, Thomas

Co-advisor:

Menzat, Bayar Ilhan

Organisational Unit:

E192 - Institut für Logic and Computation

Date (published):

2026

Number of Pages:

Keywords:

Vision Transformers; Attribution Methods; Faithfulness; Interpretability; Attention-Based Explanations; Sparse Autoencoders; Feature-Level Interpretability; Mechanistic Interpretability

Abstract:

Vision Transformers (ViTs) achieve strong performance in natural and medical imaging, yet their decision processes remain opaque. This is especially problematic in high-stakes settings like chest X-ray interpretation. TransMM is among the strongest attribution methods for ViTs, combining attention with class-specific gradients to highlight influential image patches. We ask whether injecting semantic structure from Sparse Autoencoders (SAEs) can further improve the faithfulness of such attributions.

We introduce Feature-Gradient Attribution, which extends TransMM’s principle from attention space to feature space. SAEs are trained on residual streams to decompose activations into sparse, interpretable features, providing per-patch feature activations. We project gradients onto the SAE feature basis and compute feature-gradient scores that capture both which learned features are present and how they influence the target logit. These scores yield per-patch gates that modulate TransMM’s attention maps before relevance propagation, forming a lightweight, semantically informed correction.

Across three datasets (chest X-rays, endoscopy, natural images), two architectures (finetuned ViT-B/16 and contrastively pre-trained CLIP ViT-B/32), and three complementary faithfulness metrics, our method improves attribution faithfulness consistently. Improvements are statistically significant (p < 0.001) on all three metrics for one dataset and on two of three metrics for the remaining datasets. We observe gains of 10.5–34.8% on SaCo and 9.7–43.0% on Faithfulness Correlation, with Pixel Flipping improving by 1.8–10.8%. Notably, we never observe degradation relative to TransMM on any metric–dataset combination.
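
Illustrative sketch (not part of the thesis record): the abstract describes the gating idea only in words, so the following NumPy code is one possible reading of it, not the thesis's actual formulation. Everything specific here is an assumption made for illustration: the function names `feature_gradient_gates` and `gated_transmm_relevance`, the use of the SAE decoder matrix as the "feature basis", the `1 + tanh` squashing that maps scores to gates, the `temperature` parameter, and the choice to apply gates to the key (column) dimension of the attention map. The relevance update itself follows the commonly used TransMM-style rule, in which the head-averaged positive part of gradient times attention is accumulated into a relevance matrix.

```python
import numpy as np


def feature_gradient_gates(feats, grads, decoder, temperature=1.0):
    """Per-patch gates from SAE features and logit gradients (assumed form).

    feats:   (N, F) SAE feature activations for N patches.
    grads:   (N, d) gradient of the target logit w.r.t. the residual stream.
    decoder: (F, d) SAE decoder directions, used here as the feature basis.
    """
    # Project residual-stream gradients onto the feature directions.
    feat_grads = grads @ decoder.T                      # (N, F)
    # Score = how active each feature is times how it moves the logit.
    scores = (feats * feat_grads).sum(axis=1)           # (N,)
    # Normalise and squash (assumption): gates lie strictly between 0 and 2,
    # and a zero score leaves the patch untouched (gate == 1).
    scale = np.abs(scores).max() + 1e-8
    return 1.0 + np.tanh(scores / (temperature * scale))


def gated_transmm_relevance(attns, attn_grads, gates_per_layer):
    """TransMM-style relevance propagation with optional per-patch gates.

    attns, attn_grads: lists of (H, T, T) arrays, one per layer, where
                       T = N + 1 tokens and index 0 is the CLS token.
    gates_per_layer:   list of (N,) gate arrays, or None for ungated layers.
    Returns the CLS-to-patch relevance vector of shape (N,).
    """
    num_tokens = attns[0].shape[-1]
    relevance = np.eye(num_tokens)
    for attn, grad, gates in zip(attns, attn_grads, gates_per_layer):
        # Head-averaged positive part of gradient * attention.
        a_bar = np.clip(grad * attn, 0.0, None).mean(axis=0)  # (T, T)
        if gates is not None:
            # Assumed placement: scale attention paid *to* each patch;
            # the CLS column is left unchanged.
            column_gate = np.ones(num_tokens)
            column_gate[1:] = gates
            a_bar = a_bar * column_gate[None, :]
        relevance = relevance + a_bar @ relevance
    return relevance[0, 1:]
```

In this sketch, passing `None` (or a vector of ones) as the gates for every layer reduces the computation to the ungated propagation, which makes it easy to compare gated and ungated relevance maps on the same inputs.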

Additional information:

Thesis not yet received by the library - data not verified

License:

In Copyright

Appears in Collections:

Thesis