Comparative analysis of fashion captioning and multimodal fashion recommendation

Rippberger Fonseca, Gwendolyn

doi:10.34726/hss.2025.115007

Record citation link:

https://doi.org/10.34726/hss.2025.115007
http://hdl.handle.net/20.500.12708/215651

Title:

Comparative analysis of fashion captioning and multimodal fashion recommendation

Citation:

Rippberger Fonseca, G. (2025). Comparative analysis of fashion captioning and multimodal fashion recommendation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.115007

reposiTUm-DOI:

10.34726/hss.2025.115007

CatalogPlus:

AC17527743

Publication type:

Thesis - Diploma thesis

Language:

English

Authors:

Rippberger Fonseca, Gwendolyn

Supervisor:

Neidhardt, Julia

Organizational unit:

E194 - Institut für Information Systems Engineering

Date (published):

2025

Extent:

Keywords:

Fashion Recommender Systems; Fashion Captioning; Multimodal Recommendations

Abstract:

This thesis explores two main tasks: (1) fine-tuning image captioning models on fashion datasets and (2) evaluating different feature spaces for personalized fashion recommendations. We fine-tune two state-of-the-art vision-language models, BLIP-2 and LLaVA, on two fashion datasets, H&M and FACAD, to generate product descriptions. Our quantitative and qualitative analyses show that fine-tuning can achieve performance comparable to fully training a model (SRFC) specifically for generating "fashion captions".
In our qualitative analysis of the captioning results, we examine the models' limitations and identify what works well and what does not. We find that datasets in which words have clearly identifiable visual cues (e.g., front pocket) can improve the fine-tuning process. The models struggled with non-visual attributes (e.g., material composition, designer names), with fine-grained distinctions (e.g., satin vs. velvet), and with partial or ambiguous product images. These limitations highlight the need for dataset curation that emphasizes visible attributes.
For recommendations, we extract visual, textual, and combined (multimodal) features and evaluate them using the VBPR recommendation algorithm on the H&M dataset. Besides established embedding models such as ResNet50 (visual features) and SentenceBERT (textual features), we use our BLIP-2 model fine-tuned on the H&M dataset to extract additional features, which we hypothesized would work better. Surprisingly, textual embeddings performed better than visual and multimodal features with VBPR, suggesting that, in this setup, text-based attributes provide better signals for recommendations than image features. However, overall performance across the feature spaces remains similar, and ItemKNN outperforms VBPR.
Our findings demonstrate that fine-tuning is an effective and simpler alternative to complex reward-based training. Additionally, despite fashion being a visual domain, textual descriptions resulted in the best recommendation performance. Future work should focus on evaluating already available models on fashion datasets and on refining datasets for better performance.
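As a rough illustration of how item features from different spaces can enter a VBPR-style model, the following sketch combines visual and textual embeddings and computes the standard VBPR preference score. The abstract does not state how the combined features were built, so the fusion scheme here (L2-normalize each modality, then concatenate) is an assumption, and all function names, dimensions, and the random placeholder data are hypothetical rather than taken from the thesis.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so no modality dominates by magnitude."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def combine_features(visual, textual):
    """Assumed fusion scheme: concatenate the normalized per-item embeddings."""
    return np.concatenate([l2_normalize(visual), l2_normalize(textual)], axis=1)


def vbpr_score(alpha, beta_u, beta_i, gamma_u, gamma_i, theta_u, E, beta_f, f_i):
    """VBPR preference score of one user for one item.

    alpha, beta_u, beta_i : global, user, and item bias terms (scalars)
    gamma_u, gamma_i      : latent user and item factors, shape (K,)
    theta_u               : user's content-preference factors, shape (K2,)
    E                     : projection of item features, shape (K2, F)
    beta_f                : feature bias vector, shape (F,)
    f_i                   : item feature vector (visual, textual, or combined), shape (F,)
    """
    return (alpha + beta_u + beta_i
            + gamma_u @ gamma_i
            + theta_u @ (E @ f_i)
            + beta_f @ f_i)


# Placeholder embeddings standing in for, e.g., image and text encoders.
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))    # 5 items, 8-dim visual features
textual = rng.normal(size=(5, 4))   # 5 items, 4-dim textual features
combined = combine_features(visual, textual)  # shape (5, 12)

K, K2, F = 3, 2, combined.shape[1]
score = vbpr_score(
    alpha=0.1, beta_u=0.0, beta_i=0.2,
    gamma_u=rng.normal(size=K), gamma_i=rng.normal(size=K),
    theta_u=rng.normal(size=K2), E=rng.normal(size=(K2, F)),
    beta_f=rng.normal(size=F), f_i=combined[0],
)
```

Swapping `combined[0]` for a row of `visual` or `textual` (with `E` and `beta_f` resized to match) gives the single-modality variants, which is one simple way to compare feature spaces under the same scoring function.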

License:

Copyright protected

Appears in collections:

Thesis