Rippberger Fonseca, G. (2025). Comparative Analysis of Fashion Captioning and Multimodal Fashion Recommendation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.115007
This thesis explores two main tasks: (1) fine-tuning image captioning models for fashion datasets and (2) evaluating different feature spaces for personalized fashion recommendations. We fine-tune two state-of-the-art vision-language models, BLIP-2 and LLaVA, on two fashion datasets, H&M and FACAD, to generate product descriptions. Our quantitative and qualitative analyses show that fine-tuning can achieve performance comparable to that of a model (SRFC) fully trained specifically for generating "fashion captions".

In our qualitative analysis of the captioning results, we examine the models' limitations and identify what works well and what does not. We find that datasets whose words have clearly identifiable visual cues, e.g., front pocket, can improve the fine-tuning process. The models struggled with non-visual attributes (e.g., material composition, designer names), with distinguishing fine-grained differences (e.g., satin vs. velvet), and with partial or ambiguous product images. These limitations highlight the need for dataset curation that emphasizes visible attributes.

For recommendations, we extract multimodal features (visual, textual, and combined) and evaluate them using the VBPR recommendation algorithm on the H&M dataset. Besides established feature-embedding models such as ResNet50 (visual features) and SentenceBERT (textual features), we use our BLIP-2 model fine-tuned on the H&M dataset to extract additional features, which we hypothesized would work better. Surprisingly, textual embeddings performed better than visual and multimodal features with VBPR, suggesting that, in this setup, text-based attributes provide better signals for recommendations than image features. However, overall performance across the different feature spaces remains similar, and ItemKNN outperforms VBPR.

Our findings demonstrate that fine-tuning is an effective and simpler alternative to complex reward-based training. Additionally, despite fashion being a visual domain, textual descriptions resulted in the best recommendation performance. Future work should focus on exploring the performance of already available models on fashion datasets and on refining datasets for better performance.
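As a minimal illustration of how VBPR consumes the item features compared above, the sketch below computes the VBPR preference score in plain Python, following the standard formulation of the model (global offset, user and item biases, latent-factor term, projected-feature term, and feature bias). The function names are hypothetical, all parameter values are toy numbers, and forming the combined feature by concatenation is an assumption; it is not taken from the thesis, whose implementation and fusion method are not described here.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def project(E, f):
    """Map a raw item feature vector f (length F) into a D-dimensional
    space using the embedding matrix E (D rows, F columns)."""
    return [dot(row, f) for row in E]


def vbpr_score(alpha, beta_u, beta_i, gamma_u, gamma_i,
               theta_u, E, beta_prime, f_i):
    """VBPR preference of user u for item i:
    alpha + beta_u + beta_i + gamma_u.gamma_i
          + theta_u.(E f_i) + beta_prime.f_i
    """
    return (alpha + beta_u + beta_i
            + dot(gamma_u, gamma_i)
            + dot(theta_u, project(E, f_i))
            + dot(beta_prime, f_i))


# Toy item features (hypothetical values).
visual = [0.5, 1.0]          # e.g. a 2-d stand-in for an image embedding
textual = [2.0]              # e.g. a 1-d stand-in for a text embedding
combined = visual + textual  # assumption: fusion by concatenation

score = vbpr_score(
    alpha=0.0, beta_u=0.1, beta_i=0.2,
    gamma_u=[1.0, 0.0], gamma_i=[0.5, 0.5],
    theta_u=[1.0],
    E=[[1.0, 0.0, 0.5]],        # 1 x 3 projection for the combined feature
    beta_prime=[0.0, 0.1, 0.0],
    f_i=combined,
)
print(round(score, 6))
```

Switching between the visual, textual, and combined feature spaces only changes `f_i` and, with it, the column count of `E` and the length of `beta_prime`; the rest of the scoring function stays the same.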
en
Additional information:
Thesis not yet received by the library; data not verified. Alternate title as translated by the author.