Comparative analysis of fashion captioning and multimodal fashion recommendation

Rippberger Fonseca, Gwendolyn

doi:10.34726/hss.2025.115007

DC Field

Value

Language

dc.contributor.advisor

Neidhardt, Julia

dc.contributor.author

Rippberger Fonseca, Gwendolyn

dc.date.accessioned

2025-05-23T06:50:33Z

dc.date.issued

2025

dc.date.submitted

2025-04

dc.identifier.citation

Rippberger Fonseca, G. (2025). Comparative analysis of fashion captioning and multimodal fashion recommendation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.115007

dc.identifier.uri

https://doi.org/10.34726/hss.2025.115007

dc.identifier.uri

http://hdl.handle.net/20.500.12708/215651

dc.description.abstract

This thesis explores two main tasks: (1) fine-tuning image captioning models on fashion datasets and (2) evaluating different feature spaces for personalized fashion recommendation. We fine-tune two state-of-the-art vision-language models, BLIP-2 and LLaVA, on two fashion datasets, H&M and FACAD, to generate product descriptions. Our quantitative and qualitative analyses show that fine-tuning can achieve performance comparable to that of a model (SRFC) fully trained specifically for generating "fashion captions".
In our qualitative analysis of the captioning results, we examine the models' limitations in depth and identify what works well and what does not. We find that datasets in which words have clearly identifiable visual cues (e.g., front pocket) can improve the fine-tuning process. The models struggled with non-visual attributes (e.g., material composition, designer names), with fine-grained distinctions (e.g., satin vs. velvet), and with partial or ambiguous product images. These limitations highlight the need for dataset curation that emphasizes visible attributes.
For recommendation, we extract visual, textual, and combined (multimodal) features and evaluate them using the VBPR recommendation algorithm on the H&M dataset. In addition to established embedding models such as ResNet50 (visual features) and SentenceBERT (textual features), we use our BLIP-2 model fine-tuned on the H&M dataset to extract additional features, which we hypothesized would work better. Surprisingly, textual embeddings outperformed visual and multimodal features with VBPR, suggesting that, in this setup, text-based attributes provide better signals for recommendation than image features. However, overall performance across the feature spaces remains similar, and ItemKNN outperforms VBPR.
Our findings demonstrate that fine-tuning is an effective and simpler alternative to complex reward-based training. Additionally, despite fashion being a visual domain, textual descriptions yielded the best recommendation performance. Future work should focus on evaluating already available models on fashion datasets and on refining datasets for better performance.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Fashion Recommender Systems

dc.subject

Fashion Captioning

dc.subject

Multimodal Recommendations

dc.title

Comparative analysis of fashion captioning and multimodal fashion recommendation

dc.type

Thesis

dc.type

Academic thesis (Hochschulschrift)

dc.rights.license

In Copyright

dc.identifier.doi

10.34726/hss.2025.115007

dc.contributor.affiliation

TU Wien, Austria

dc.rights.holder

Gwendolyn Rippberger Fonseca

dc.publisher.place

Vienna

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

tuw.publication.orgunit

E194 - Institut für Information Systems Engineering

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC17527743

dc.description.numberOfPages

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

tuw.advisor.staffStatus

staff

tuw.advisor.orcid

0000-0001-7184-1841

item.cerifentitytype

Publications

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

item.grantfulltext

open

item.openairetype

master thesis

item.fulltext

with Fulltext

item.languageiso639-1

item.mimetype

application/pdf

crisitem.author.dept

E194-04 - Forschungsbereich Data Science

crisitem.author.parentorg

E194 - Institut für Information Systems Engineering

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(2.63 MB)

In Copyright

Page view(s): 109 (checked on May 23, 2025)

Download(s): 246 (checked on May 23, 2025)
