Hofstätter, S. (2018). Adaptierung von Word Embeddings für domänenspezifisches Information Retrieval [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2018.50325
E188 - Institut für Softwaretechnik und Interaktive Systeme
Date (published):
2018
Number of Pages:
68
Keywords:
Information Retrieval; Word Embeddings; Word2Vec; Global context; Related terms
Abstract:
Search engines rank documents based on their relevance to a given query; using only exact word matches can miss relevant results. Expanding a document retrieval query with similar words obtained from a word embedding therefore offers great potential for better query results: the enlarged search space makes it possible to retrieve relevant documents even if they do not contain the original query terms. An additional word improves the query results only if it is relevant to the topic of the search, and, as previous studies have observed, an essential problem in using an out-of-the-box word embedding for document retrieval is that some of the added similar words harm retrieval performance. We create word-embedding-based similarity models, which are used to expand query words in domain-specific Information Retrieval. To this end, we adapt an existing word embedding with additional information gained from different contexts, incorporating the external resources into a Skip-gram word embedding with Retrofitting. We experiment with different external resources, namely Latent Semantic Indexing and semantic lexicons, and also study various techniques for combining two different external resources. We first analyze changes in the local neighborhoods of query terms and global differences between the original and the retrofitted vector spaces, and then evaluate the effect of the changed word embeddings on domain-specific retrieval test collections. We show that in two out of three test collections, incorporating external resources significantly improves the results over using an out-of-the-box word embedding.
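
To make the expansion idea concrete, the following is a minimal Python sketch of embedding-based query expansion as the abstract describes it: each query term is expanded with its nearest neighbors in the vector space. The function name expand_query, the parameters k and threshold, and the assumption of unit-normalized vectors are illustrative choices, not taken from the thesis; the similarity threshold is one simple guard against the off-topic expansion terms identified above as the main risk of an out-of-the-box embedding.

    import numpy as np

    def expand_query(query_terms, vectors, k=3, threshold=0.6):
        # Illustrative sketch, not the thesis' implementation.
        # vectors: dict mapping a word to its unit-normalized numpy vector,
        # e.g. taken from a trained Skip-gram model.
        expanded = list(query_terms)
        for term in query_terms:
            if term not in vectors:
                continue
            v = vectors[term]
            # On unit vectors, cosine similarity reduces to a dot product.
            sims = {w: float(np.dot(v, u)) for w, u in vectors.items() if w != term}
            neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
            # Keep only neighbors similar enough to plausibly stay on topic.
            expanded.extend(w for w in neighbors if sims[w] >= threshold)
        return expanded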
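
Retrofitting itself is the post-processing technique of Faruqui et al. (2015): each vector is iteratively pulled toward the vectors of its neighbors in an external resource while staying anchored to its original position. A minimal sketch of that update, with uniform weights alpha and beta as a simplifying assumption rather than the exact setup of the thesis, could look like this:

    import numpy as np

    def retrofit(vectors, lexicon, iterations=10, alpha=1.0, beta=1.0):
        # Illustrative sketch of the standard retrofitting update.
        # vectors: dict word -> numpy vector (the original Skip-gram space).
        # lexicon: dict word -> list of related words (the external resource).
        new_vecs = {w: v.copy() for w, v in vectors.items()}
        for _ in range(iterations):
            for word, related in lexicon.items():
                nbrs = [n for n in related if n in new_vecs]
                if word not in new_vecs or not nbrs:
                    continue
                # Weighted average of the word's original vector (weight alpha)
                # and the current vectors of its lexicon neighbors (weight beta).
                total = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
                new_vecs[word] = total / (alpha + beta * len(nbrs))
        return new_vecs

After a few iterations, words linked in the external resource move closer together while the overall structure of the space is largely preserved, which corresponds to the local-neighborhood and global-difference analysis the abstract describes.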