Hofstätter, S. (2018). Adaptierung von Word Embeddings für domänenspezifisches Information Retrieval [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2018.50325
E188 - Institut für Softwaretechnik und Interaktive Systeme
Date (published):
2018
Number of Pages:
68
Keywords:
Information Retrieval; Word Embeddings; Word2Vec; Global context; Related terms
Abstract:
Search engines rank documents based on their relevance to a given query; using only exact word matches can miss relevant results. Expanding a document retrieval query with similar words obtained from a word embedding therefore offers great potential for better query results: the enlarged search space makes it possible to retrieve relevant documents even if they do not contain the original query terms. An additional word improves the query results only if it is relevant to the topic of the search, and, as previous studies have observed, an essential problem in using an out-of-the-box word embedding for document retrieval is that some of the added similar words harm retrieval performance. We create word-embedding-based similarity models, which are used to expand query words in domain-specific Information Retrieval. To this end, we adapt an existing word embedding with additional information gained from different contexts, incorporating the external resources into a Skip-gram word embedding with Retrofitting. We experiment with different external resources, namely Latent Semantic Indexing and semantic lexicons, and also study various techniques for combining two different external resources. We first analyze changes in the local neighborhoods of query terms and global differences between the original and the retrofitted vector spaces, and then evaluate the effect of the changed word embeddings on domain-specific retrieval test collections. We show that in two out of three test collections, incorporating external resources significantly improves the results over using an out-of-the-box word embedding.
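
To make the expansion idea concrete, the following is a minimal Python sketch of embedding-based query expansion as the abstract describes it: each query term is expanded with its nearest neighbors in the vector space. The function name expand_query, the parameters k and threshold, and the assumption of unit-normalized vectors are illustrative choices, not taken from the thesis; the similarity threshold is one simple guard against the off-topic expansion terms identified above as the main risk of an out-of-the-box embedding.

    import numpy as np

    def expand_query(query_terms, vectors, k=3, threshold=0.6):
        # Illustrative sketch, not the thesis' implementation.
        # vectors: dict mapping a word to its unit-normalized numpy vector,
        # e.g. taken from a trained Skip-gram model.
        expanded = list(query_terms)
        for term in query_terms:
            if term not in vectors:
                continue
            v = vectors[term]
            # On unit vectors, cosine similarity reduces to a dot product.
            sims = {w: float(np.dot(v, u)) for w, u in vectors.items() if w != term}
            neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
            # Keep only neighbors similar enough to plausibly stay on topic.
            expanded.extend(w for w in neighbors if sims[w] >= threshold)
        return expanded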
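
Retrofitting itself is the post-processing technique of Faruqui et al. (2015): each vector is iteratively pulled toward the vectors of its neighbors in an external resource while staying anchored to its original position. A minimal sketch of that update, with uniform weights alpha and beta as a simplifying assumption rather than the exact setup of the thesis, could look like this:

    import numpy as np

    def retrofit(vectors, lexicon, iterations=10, alpha=1.0, beta=1.0):
        # Illustrative sketch of the standard retrofitting update.
        # vectors: dict word -> numpy vector (the original Skip-gram space).
        # lexicon: dict word -> list of related words (the external resource).
        new_vecs = {w: v.copy() for w, v in vectors.items()}
        for _ in range(iterations):
            for word, related in lexicon.items():
                nbrs = [n for n in related if n in new_vecs]
                if word not in new_vecs or not nbrs:
                    continue
                # Weighted average of the word's original vector (weight alpha)
                # and the current vectors of its lexicon neighbors (weight beta).
                total = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
                new_vecs[word] = total / (alpha + beta * len(nbrs))
        return new_vecs

After a few iterations, words linked in the external resource move closer together while the overall structure of the space is largely preserved, which corresponds to the local-neighborhood and global-difference analysis the abstract describes.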