Sabetghadam, S. (2017). A graph-based model for multimodal information retrieval [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.42860
E188 - Institut für Softwaretechnik und Interaktive Systeme
Date (published): 2017
Number of Pages: 141
Keywords: Multi-modal; Information Retrieval; Graph; Facet; Reachability; Poly-representation; Spreading Activation; Random Walks; Metropolis-Hastings
Abstract:
Finding useful information in large multimodal document collections such as the WWW is one of the major challenges of Information Retrieval (IR). Nowadays, the proliferation of available information sources, including text, images, audio and video, increases the need for multimodal search. Multimodal information retrieval is the search for information of any modality on the web, using unimodal or multimodal queries. For instance, a unimodal query may contain only keywords, whereas a multimodal query may be a combination of keywords, images, video clips or music files. Users have learnt to express their information need through keywords and expect the result as a combination of different modalities. Search engines like Google and Yahoo often show related videos or images in addition to the text results. Usually, in a keyword-based search, only the metadata of a video or an image (e.g. tag, caption or description) is used to find relevant results. This approach is limited to textual information and does not include information from other modalities. There are a few exceptions, such as Google image search, which uses image features to perform the image search task. If the user query is an image, or a combination of a video file and keywords, the question arises how a search engine can benefit from the different modalities in the query to retrieve multimodal results. Usually, search engines build upon text search by using non-visual information associated with visual content. This approach to multimodal search does not always yield satisfactory results, as it completely ignores the information from other modalities in ranking. To address the shortcomings of such visual search approaches, multimodal search reranking has received increasing attention in recent years.

In addition to the observation that data consumption today is highly multimodal, it is also clear that data is now heavily semantically interlinked. This can be through social networks (text, images and videos of users on LinkedIn, Facebook, or the like), or through the nature of the data itself (e.g. patent documents connected by their metadata: inventors, companies, and semantic connections via linked data). Structured data is naturally represented by a graph, where nodes denote entities and directed/undirected edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. Connected data poses a challenge to traditional IR methods, which are based on independent documents. The question arises whether structured IR can be an option for retrieving more relevant data objects.

In this thesis, we propose a graph-based model for multimodal information retrieval. We consider different relation types between information objects from different modalities. A framework is devised to leverage the individual features of different modalities as well as their combination through our formulation of faceted search. We denote an inherent feature, metadata or property of an information object as its facet. We highlight the role of different facets of the user's query in visiting different parts of the graph. We employ a potential recall analysis on a test collection and further highlight the role of multiple facets, relations between the objects, and semantic links in recall improvement.
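To make the idea of a heterogeneous, faceted graph more concrete, the following minimal sketch shows one way such a structure could be represented. The node identifiers, facet names, and relation labels here are illustrative assumptions, not the actual schema used in the thesis.

```python
# Minimal sketch (illustrative, not the thesis's data model): a heterogeneous
# multimodal graph whose nodes carry a modality and a set of facets (inherent
# features, metadata, properties), with typed edges for relations between them.
import networkx as nx

G = nx.MultiDiGraph()

# Information objects of different modalities, each with example facets.
G.add_node("doc:42", modality="text", facets={"terms", "title", "metadata"})
G.add_node("img:7", modality="image", facets={"color_hist", "caption"})
G.add_node("ent:vienna", modality="semantic", facets={"dbpedia_uri"})

# Typed relations between objects (labels are assumptions for illustration).
G.add_edge("doc:42", "img:7", relation="contains")
G.add_edge("img:7", "ent:vienna", relation="depicts")
G.add_edge("doc:42", "ent:vienna", relation="mentions")

# A faceted query (e.g. keywords plus an example image) activates only those
# nodes whose facets overlap with the facets present in the query.
query_facets = {"terms", "color_hist"}
seed_nodes = [n for n, d in G.nodes(data=True) if d["facets"] & query_facets]
print(seed_nodes)  # starting points for graph traversal / spreading activation
```

In this sketch, the choice of facets in the query determines which nodes become seeds, which is one way of reading the abstract's claim that different query facets lead to visiting different parts of the graph.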
Furthermore, we perform an analysis of the score distribution on the graph after a large number of steps to investigate the role of different facets and link types in the final performance of the system. The experiments are conducted on the ImageCLEF 2011 Wikipedia collection, a multimodal benchmark dataset containing approximately 400,000 documents and images.
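As a rough illustration of what a long-run score distribution on such a graph looks like, the sketch below runs a simple random walk with restart on a toy adjacency matrix by power iteration. The adjacency weights, restart probability, and seed vector are assumptions for demonstration only, not parameters or results from the thesis.

```python
# Illustrative sketch: long-run score distribution of a random walk with
# restart on a tiny toy graph, computed by power iteration.
import numpy as np

A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])           # toy adjacency (3 information objects)
P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix

alpha = 0.15                              # restart (teleport) probability, assumed
seed = np.array([1.0, 0.0, 0.0])          # activation starts at the query's seed node

scores = seed.copy()
for _ in range(1000):                     # "large number of steps"
    scores = alpha * seed + (1 - alpha) * scores @ P

print(scores / scores.sum())              # converged score distribution over nodes
```

Inspecting how this distribution shifts when edge types are added, removed, or reweighted is one simple way to probe the influence of different facets and link types on the final ranking.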