Bösch, H. (2017). Reproducible ranking lists for retrieval from evolving document collections : how column-store technology enhances the capability of inverted indicees [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.24819
E188 - Institut für Softwaretechnik und Interaktive Systeme
Information Retrieval; Text Retrieval; Reproducible Retrieval Ranked Lists; Column-Store Database; Inverted Index; Retrieval Index Performance; MonetDB; Apache Lucene; Data Citation; BM25; Wikipedia; Document Parsing; IR Prototyping
The core structure of (probabilistic) information retrieval (IR) systems lacks the ability to make retrieval result rankings reproducible. When the underlying data changes, IR indices change with it, and the history of tf-idf values in particular is hard to preserve. As a result, the same query may produce different results if the collection has been updated in the meantime. Little research has addressed reproducibility in IR, although it would be desirable in fields such as research or patent applications. A first step in this direction is to make subsets of documents in a dynamically evolving data environment unambiguously identifiable. This can be achieved with structured data and a data schema suitable for scalable data citation (cf. section 3). Such a schema maintains a history of the evolving data by tagging data records with timestamps and keeping a version history for each update to the collection. Conventional row-stores cannot cope with the large data volumes and statistics aggregations that IR applications require. The column-store architecture, by contrast, is designed for analytical workloads and has already been proposed for IR prototyping (cf. section 4.2), an approach for building retrieval indices on top of an RDBMS. This thesis combines the concepts of IR prototyping and data citation in order to make retrieval indices reproducible. It addresses how database schemas have to be shaped and whether these models are efficient enough to meet today's requirements on retrieval systems. The results are promising.
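The idea of timestamp-tagged records can be illustrated with a minimal sketch. It is not the schema or code from the thesis: the table `postings`, the columns `valid_from` and `valid_to`, and the functions `build_index`, `add_document`, and `search` are hypothetical names chosen for illustration. SQLite (a row-store from the Python standard library) stands in for a column-store such as MonetDB, and a simplified tf-idf score stands in for BM25. The point shown is only that collection statistics computed "as of" a timestamp let an earlier ranking be reproduced after the collection has changed.

```python
import math
import sqlite3


def build_index(conn):
    # Hypothetical schema: one row per (term, document version).
    # valid_to IS NULL marks the version that is currently live.
    conn.execute(
        """CREATE TABLE postings (
               term TEXT, doc_id INTEGER, tf INTEGER,
               valid_from INTEGER, valid_to INTEGER)"""
    )


def add_document(conn, doc_id, text, ts):
    # Close the live version of this document (if any) instead of deleting it,
    # so that earlier index states stay queryable.
    conn.execute(
        "UPDATE postings SET valid_to = ? WHERE doc_id = ? AND valid_to IS NULL",
        (ts, doc_id),
    )
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    conn.executemany(
        "INSERT INTO postings VALUES (?, ?, ?, ?, NULL)",
        [(term, doc_id, tf, ts) for term, tf in counts.items()],
    )


def search(conn, term, as_of):
    # Only rows visible at timestamp as_of contribute to the statistics.
    visible = "valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts)"
    n_docs = conn.execute(
        f"SELECT COUNT(DISTINCT doc_id) FROM postings WHERE {visible}",
        {"ts": as_of},
    ).fetchone()[0]
    rows = conn.execute(
        f"SELECT doc_id, tf FROM postings WHERE term = :term AND {visible}",
        {"term": term, "ts": as_of},
    ).fetchall()
    if not rows:
        return []
    # Simplified tf-idf (an assumption of this sketch, not BM25).
    idf = math.log(n_docs / len(rows)) + 1.0
    scored = [(doc_id, tf * idf) for doc_id, tf in rows]
    # Sort by descending score; ties are broken by doc_id for determinism.
    return sorted(scored, key=lambda r: (-r[1], r[0]))


conn = sqlite3.connect(":memory:")
build_index(conn)
add_document(conn, 1, "column store index", ts=10)
add_document(conn, 2, "index index lucene", ts=10)
before = search(conn, "index", as_of=15)

# The collection evolves: document 1 is rewritten and document 3 is added.
add_document(conn, 1, "index index index monetdb", ts=20)
add_document(conn, 3, "wikipedia dump", ts=20)
after = search(conn, "index", as_of=25)

# Re-running the query with the old timestamp reproduces the old ranking.
replay = search(conn, "index", as_of=15)
```

In this sketch, `replay` equals `before` even though `after` ranks the documents differently, because both the term frequencies and the document counts are restricted to rows that were valid at the requested timestamp. The aggregations over all visible rows are the kind of analytical workload for which the abstract proposes a column-store.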
Title differs from the original; translated by the author