Reproducible ranking lists for retrieval from evolving document collections : how column-store technology enhances the capability of inverted indicees

Bösch, Hannes

doi:10.34726/hss.2017.24819

Record link:

https://doi.org/10.34726/hss.2017.24819
http://hdl.handle.net/20.500.12708/5118

Title:

Reproducible ranking lists for retrieval from evolving document collections : how column-store technology enhances the capability of inverted indicees

Citation:

Bösch, H. (2017). Reproducible ranking lists for retrieval from evolving document collections : how column-store technology enhances the capability of inverted indicees [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.24819

reposiTUm DOI:

10.34726/hss.2017.24819

CatalogPlus:

AC13711125

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Bösch, Hannes

Advisor:

Rauber, Andreas

Organisational Unit:

E188 - Institut für Softwaretechnik und Interaktive Systeme

Date (published):

2017

Number of Pages:

138

Keywords:

Information Retrieval; Text Retrieval; Reproducible Retrieval Ranked Lists; Column-Store Database; Inverted Index; Retrieval Index Performance; MonetDB; Apache Lucene; Data Citation; BM25; Wikipedia; Document Parsing; IR Prototyping

Abstract:

The core structure of (probabilistic) information retrieval (IR) systems lacks the ability to make retrieval result rankings reproducible. When the underlying data changes, IR indices change with it, and the history of tf-idf values in particular is hard to preserve. Thus, the same query might produce different results if the collection has been updated in the meantime. Little research addresses reproducibility in IR, although it would be desirable in fields such as research or patent applications. A first step in this direction is to make subsets of documents in a dynamically evolving data environment unambiguously identifiable. This can be achieved with structured data and a data schema suitable for scalable data citation (cf. section 3). This approach suggests maintaining a history of evolving data by tagging data records with timestamps and keeping a version history for each update to the collection. Conventional row-stores cannot cope with the large data volumes and statistics aggregations that IR applications require. The column-store architecture, however, is designed for analytical workloads and has already been proposed for IR prototyping (cf. section 4.2), an approach for building retrieval indices on top of an RDBMS. This thesis combines the concepts of IR prototyping and data citation to enhance retrieval indices so that they achieve reproducibility. It addresses how database schemas have to be shaped and whether these models are efficient enough to meet today's requirements for retrieval systems. The results are promising.
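
The idea of tagging records with timestamps so that a ranking can be recomputed for a past state of the collection can be illustrated with a minimal sketch. It is not the schema or implementation from the thesis: the names `Posting`, `VersionedIndex`, `valid_from`, and `valid_to` are hypothetical, the store is a plain in-memory Python list rather than a column-store such as MonetDB, and a simple tf-idf score stands in for BM25.

```python
import math
from dataclasses import dataclass

OPEN = float("inf")  # marks a record version that is still current


@dataclass
class Posting:
    term: str
    doc_id: str
    tf: int                 # term frequency in this document version
    valid_from: int         # timestamp at which this version became visible
    valid_to: float = OPEN  # timestamp at which it was superseded or deleted


class VersionedIndex:
    """Hypothetical inverted index that never overwrites old postings."""

    def __init__(self):
        self.postings = []

    def put(self, doc_id, text, ts):
        """Insert or update a document at timestamp ts, keeping old versions."""
        self.delete(doc_id, ts)
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        for term, tf in counts.items():
            self.postings.append(Posting(term, doc_id, tf, ts))

    def delete(self, doc_id, ts):
        """Close the current versions instead of removing them."""
        for p in self.postings:
            if p.doc_id == doc_id and p.valid_to == OPEN:
                p.valid_to = ts

    def search(self, query, as_of):
        """Rank documents with tf-idf using only records visible at as_of."""
        live = [p for p in self.postings if p.valid_from <= as_of < p.valid_to]
        n_docs = len({p.doc_id for p in live})
        scores = {}
        for term in set(query.lower().split()):
            hits = [p for p in live if p.term == term]
            if not hits:
                continue
            # Document frequency is taken from the historical state as well.
            idf = math.log(1 + n_docs / len(hits))
            for p in hits:
                scores[p.doc_id] = scores.get(p.doc_id, 0.0) + p.tf * idf
        # Sort by score, then doc_id, so ties are ordered deterministically.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = VersionedIndex()
index.put("d1", "column store index", ts=1)
index.put("d2", "inverted index", ts=2)
before = index.search("index", as_of=2)

# A later update changes the current ranking but not the one as of ts=2.
index.put("d3", "index index index", ts=3)
after = index.search("index", as_of=2)
```

Because both the postings and the collection statistics are filtered by the same timestamp, re-running the query with `as_of=2` after the update returns the same list as before it. In this sketch, a citable result would therefore be identified by the query text together with the timestamp.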

Additional information:

Title deviates from the original, as translated by the author

License:

In Copyright

Appears in Collections:

Thesis