Addressing data availability and document-to-document retrieval for domain-specific neural rankers

Althammer, Sophia

doi:10.34726/hss.2024.123265

Record link:

https://doi.org/10.34726/hss.2024.123265
http://hdl.handle.net/20.500.12708/198776

Title:

Addressing data availability and document-to-document retrieval for domain-specific neural rankers

Citation:

Althammer, S. (2023). Addressing data availability and document-to-document retrieval for domain-specific neural rankers [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.123265

reposiTUm DOI:

10.34726/hss.2024.123265

CatalogPlus:

AC17225967

Publication Type:

Thesis - Dissertation

Language:

English

Authors:

Althammer, Sophia

Advisor:

Hanbury, Allan

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2023

Number of Pages:

181

Keywords:

Information Retrieval; Neural Retrieval; Domain Specific Information Retrieval; Legal Information Retrieval; Patent Information Retrieval; Large Language Models

Abstract:

Neural ranking and retrieval models based on pretrained language models have demonstrated substantial effectiveness gains for Information Retrieval (IR) in the web domain compared to statistical and early neural ranking models. Bringing these advancements to domain-specific retrieval tasks poses multiple challenges: queries and documents can be much longer than in web search, and less high-quality evaluation and training data is available than for web search. In this thesis we address these challenges with the goal of promoting and improving the adoption of neural ranking and retrieval models for domain-specific retrieval tasks. Document-to-document retrieval tasks, where both the query and the documents in the corpus are long documents, are important in the legal and patent domains. We reproduce and improve a paragraph-level interaction re-ranking model for the document-to-document retrieval task of legal case retrieval, and we demonstrate the re-ranking model's effectiveness for prior art search in the patent domain. To bring improvements of first-stage retrieval methods from web search to document-to-document retrieval, we propose a paragraph aggregation retrieval model. This model frees neural first-stage retrieval models from their limited input length and increases effectiveness and interpretability in first-stage retrieval for legal case retrieval. We increase the availability of high-quality evaluation data by conducting an annotation campaign and comparing relevance signals from click data to our human-labeled annotations for domain-specific retrieval in the health domain. Since annotated training data is limited and expensive to produce for domain-specific retrieval tasks, we study training neural ranking and retrieval models under a limited annotation and training budget. We investigate active learning methods for improving the annotation efficiency of training neural ranking and retrieval models, focusing on a cost-based evaluation.
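
The general idea of aggregating paragraph-level retrieval results into a document-level ranking, as mentioned in the abstract, can be illustrated with a minimal sketch. The abstract does not describe the aggregation strategy the thesis actually uses, so everything below is an assumption made for illustration: the function name `aggregate_paragraph_rankings`, the reciprocal-rank scoring with constant `k`, and the toy document ids are all hypothetical. Each query paragraph is assumed to have already retrieved a ranked list of candidate paragraphs, represented here only by the id of the document each paragraph belongs to.

```python
from collections import defaultdict


def aggregate_paragraph_rankings(per_paragraph_rankings, k=60):
    """Fuse one ranked list per query paragraph into a document ranking.

    per_paragraph_rankings: list of ranked lists, best first. Each entry is
    the id of the document that a retrieved paragraph belongs to, so a
    document id may repeat within a list.
    k: smoothing constant of the reciprocal-rank score (assumed value).
    """
    doc_scores = defaultdict(float)
    for ranking in per_paragraph_rankings:
        seen = set()
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in seen:
                # Count only the best-ranked paragraph of each document
                # within a single query paragraph's result list.
                continue
            seen.add(doc_id)
            doc_scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))


# Toy example: three query paragraphs, each with its own retrieved list.
rankings = [
    ["case_A", "case_B", "case_A", "case_C"],
    ["case_B", "case_A"],
    ["case_C", "case_B"],
]
fused = aggregate_paragraph_rankings(rankings)
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.4f}")
```

Because every query paragraph is retrieved separately, no single model input has to hold the whole query document, and the per-paragraph lists show which paragraph contributed to a document's final rank. The fused score is only one plausible choice; other rank-based or score-based aggregations would fit the same structure.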

License:

In Copyright

Appears in Collections:

Thesis