<div class="csl-bib-body">
<div class="csl-entry">Althammer, S. (2023). <i>Addressing data availability and document-to-document retrieval for domain-specific neural rankers</i> [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.123265</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2024.123265
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/198776
-
dc.description.abstract
Neural ranking and retrieval models based on pretrained language models have demonstrated large effectiveness gains for Information Retrieval (IR) in the web domain compared to statistical and early neural ranking models. Bringing these advancements to domain-specific retrieval tasks poses multiple challenges for neural ranking and retrieval models: the queries and documents can be much longer than in web search, and there is less high-quality evaluation and training data available for domain-specific retrieval than for web search. In this thesis, we address these challenges with the goal of promoting and improving the adoption of neural ranking and retrieval models for domain-specific retrieval tasks. Document-to-document retrieval tasks, where both the query and the documents in the corpus are long documents, are important in the legal and patent domains. We reproduce and improve a paragraph-level interaction re-ranking model for the document-to-document retrieval task of legal case retrieval, and we demonstrate the re-ranking model's effectiveness for prior art search in the patent domain. To bring improvements of first-stage retrieval methods from web search to document-to-document retrieval, we propose a paragraph aggregation retrieval model. The paragraph aggregation retrieval model liberates neural first-stage retrieval models from their limited input length and increases effectiveness and interpretability in first-stage retrieval for legal case retrieval. We increase the availability of high-quality evaluation data by conducting an annotation campaign and comparing relevance signals from click data to our human-label annotations for domain-specific retrieval in the health domain. Since annotated training data is limited and expensive to produce for domain-specific retrieval tasks, we study training neural ranking and retrieval models under a limited annotation and training budget.
We investigate active learning methods to improve annotation efficiency when training neural ranking and retrieval models, focusing on a cost-based evaluation.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Information Retrieval
en
dc.subject
Neural Retrieval
en
dc.subject
Domain Specific Information Retrieval
en
dc.subject
Legal Information Retrieval
en
dc.subject
Patent Information Retrieval
en
dc.subject
Large Language Models
en
dc.title
Addressing data availability and document-to-document retrieval for domain-specific neural rankers
en
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2024.123265
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Sophia Althammer
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
tuw.publication.orgunit
E194 - Institut für Information Systems Engineering
-
dc.type.qualificationlevel
Doctoral
-
dc.identifier.libraryid
AC17225967
-
dc.description.numberOfPages
181
-
dc.thesistype
Dissertation
de
dc.thesistype
Dissertation
en
dc.rights.identifier
In Copyright
en
dc.rights.identifier
Urheberrechtsschutz
de
tuw.advisor.staffStatus
staff
-
tuw.advisor.orcid
0000-0002-7149-5843
-
item.languageiso639-1
en
-
item.openairetype
doctoral thesis
-
item.openairecristype
http://purl.org/coar/resource_type/c_db06
-
item.grantfulltext
open
-
item.cerifentitytype
Publications
-
item.fulltext
with Fulltext
-
item.mimetype
application/pdf
-
item.openaccessfulltext
Open Access
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering