Leventidis, A., Christensen, M. P., Lissandrini, M., Di Rocco, L., Hose, K., & Miller, R. J. (2024). A Large Scale Test Corpus for Semantic Table Search. In SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1142–1151). Association for Computing Machinery. https://doi.org/10.1145/3626772.3657877
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/210867
-
dc.description.abstract
Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle-in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. In practice, however, table search tasks often arise from the need to retrieve new datasets or augment existing ones, e.g., within data science or machine learning pipelines. Existing table repositories and benchmarks are limited in their ability to test retrieval methods for such table search tasks. To close this gap, we introduce a novel dataset for query-by-example Semantic Table Search. The dataset consists of two snapshots of the large-scale Wikipedia tables collection, from 2013 and 2019, with two important additions: (1) page- and topic-aware ground-truth relevance judgments and (2) large-scale DBpedia entity-linking annotations. Moreover, we generate a new set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource comprises 9,296 queries, 610,553 query-table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot; on the 2019 snapshot, it comprises 2,560 queries, 958,214 relevance annotations, and 457,714 tables in total. This makes our resource the largest annotated table-search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). A user study among domain experts shows that human annotators agree with the automatically generated relevance annotations. As a result, we can re-evaluate some basic assumptions behind existing table search approaches, identifying their shortcomings as well as promising new research directions.
en
dc.language.iso
en
-
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
-
dc.subject
benchmark
en
dc.subject
query-by-example
en
dc.subject
semantic search
en
dc.subject
table search
en
dc.title
A Large Scale Test Corpus for Semantic Table Search
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.rights.license
Creative Commons Namensnennung 4.0 International
de
dc.rights.license
Creative Commons Attribution 4.0 International
en
dc.contributor.affiliation
Northeastern University, United States of America
-
dc.contributor.affiliation
Aalborg University, Denmark
-
dc.contributor.affiliation
University of Verona, Italy
-
dc.contributor.affiliation
Northeastern University, United States of America
-
dc.contributor.affiliation
Northeastern University, United States of America