Leventidis, A., Christensen, M. P., Lissandrini, M., Di Rocco, L., Hose, K., & Miller, R. J. (2024). A Large Scale Test Corpus for Semantic Table Search. In SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1142–1151). Association for Computing Machinery. https://doi.org/10.1145/3626772.3657877
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/210867
-
dc.description.abstract
Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle-in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. In practice, however, table search tasks often arise from the need to retrieve new datasets or augment existing ones, e.g., within data science or machine learning pipelines. Existing table repositories and benchmarks are limited in their ability to test retrieval methods for such table search tasks. To close this gap, we introduce a novel dataset for query-by-example Semantic Table Search. The dataset consists of two snapshots of the large-scale Wikipedia tables collection, from 2013 and 2019, with two important additions: (1) page- and topic-aware ground-truth relevance judgments and (2) large-scale DBpedia entity-linking annotations. Moreover, we generate a new set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource comprises 9,296 queries, 610,553 query-table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot; on the 2019 snapshot, it comprises 2,560 queries, 958,214 relevance annotations, and 457,714 tables in total. This makes our resource the largest annotated table-search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). A user study among domain experts shows that human annotators agree with the automatically generated relevance annotations. As a result, we can re-evaluate some basic assumptions behind existing table search approaches, identifying their shortcomings as well as promising new research directions.
en
dc.language.iso
en
-
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
-
dc.subject
benchmark
en
dc.subject
query-by-example
en
dc.subject
semantic search
en
dc.subject
table search
en
dc.title
A Large Scale Test Corpus for Semantic Table Search
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.rights.license
Creative Commons Namensnennung 4.0 International
de
dc.rights.license
Creative Commons Attribution 4.0 International
en
dc.contributor.affiliation
Northeastern University, United States of America
-
dc.contributor.affiliation
Aalborg University, Denmark
-
dc.contributor.affiliation
University of Verona, Italy
-
dc.contributor.affiliation
Northeastern University, United States of America
-
dc.contributor.affiliation
Northeastern University, United States of America