Utilizing phonetic similarity for cross-source and cross-language toponym matching: a benchmark and prototype

Sagi, Tomer; Zaga, Moran; Rusinek, Sinai; Fekete, Marcell Richard; Bjerva, Johannes; Hose, Katja

doi:10.1007/s10579-025-09812-9

DC Field

Value

Language

dc.contributor.author

Sagi, Tomer

dc.contributor.author

Zaga, Moran

dc.contributor.author

Rusinek, Sinai

dc.contributor.author

Fekete, Marcell Richard

dc.contributor.author

Bjerva, Johannes

dc.contributor.author

Hose, Katja

dc.date.accessioned

2025-08-26T14:04:58Z

dc.date.available

2025-08-26T14:04:58Z

dc.date.issued

2025-02-26

dc.identifier.citation

Sagi, T., Zaga, M., Rusinek, S., Fekete, M. R., Bjerva, J., & Hose, K. (2025). Utilizing phonetic similarity for cross-source and cross-language toponym matching: a benchmark and prototype. Language Resources and Evaluation, 59(3), 2427–2451. https://doi.org/10.1007/s10579-025-09812-9

dc.identifier.issn

1574-020X

dc.identifier.uri

http://hdl.handle.net/20.500.12708/218552

dc.description.abstract

The writings of one ancient civilization often overlap in time and space with those of others. Many of these sources comprise unstructured text in ancient languages, causing scholars studying these civilizations to be siloed, often relying on sources in specific languages. Most recent efforts to extract structured information from historical scripts into place (toponym) and people (prosopography) databases have followed this pattern, focusing on one civilization and selected sources. The path to creating a common database runs through aligning names or toponyms between sources in disparate languages that use different scripts. Existing multi-lingual orthographic (string-based) comparison often relies on transliteration to a common script (Latin/English). Transliteration often yields multiple candidate spellings, which adds to the confusion. However, when integrating sources that overlap in space and time, the languages often share a common phonetic background. This commonality may prove beneficial. In this work, we present a benchmark for comparing toponyms from two linguistically and culturally related languages, namely Hebrew and Arabic. The benchmark comprises a set of dataset pairs created from historical sources written in medieval variants of these languages, later historical gazetteers, and a modern dataset curated from Wikidata. We empirically evaluate several toponym comparison approaches over the benchmark: transliteration to a common script, direct transliteration, and phonetic comparison using a common phonetic representation. We discuss the results and the limitations of the various methods and outline future work.

dc.language.iso

dc.publisher

SPRINGER

dc.relation.ispartof

Language Resources and Evaluation

dc.subject

Grapheme to phoneme

dc.subject

Multi-lingual

dc.subject

Toponym matching

dc.subject

Transliteration

dc.title

Utilizing phonetic similarity for cross-source and cross-language toponym matching: a benchmark and prototype

dc.type

Article

dc.type

Artikel

dc.identifier.scopus

2-s2.0-85218777714

dc.identifier.url

https://api.elsevier.com/content/abstract/scopus_id/85218777714

dc.contributor.affiliation

Aalborg University, Denmark

dc.contributor.affiliation

University of Haifa, Israel

dc.contributor.affiliation

University of Haifa, Israel

dc.contributor.affiliation

Aalborg University, Denmark

dc.contributor.affiliation

Aalborg University, Denmark

dc.description.startpage

2427

dc.description.endpage

2451

dc.type.category

Original Research Article

tuw.container.volume

tuw.container.issue

tuw.journal.peerreviewed

true

tuw.peerreviewed

true

wb.publication.intCoWork

International Co-publication

tuw.researchTopic.id

tuw.researchTopic.name

Logic and Computation

tuw.researchTopic.name

Information Systems Engineering

tuw.researchTopic.value

dcterms.isPartOf.title

Language Resources and Evaluation

tuw.publication.orgunit

E192-02 - Forschungsbereich Databases and Artificial Intelligence

tuw.publisher.doi

10.1007/s10579-025-09812-9

dc.date.onlinefirst

2025

dc.identifier.eissn

1574-0218

dc.description.numberOfPages

tuw.author.orcid

0000-0002-2197-116X

tuw.author.orcid

0000-0003-1043-7954

tuw.author.orcid

0009-0007-5025-7866

tuw.author.orcid

0000-0001-7025-8099

wb.sci

true

wb.sciencebranch

Informatik

wb.sciencebranch

Mathematik

wb.sciencebranch.oefos

1020

wb.sciencebranch.oefos

1010

wb.sciencebranch.value

item.openairecristype

http://purl.org/coar/resource_type/c_2df8fbb1

item.cerifentitytype

Publications

item.openairetype

research article

item.fulltext

no Fulltext

item.languageiso639-1

item.grantfulltext

none

crisitem.author.dept

E192-02 - Forschungsbereich Databases and Artificial Intelligence

crisitem.author.dept

University of Haifa

crisitem.author.dept

University of Haifa

crisitem.author.dept

Aalborg University

crisitem.author.dept

Aalborg University

crisitem.author.dept

E192-02 - Forschungsbereich Databases and Artificial Intelligence

crisitem.author.orcid

0000-0002-2197-116X

crisitem.author.orcid

0000-0003-1043-7954

crisitem.author.orcid

0009-0007-5025-7866

crisitem.author.orcid

0000-0001-7025-8099

crisitem.author.parentorg

E192 - Institut für Logic and Computation

crisitem.author.parentorg

E192 - Institut für Logic and Computation

Appears in Collections:

Article
