SCI-3000: A novel dataset for the task of figure, table, and caption extraction from scientific PDFs

Darmanovic, Filip

doi:10.34726/hss.2022.94800

Record link:

https://doi.org/10.34726/hss.2022.94800
http://hdl.handle.net/20.500.12708/81300

Citation:

Darmanovic, F. (2022). SCI-3000: A novel dataset for the task of figure, table, and caption extraction from scientific PDFs [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.94800

reposiTUm DOI:

10.34726/hss.2022.94800

CatalogPlus:

AC16667011

Publication Type:

Thesis - Diploma Thesis (Diplomarbeit)

Language:

English

Authors:

Darmanovic, Filip

Advisor:

Hanbury, Allan

Co-advisor:

Zlabinger, Markus

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2022

Keywords:

Page Object Detection; Figure Extraction; Table Extraction; Caption Extraction; PDF

Abstract:

With the amount of information presented visually in scientific publications constantly on the rise, the demand for making this information machine-actionable is rising as well. Uses for figures and similar visual elements range from search engines to cross-media machine learning approaches. Because extracting objects from documents, even born-digital ones, is non-trivial, an entire research field has formed around the task. However, progress is impeded by a lack of datasets for evaluation and machine learning. In this work, we use the crowd-sourcing platform Amazon Mechanical Turk (AMT) to annotate figures, tables, and their corresponding captions in a corpus of 3000 publications from the fields of computer science, biomedicine, chemistry, physics, and technology. We release these annotations together with their source publications as a dataset we call SCI-3000. This dataset is then used to benchmark two figure, table, and caption extraction approaches from recent literature: one rule-based and one deep learning-based. The latter performed better, with an average F1 score of 0.78, suggesting that deep learning approaches should be explored further in the pursuit of higher efficacy, especially for caption extraction.

License:

In Copyright

Appears in Collections:

Thesis