SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs

Darmanovic, Filip; Hanbury, Allan; Zlabinger, Markus

doi:10.1007/978-3-031-41676-7_14

DC Field

Value

Language

dc.contributor.author

Darmanovic, Filip

dc.contributor.author

Hanbury, Allan

dc.contributor.author

Zlabinger, Markus

dc.contributor.editor

Fink, Gernot

dc.contributor.editor

Jain, Rajiv

dc.contributor.editor

Kise, Koichi

dc.contributor.editor

Zanibbi, Richard

dc.date.accessioned

2024-01-23T14:30:06Z

dc.date.available

2024-01-23T14:30:06Z

dc.date.issued

2023-08-19

dc.identifier.citation

Darmanovic, F., Hanbury, A., & Zlabinger, M. (2023). SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs. In G. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), Document Analysis and Recognition - ICDAR 2023 : 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I (pp. 234–251). Springer Cham. https://doi.org/10.1007/978-3-031-41676-7_14

dc.identifier.uri

http://hdl.handle.net/20.500.12708/192601

dc.description.abstract

Extracting figures and similar visual elements from PDFs of scientific publications is important but non-trivial, and progress is impeded by a lack of datasets for evaluation and machine learning. In this work, we describe and publish the SCI-3000 dataset, containing 3 000 PDFs of scientific publications (34 791 pages) with annotations of figures, tables, and corresponding captions, from the fields of computer science, biomedicine, chemistry, physics, and technology. We demonstrate the use of the dataset to benchmark two figure, table, and caption extraction approaches from recent literature: one rule-based and one deep learning-based.

dc.language.iso

dc.subject

Caption Extraction

dc.subject

Figure Extraction

dc.subject

Table Extraction

dc.title

SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs

dc.type

Inproceedings

dc.type

Conference paper

dc.relation.publication

Document Analysis and Recognition - ICDAR 2023 : 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I

dc.contributor.editoraffiliation

TU Dortmund University, Germany

dc.contributor.editoraffiliation

Rajiv Jain Cinematography

dc.contributor.editoraffiliation

Rochester Institute of Technology, United States of America

dc.relation.isbn

978-3-031-41676-7

dc.relation.doi

10.1007/978-3-031-41676-7

dc.relation.issn

0302-9743

dc.description.startpage

234

dc.description.endpage

251

dc.type.category

Full-Paper Contribution

dc.relation.eissn

1611-3349

tuw.booktitle

Document Analysis and Recognition - ICDAR 2023 : 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I

tuw.container.volume

14187

tuw.relation.publisher

Springer Cham

tuw.researchTopic.id

tuw.researchTopic.name

Information Systems Engineering

tuw.researchTopic.value

100

tuw.publication.orgunit

E194-04 - Forschungsbereich Data Science

tuw.publisher.doi

10.1007/978-3-031-41676-7_14

dc.description.numberOfPages

tuw.author.orcid

0000-0002-7149-5843

tuw.author.orcid

0000-0003-2733-3043

tuw.editor.orcid

0000-0002-7446-7813

tuw.editor.orcid

0000-0002-5868-0147

tuw.editor.orcid

0000-0001-5921-9750

tuw.event.name

The 17th International Conference on Document Analysis and Recognition - ICDAR2023

tuw.event.startdate

2023-08-21

tuw.event.enddate

2023-08-26

tuw.event.online

On Site

tuw.event.type

Event for scientific audience

tuw.event.place

San José, California

tuw.event.country

tuw.event.presenter

Darmanovic, Filip

tuw.event.presenter

Hanbury, Allan

tuw.event.presenter

Zlabinger, Markus

tuw.event.track

Multi Track

wb.sciencebranch

Computer Science

wb.sciencebranch.oefos

1020

wb.sciencebranch.value

100

item.languageiso639-1

item.openairetype

conference paper

item.grantfulltext

none

item.fulltext

no Fulltext

item.cerifentitytype

Publications

item.openairecristype

http://purl.org/coar/resource_type/c_5794

crisitem.author.dept

E193 - Institut für Visual Computing and Human-Centered Technology

crisitem.author.dept

E194-04 - Forschungsbereich Data Science

crisitem.author.dept

E194-04 - Forschungsbereich Data Science

crisitem.author.orcid

0000-0002-7149-5843

crisitem.author.parentorg

E180 - Fakultät für Informatik

crisitem.author.parentorg

E194 - Institut für Information Systems Engineering

crisitem.author.parentorg

E194 - Institut für Information Systems Engineering

Appears in Collections:

Conference Paper
