<div class="csl-bib-body">
<div class="csl-entry">Darmanovic, F., Hanbury, A., & Zlabinger, M. (2023). SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs. In G. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), <i>Document Analysis and Recognition - ICDAR 2023 : 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I</i> (pp. 234–251). Springer Cham. https://doi.org/10.1007/978-3-031-41676-7_14</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/192601
-
dc.description.abstract
Extracting figures and similar visual elements from PDFs of scientific publications is important but non-trivial, and progress is impeded by a lack of datasets for evaluation and machine learning. In this work, we describe and publish the SCI-3000 dataset, containing 3 000 PDFs of scientific publications (34 791 pages) with annotations of figures, tables, and corresponding captions, from the fields of computer science, biomedicine, chemistry, physics, and technology. We demonstrate the use of the dataset to benchmark two figure, table, and caption extraction approaches from recent literature: one rule-based and one deep learning-based.
en
dc.language.iso
en
-
dc.subject
Caption Extraction
en
dc.subject
Figure Extraction
en
dc.subject
Table Extraction
en
dc.title
SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.relation.publication
Document Analysis and Recognition - ICDAR 2023 : 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I
-
dc.contributor.editoraffiliation
TU Dortmund University, Germany
-
dc.contributor.editoraffiliation
Adobe, United States of America
-
dc.contributor.editoraffiliation
Rochester Institute of Technology, United States of America
-
dc.relation.isbn
978-3-031-41676-7
-
dc.relation.doi
10.1007/978-3-031-41676-7
-
dc.relation.issn
0302-9743
-
dc.description.startpage
234
-
dc.description.endpage
251
-
dc.type.category
Full-Paper Contribution
-
dc.relation.eissn
1611-3349
-
tuw.booktitle
Document Analysis and Recognition - ICDAR 2023 : 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I
-
tuw.container.volume
14187
-
tuw.relation.publisher
Springer Cham
-
tuw.researchTopic.id
I4
-
tuw.researchTopic.name
Information Systems Engineering
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E194-04 - Forschungsbereich Data Science
-
tuw.publisher.doi
10.1007/978-3-031-41676-7_14
-
dc.description.numberOfPages
18
-
tuw.author.orcid
0000-0002-7149-5843
-
tuw.author.orcid
0000-0003-2733-3043
-
tuw.editor.orcid
0000-0002-7446-7813
-
tuw.editor.orcid
0000-0002-5868-0147
-
tuw.editor.orcid
0000-0001-5921-9750
-
tuw.event.name
The 17th International Conference on Document Analysis and Recognition - ICDAR2023
en
tuw.event.startdate
21-08-2023
-
tuw.event.enddate
26-08-2023
-
tuw.event.online
On Site
-
tuw.event.type
Event for scientific audience
-
tuw.event.place
San José, California
-
tuw.event.country
US
-
tuw.event.presenter
Darmanovic, Filip
-
tuw.event.presenter
Hanbury, Allan
-
tuw.event.presenter
Zlabinger, Markus
-
tuw.event.track
Multi Track
-
wb.sciencebranch
Computer Sciences
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.value
100
-
item.languageiso639-1
en
-
item.openairetype
conference paper
-
item.grantfulltext
none
-
item.fulltext
no Fulltext
-
item.cerifentitytype
Publications
-
item.openairecristype
http://purl.org/coar/resource_type/c_5794
-
crisitem.author.dept
E193 - Institut für Visual Computing and Human-Centered Technology
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.orcid
0000-0002-7149-5843
-
crisitem.author.parentorg
E180 - Fakultät für Informatik
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering