Extraktion von bibliographischen Informationen aus PDFs von Publikationen

Purker, Angela

DC Field

Value

Language

dc.contributor.advisor

Hanbury, Allan

dc.contributor.author

Purker, Angela

dc.date.accessioned

2022-09-02T06:03:46Z

dc.date.issued

2020

dc.date.submitted

2020

dc.identifier.citation

Purker, A. (2020). Extraktion von bibliographischen Informationen aus PDFs von Publikationen [Diploma Thesis, Technische Universität Wien]. reposiTUm. http://hdl.handle.net/20.500.12708/79855

dc.identifier.uri

http://hdl.handle.net/20.500.12708/79855

dc.description.abstract

Das Portable Document Format (PDF) hat sich in den letzten Jahrzehnten als De-facto-Standard zum digitalen Dokumentaustausch durchgesetzt. Publikationen im wissenschaftlichen Bereich liegen zu einem großen Teil in diesem Format vor. Das PDF wurde vor allem für die Vorbereitung zum Druck entworfen, weshalb bibliographische Daten wie Autoren, Abstract, Kapitel und Referenzen nicht direkt aus der PDF-Datei gelesen werden können. Um die immer größer werdende Anzahl an Publikationen organisieren zu können, werden Tools zur Extraktion dieser Metadaten benötigt. In dieser Arbeit soll eine Extraktionsmethode gefunden werden, die möglichst zufriedenstellende Ergebnisse bei der Extraktion dieser Daten aus Publikationen der TU Wien liefert. Dazu wird eine Ground Truth von 100 Publikationen der Fakultät für Informatik erstellt, mit der die bestehenden Extraktions-Frameworks Cermine, GROBID, ParsCit und PDFX evaluiert werden. Die Erkenntnisse aus den Evaluierungsergebnissen werden genutzt, um die bestehenden Methoden zu verbessern. Eine erneute Evaluierung dieser Verbesserungen hat ergeben, dass ein weiteres Training des GROBID-Modells mit zusätzlich erstellten Trainingsdateien keine signifikante Verbesserung ergibt. Da ein hoher Aufwand notwendig ist, um Trainingsdateien zu annotieren, wäre es wünschenswert, die Erstellung in Zukunft automatisieren zu können. Die extrahierten Daten könnten in Zukunft genutzt werden, um eine Zitatsdatenbank für die Publikationen der TU Wien zu erstellen und dadurch Aussagen über Verbindungen zwischen Publikationen machen zu können.

dc.description.abstract

The Portable Document Format (PDF) has become a de facto standard for exchanging digital documents, and a large share of scientific publications use it. PDF was designed primarily to prepare documents for print, which is why bibliographic data such as authors, abstract, chapters, and references cannot be read directly from a PDF file. Organizing the constantly growing number of scientific documents therefore requires tools that extract this metadata. This thesis seeks an extraction method that gives the best possible results for publications of TU Wien. To this end, a ground truth of 100 publications from the Faculty of Informatics is created and used to evaluate the existing extraction frameworks Cermine, GROBID, ParsCit, and PDFX. The evaluation results are then used to enhance GROBID. A second evaluation shows that training the GROBID models with additional training files does not increase the F1 score. Because annotating training data takes considerable effort, it would be desirable to automate this step. In the future, the extracted data could serve as input for a citation database of TU Wien publications.

dc.format

xv, 67 Seiten

dc.language

Deutsch

dc.language.iso

dc.subject

Informationsextraktion

dc.subject

bibliographische Informationen

dc.subject

PDF-Dateien

dc.subject

Medizinische Publikationen

dc.subject

Information Extraction

dc.subject

bibliographic information

dc.subject

PDF Files

dc.subject

Medical publications

dc.title

Extraktion von bibliographischen Informationen aus PDFs von Publikationen

dc.title.alternative

Extraction of bibliographic information from PDFs of publications

dc.type

Thesis

dc.type

Hochschulschrift

dc.contributor.affiliation

TU Wien, Österreich

dc.publisher.place

Wien

tuw.thesisinformation

Technische Universität Wien

tuw.publication.orgunit

E188 - Institut für Softwaretechnik und Interaktive Systeme

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC16077475

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

tuw.advisor.staffStatus

staff

tuw.advisor.orcid

0000-0002-7149-5843

item.languageiso639-1

item.openairetype

master thesis

item.grantfulltext

none

item.fulltext

no Fulltext

item.cerifentitytype

Publications

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

crisitem.author.dept

TU Wien

Appears in Collections:

Thesis
