User-guided information extraction from print-oriented documents

Hassan, Tamir

DC Field

Value

Language

dc.contributor.advisor

Gottlob, Georg

dc.contributor.author

Hassan, Tamir

dc.date.accessioned

2020-06-30T04:50:11Z

dc.date.issued

2010

dc.date.submitted

2010-05

dc.identifier.citation

Hassan, T. (2010). User-guided information extraction from print-oriented documents [Dissertation, Technische Universität Wien]. reposiTUm. https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-31713

dc.identifier.uri

https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-31713

dc.identifier.uri

http://hdl.handle.net/20.500.12708/10524

dc.description

Abstract in German

dc.description.abstract

In recent years, several systems for wrapping, i.e. user-guided information extraction, from Web sources have been developed in both academic and commercial settings. An important characteristic of Web documents is their inherent tree structure, which wrapping systems such as the Lixto Visual Wrapper use to locate instances of the data to be extracted. Because this tree structure corresponds reasonably well to the logical structure of the content, such methods can work successfully.
This dissertation deals with extending these wrapping techniques to print-oriented documents in PDF format. This task is a major challenge, as the logical structure of a PDF is (usually) not explicitly present in the file.
The use of techniques from the fields of document analysis and document understanding to rediscover this structure from the layout conventions used in the presentation therefore forms a central theme of this doctoral thesis.
Two approaches to wrapping PDF documents are presented: the conversion approach and the graph-based approach. The conversion approach relies on the automatic, structured conversion of PDF documents into HTML, which is subsequently processed with the Lixto Visual Wrapper to extract the desired data. In this approach, the emphasis is placed on the recognition of tables and their structured representation in HTML.
The graph-based approach provides an entirely new method for specifying wrapper programs directly on the physical structure of the document, which extracts the desired data by means of a procedure based on subgraph isomorphism. As this approach does not depend on a complete, accurate recognition of structures, it enables the wrapping of a wider range of documents with lower susceptibility to errors. In addition, the graph structure resembles the physical structure of the document, is therefore more intuitive for the user to understand, and allows the user-friendly, interactive creation of wrapper programs.

dc.description.abstract

In recent years, a number of systems have been developed in the academic and commercial domains for wrapping, or user-guided information extraction, from Web sources. An important feature of Web documents is their inherent tree structure, which wrapping systems such as the Lixto Visual Wrapper use to locate instances of the data to be extracted. Because this tree structure approximately reflects the logical structure of the content, such wrapping methods work successfully.
This dissertation is concerned with extending these wrapping techniques to PDF documents. This is a challenging task, as the logical structure of a PDF is (usually) not explicitly available in the file. The use of document analysis and document understanding techniques to rediscover this structure from the layout conventions used in the document's presentation is therefore a central theme of this thesis.
We present two approaches to wrapping PDF documents: the conversion approach and the graph-based approach. The conversion approach is based on an automated, structured conversion of PDF documents into HTML; the resulting HTML is then used as input to the Lixto Visual Wrapper to extract the desired information. In this approach, we place particular emphasis on detecting tables and representing them in a structured manner in HTML.
The graph-based approach represents a novel method for specifying wrapping programs directly on the physical structure of the document. Using an algorithm based on subgraph isomorphism, other instances of the data are found. As this approach does not rely on the complete and accurate detection of structures in the document, it is more robust and enables a much wider range of documents to be wrapped. Because the physical structure is more intuitive for the user, this approach also enables wrapper programs to be created in a user-friendly, interactive way.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Wrapping

dc.subject

PDF

dc.subject

Informationsextraktion

dc.subject

druckorientiert

dc.subject

Document analysis

dc.subject

Document understanding

dc.subject

Graph-Matching

dc.subject

wrapping

dc.subject

PDF

dc.subject

information extraction

dc.subject

print-oriented

dc.subject

document analysis

dc.subject

document understanding

dc.subject

graph matching

dc.title

User-guided information extraction from print-oriented documents

dc.type

Thesis

dc.type

University thesis

dc.rights.license

In Copyright

dc.rights.license

Copyright protection

dc.contributor.affiliation

TU Wien, Austria

dc.rights.holder

Tamir Hassan

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Kappel, Gertrude

tuw.publication.orgunit

E184 - Institut für Informationssysteme

dc.type.qualificationlevel

Doctoral

dc.identifier.libraryid

AC07807583

dc.description.numberOfPages

141

dc.identifier.urn

urn:nbn:at:at-ubtuw:1-31713

dc.thesistype

Dissertation

dc.rights.identifier

In Copyright

dc.rights.identifier

Copyright protection

tuw.assistant.orcid

0000-0002-4758-9436

item.languageiso639-1

item.openairetype

doctoral thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_db06

item.openaccessfulltext

Open Access

crisitem.author.dept

E186 - Institut für Computergraphik und Algorithmen

crisitem.author.parentorg

E180 - Fakultät für Informatik

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(7.04 MB)

In Copyright

Page view(s)

556

checked on Nov 19, 2023

Download(s)

203

checked on Nov 19, 2023
