Content profiling for digital preservation

Petrov, Petar

Record link:

https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-50809
http://hdl.handle.net/20.500.12708/10976

Title:

Content profiling for digital preservation

Citation:

Petrov, P. (2013). Content profiling for digital preservation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-50809

CatalogPlus:

AC07815471

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Petrov, Petar

Advisor:

Rauber, Andreas

Co-advisor:

Becker, Christoph

Organisational Unit:

E188 - Institut für Softwaretechnik und Interaktive Systeme

Date (published):

2013

Number of Pages:

Keywords:

Content Profiling; Preservation Planning; Long-term Preservation; Metadata; Scalability

Content Profiling; Preservation Planning; Digital Preservation; Meta Data; Scalability; Sample Selection

Abstract:

Information technologies help us manage our digital content easily. This is the reason for the currently remarkable production of digital data. However, this causes many technical and social problems relating to security, long-term preservation and access. Digital preservation tries to solve exactly these problems, which relate to hardware and software obsolescence, and to guarantee future access. To make a meaningful decision about the future of a digital collection, one must follow a planning process. The result of this complex process is a preservation plan. It is an artefact that specifies the concrete actions for the long-term preservation of a set of digital objects and includes potential alternatives and the reasons for the decision taken. The decision is based on knowledge about the content and on the results of evaluation experiments carried out on sample objects. For this reason, creating a content profile, which consists of a comprehensive description of the collection and a small set of sample objects, is indisputably crucial for an effective planning process. In general, content profiling consists of three parts: characterisation, aggregation and analysis. In the first step, the digital objects are identified and metadata is extracted. In the aggregation phase, the collected data is presented in a compressed form. In the last step, relevant aspects of the collection are determined through in-depth analysis and made available for further processing. Because of the large volume of data, experts today face many challenges. On the one hand, characterising digital objects is a cumbersome and error-prone process. On the other hand, aggregating the output of different tools with complex schemas is a task that requires expert knowledge and is unwieldy at large scale. The lack of a comprehensive description and overview is often the reason for selecting objects at random. This leads to the selection of objects that are not representative and can lead to distorted experiments. In the current state of digital preservation, there are no solutions that allow a detailed profile of significant data sets to be created automatically, representative subsets to be selected, and the results to be presented in a semi-structured format. In this thesis we examine the existing gaps in the area of content profiling and the planning process. The contribution of this work is a conceptual solution to the problem and an implementation in the form of a prototype. The prototype can work on collections of substantial size and supports in-depth analysis. Finally, the prototype is evaluated in two case studies on data collections of significant volume.

Information technology enables us to organise and manage our digital content in collections with ease. As a result, massive volumes of data are produced each day. However, this creates a large set of technical and social issues regarding the safety and long-term accessibility of that data. Digital preservation addresses issues related to hardware and software obsolescence and tries to keep our digital content accessible in the long term. In order to make a meaningful decision about the course of action that should be chosen for a digital collection, preservation planning is conducted. The result of this rather complex and time-consuming process is a preservation plan. A preservation plan is an artefact that specifies a concrete action for the preservation of a set of objects and includes potential alternative actions and the reasons for the decision. The decision is based on knowledge about the content and the evaluation results of experiments conducted on sample objects of the collection. Inarguably, a content profile, which is a thorough description of the collection together with a small set of representative sample objects, is crucial for effective planning. In general, the content profiling process consists of three parts: characterisation, aggregation and analysis. Characterisation is responsible for the extraction of metadata and the identification of digital objects, while aggregation offers a compressed view of them. In the last step, analysis, relevant aspects of the content are found and presented for further processing by preservation planning. Because of the large volume of data, planners face many technical challenges. On the one hand, characterising millions of digital objects is a cumbersome and error-prone process. On the other hand, aggregating the output of various characterisation tools with complex output schemas is a highly tedious task that requires the expertise of preservation experts and is almost impossible at large scale. The lack of a thorough description and overview of the data often forces planners to select sample objects at random or based on a single property of the data. This results in subsets that are not representative and could lead to biased experiments. The current state of the art does not offer solutions that are able to automatically create an in-depth profile of a significantly large set of digital objects, select representative samples and expose them in a semi-structured format. In this thesis we examine the existing gaps in content profiling and its importance within preservation planning. The contribution of this work is a conceptual solution to the content profiling problem, showing how it could be approached, and a software prototype implementing the process. The presented prototype can operate on collections on the scale of about a million objects. It helps to conduct an in-depth analysis, as well as to select sample objects based on different algorithms. We evaluate the prototype in two case studies using data collections of significant size.
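
The aggregation and sample selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis prototype or any of its algorithms: the record layout, the function names `aggregate` and `select_samples`, and the "median-sized object per format" heuristic are all assumptions chosen only to show the idea of summarising characterisation output and picking samples from each group rather than at random.

```python
from collections import Counter, defaultdict

# Hypothetical characterisation output: one record per digital object.
# Field names and values are illustrative assumptions, not from the thesis.
records = [
    {"id": "a.pdf", "format": "PDF", "size": 100},
    {"id": "b.pdf", "format": "PDF", "size": 300},
    {"id": "c.pdf", "format": "PDF", "size": 200},
    {"id": "d.tif", "format": "TIFF", "size": 5000},
    {"id": "e.tif", "format": "TIFF", "size": 7000},
]


def aggregate(records, prop):
    """Aggregation: count how many objects share each value of one property."""
    return Counter(r[prop] for r in records)


def select_samples(records, prop, per_group=1):
    """Sample selection (assumed heuristic): take objects around the median
    size from every group, so each property value is represented."""
    groups = defaultdict(list)
    for r in records:
        groups[r[prop]].append(r)

    samples = []
    for value in sorted(groups):
        # Sort by size, then id, so the result is deterministic.
        members = sorted(groups[value], key=lambda r: (r["size"], r["id"]))
        mid = len(members) // 2
        start = max(0, mid - per_group // 2)
        samples.extend(members[start:start + per_group])
    return samples


if __name__ == "__main__":
    print(dict(aggregate(records, "format")))
    print([r["id"] for r in select_samples(records, "format")])
```

Grouping before sampling is one simple way to avoid the problem the abstract names, where random selection or selection on a single property yields subsets that are not representative; a real profile would combine several properties.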

Additional information:

Thesis includes a summary in German

License:

In Copyright

Appears in Collections:

Thesis