Chandradoss, R. J. (2009). Wrapper adaptation and generic wrappers [Dissertation, Technische Universität Wien]. reposiTUm. http://hdl.handle.net/20.500.12708/177810
Web Extraktion; Wrapper Adaptierung; Schema Matching
de
Web Extraction; Wrapper Adaptation; Schema Matching; Depth First Search
en
Abstract:
The Internet is a huge source of information, and the amount of data posted on it grows exponentially with time. Corporate data available on the Internet, such as crude oil prices in the market, the availability of tickets on a commercial airliner, hotels in a given neighborhood, and much more, are business-specific facts that carry leverage for business operations. There is great interest in the corporate world in accessing and extracting this business-specific, real-time data for strategic advantage. Information rendered on web pages is usually meant for human consumption. A wrapper is a set of extraction rules that automatically extracts data from the Internet and transforms it for further processing. For instance, the HTML content rendered in a web page is extracted and converted to XML. Wrapping, the process of extracting such data, has grown from a nascent technology into a growing number of commercial products in the last few years.
Wrappers use the structural, syntactic, and semantic properties exhibited by a web page to identify and acquire relevant data. Web pages change in structure and content over time, and these changes range from cosmetic displacements to a complete rearrangement. Wrappers must therefore be maintained, because a structural change on a page leaves dependent wrappers unusable. Manual wrapper maintenance is an expensive, cumbersome, and error-prone effort whose cost usually exceeds the original cost of creating the wrappers.
This thesis shows how Extended Depth First Search, a variation of the Edit Distance algorithm, web page metadata, and schema matching techniques can be applied to adapt wrappers automatically or semi-automatically to structural changes on web pages. It formulates the wrapper adaptation task as a subset of schema matching, which opens up a wide range of schema matching solutions that could be applied to automatic wrapper adaptation.
This thesis explains a hybrid search and mapping technique that applies both content-type and structural information of a web page to reduce wrapper creation effort.
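As a rough illustration of how an edit distance can help relocate data after a structural change, the following minimal sketch compares the tag path a wrapper used on the old page with candidate tag paths on the changed page and picks the closest one. It is not the thesis's algorithm: it uses the classic Levenshtein distance rather than the variation the thesis describes, and the function names `edit_distance` and `adapt_path`, the slash-separated path format, and the sample paths are all assumptions made for this example.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def adapt_path(old_path: str, new_paths: List[str]) -> str:
    """Return the candidate path on the changed page closest to old_path.

    Paths are hypothetical slash-separated tag sequences, e.g. "html/body/table".
    """
    old = old_path.strip("/").split("/")
    return min(new_paths,
               key=lambda p: edit_distance(old, p.strip("/").split("/")))


# Hypothetical example: the page gained a wrapping <div> around the table.
old_path = "html/body/table/tr/td"
candidates = [
    "html/head/title",
    "html/body/div/ul/li",
    "html/body/div/table/tr/td",
]
print(adapt_path(old_path, candidates))
```

A real adaptation step would also need to weigh content type and other metadata, as the abstract notes; path similarity alone can pick a structurally similar but semantically wrong node.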