Do, B.-L. (2017). Semantic integration and exploration of statistical data [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.47741
In recent years, the amount of statistical data available on the web has grown dramatically. Numerous organizations and governments publish statistical data in a multitude of formats and encodings, using different scales, and providing access through a wide range of mechanisms. Because of these inconsistent publishing practices, analyzing heterogeneous and dispersed statistical data is challenging. This thesis addresses three major challenges in integrating and exploring disparate statistical data sources: (i) data heterogeneity; (ii) interconnection between statistical data sets; and (iii) uniform and integrated access to individual data sets. To this end, we rely on semantic technologies, standards, services, and vocabularies. We use the RDF data model and the Data Cube vocabulary to consolidate heterogeneous data representations. Based on the RDF mapping language, raw data sets are lifted into RDF following the Data Cube vocabulary. To link the URIs used in data sources, we define URI design patterns to coin a set of shared URIs. In addition, we develop algorithms to map components (e.g., spatial dimension, temporal dimension) and align values. In order to query and integrate individual data sets, we model statistical data sets in metadata descriptions in a uniform manner. Each metadata description contains information on (i) the detailed data structure and access method for query building, and (ii) link relationships that connect the components and values used in the data set to the set of shared URIs. The well-defined metadata thus provides a standardized conceptual layer over each data set. Relying on the metadata repository, a mediator provides semantic integration of, and uniform access to, multiple heterogeneous data sources. The mediator can transform generic queries into suitable queries for individual data sets, perform scale transformation, rewrite individual results, and integrate them into a consolidated result.
We implement this approach in StatSpace, a linked statistical data space that provides uniform access to more than 1,800 data sets published by a variety of data providers including the World Bank, the European Union, and the European Environment Agency. We evaluate our approach in terms of coverage, validity, and performance.
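The mediator step described above (query rewriting, scale transformation, result rewriting, and consolidation) can be illustrated with a minimal in-memory sketch. This is not the StatSpace implementation: the `DatasetMetadata` class, the `query_dataset` and `mediate` functions, the `example.org` URIs, the column names, and the numbers are all hypothetical stand-ins chosen for illustration, and plain Python lists replace the remote data sources and SPARQL queries a real mediator would use.

```python
from dataclasses import dataclass

# Hypothetical shared URIs (illustrative only, not the thesis's URI patterns).
AREA = "http://example.org/dimension/refArea"
PERIOD = "http://example.org/dimension/refPeriod"
AT = "http://example.org/area/AT"
Y2015 = "http://example.org/period/2015"


@dataclass
class DatasetMetadata:
    """Hypothetical metadata description for one data set."""
    name: str
    component_map: dict  # shared component URI -> local column name
    value_map: dict      # shared value URI -> local code
    scale: float         # factor converting local measure values to base units
    rows: list           # stand-in for the remote source: a list of dicts


def query_dataset(meta, generic_filter):
    """Rewrite a generic filter into local terms, run it, and lift the results."""
    # Query rewriting: shared URIs -> local column names and local codes.
    local_filter = {
        meta.component_map[comp]: meta.value_map[val]
        for comp, val in generic_filter.items()
    }
    reverse_components = {col: uri for uri, col in meta.component_map.items()}
    reverse_values = {code: uri for uri, code in meta.value_map.items()}
    results = []
    for row in meta.rows:
        if all(row[col] == code for col, code in local_filter.items()):
            # Result rewriting: local columns and codes -> shared URIs.
            lifted = {
                reverse_components[col]: reverse_values.get(val, val)
                for col, val in row.items()
                if col in reverse_components
            }
            # Scale transformation: bring the measure to a common scale.
            lifted["value"] = row["value"] * meta.scale
            lifted["source"] = meta.name
            results.append(lifted)
    return results


def mediate(generic_filter, repository):
    """Send the generic query to every data set and merge the answers."""
    merged = []
    for meta in repository:
        merged.extend(query_dataset(meta, generic_filter))
    return merged


# Two toy sources with different column names, codes, and scales (made-up data).
repository = [
    DatasetMetadata(
        name="source_a",
        component_map={AREA: "country", PERIOD: "year"},
        value_map={AT: "AUT", Y2015: "2015"},
        scale=1000.0,  # values reported in thousands
        rows=[
            {"country": "AUT", "year": "2015", "value": 8600.0},
            {"country": "DEU", "year": "2015", "value": 81000.0},
        ],
    ),
    DatasetMetadata(
        name="source_b",
        component_map={AREA: "geo", PERIOD: "time"},
        value_map={AT: "AT", Y2015: 2015},
        scale=1.0,  # values reported in units
        rows=[{"geo": "AT", "time": 2015, "value": 8_600_000}],
    ),
]

results = mediate({AREA: AT, PERIOD: Y2015}, repository)
for r in results:
    print(r["source"], r[AREA], r[PERIOD], r["value"])
```

In this sketch the caller only ever uses shared URIs; each source's column names, codes, and scale stay confined to its metadata description, which is the role the abstract assigns to the metadata repository.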