Jozaf, J. (2015). Solving data quality problems in data warehousing by means of data exchange techniques [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.28969
Since data warehousing has come to play an important role for larger companies in recent years, data quality has become an essential part of collecting, integrating, and delivering trustworthy data to business users. Over time, the subject of data quality has been examined in many theoretical research papers, and many of these findings have been successfully applied in practice. However, new challenges have recently emerged, creating greater demand for more detailed data quality checks and for the processing of constantly growing amounts of data. Under these new demands, data validation within the overall data quality process has become considerably slower than before. The individual data validation checks therefore have to be optimized to achieve better performance. This problem area is the main focus of this thesis and can be summarized by the following question: How can we validate the correctness of data in a high-performance way, so that the on-time processing of daily operations within data warehouses is guaranteed? Although the techniques and tools currently used for data validation deliver significantly better performance than in previous years, they are still not sufficient to validate the continuously growing data volume in acceptable time. In our effort to improve the currently used data validation algorithms, we looked at recent research results in the field of data exchange. The algorithmic similarity between data validation and data exchange suggests examining results from the data exchange area with the idea of putting them into the role of data validation. In this thesis, we present and examine possibilities for implementing such a data exchange solution, with a focus on processing large numbers of data quality checks. The aim is to find an algorithm that minimizes the number of data checks to be executed while preserving their original declarative definition. The approach chosen for this thesis takes a detailed look at the research done in data exchange and tries to find where it can best be applied to bring added value to the data quality field. This added value is measured in terms of performance relative to a standard data quality implementation. As a result of this work, we present findings and recommendations, as well as the possible restrictions of using data exchange in the data quality area. The presented results are based on a concrete implementation of the proposed algorithm in several data validation scenarios.
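As a rough, hypothetical illustration of the similarity between the two settings (not a description of the algorithm developed in the thesis), a declarative data quality rule can be written in the same logical form as a source-to-target dependency known from data exchange; the relation names Customer and Account below are invented for the example.

\[
\forall c\,\forall n\;\bigl(\mathit{Customer}(c,n)\;\rightarrow\;\exists a\;\mathit{Account}(c,a)\bigr)
\]

In data exchange, the chase procedure uses such a dependency to generate a missing Account tuple in the target instance; read as a validation rule, the same formula is only checked, and every Customer tuple without a matching Account tuple is reported as a data quality violation.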