Filzmoser, P., & Mazak-Huemer, A. (2023). Massive Data Sets – Is Data Quality Still an Issue? In B. Vogel-Heuser & M. Wimmer (Eds.), Digital Transformation (Vol. 1, pp. 269–279). Springer Vieweg. https://doi.org/10.1007/978-3-662-65004-2_11
Data Management; Data Analytics; Model Integration; Cloud Computing; Blockchain
Language: en
Abstract:
The term “big data” has become a buzzword in recent years; it refers to the possibility of collecting and storing huge amounts of information, resulting in large databases and data repositories. This also holds for industrial applications: in a production process, for instance, many sensors can be installed and data recorded at very high temporal resolution. The amount of information grows rapidly, but the insight into the production process does not necessarily grow with it. This is where machine learning or, say, statistics needs to enter, because sophisticated algorithms are required, for example, to identify the relevant parameters that drive the quality of the product. However, is data quality still an issue? It is clear that with small amounts of data, single outliers or extreme values can affect the algorithms or statistical methods. Can “big data” overcome this problem? In this article we focus on some specific problems in the regression context and show that even if many parameters are measured, poor data quality can severely impair the prediction performance of the methods.
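The following is not part of the record; it is a minimal Python sketch illustrating the abstract's point that a single gross error can distort least-squares regression and degrade its prediction performance. The simulated relationship, the contamination value, and the sample sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a simple linear relationship: y = 2*x + noise.
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)

# Contaminate a single training observation with a gross error,
# e.g., one faulty sensor reading out of 100.
y_bad = y.copy()
y_bad[0] = 200.0

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Ordinary least squares on clean and on contaminated data.
beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_bad, *_ = np.linalg.lstsq(X, y_bad, rcond=None)

# Evaluate prediction error on fresh, clean test data.
m = 1000
x_test = rng.uniform(0, 10, m)
y_test = 2.0 * x_test + rng.normal(0, 1, m)
X_test = np.column_stack([np.ones(m), x_test])

for name, beta in [("clean fit", beta_clean), ("contaminated fit", beta_bad)]:
    rmse = np.sqrt(np.mean((X_test @ beta - y_test) ** 2))
    print(f"{name}: slope={beta[1]:.3f}, test RMSE={rmse:.3f}")
```

Running this shows the contaminated fit with a biased slope and a noticeably larger test RMSE than the clean fit, even though only 1% of the observations are wrong. Robust regression methods, which are the subject of the cited article, are designed to limit exactly this kind of influence.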
Research Areas:
Modeling and Simulation: 50%; Computational Materials Science: 50%