<div class="csl-bib-body">
<div class="csl-entry">Filzmoser, P., & Mazak-Huemer, A. (2023). Massive Data Sets – Is Data Quality Still an Issue? In B. Vogel-Heuser & M. Wimmer (Eds.), <i>Digital Transformation</i> (Vol. 1, pp. 269–279). Springer Vieweg. https://doi.org/10.1007/978-3-662-65004-2_11</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/175993
-
dc.description.abstract
The term “big data” has become a buzzword in recent years; it refers to the ability to collect and store huge amounts of information, resulting in large databases and data repositories. This also holds for industrial applications: in a production process, for instance, many sensors can be installed and data recorded at very high temporal resolution. The amount of information grows rapidly, but the insight into the production process does not necessarily grow with it. This is where machine learning, or statistics, needs to enter, because sophisticated algorithms are required to identify the relevant parameters that drive, for example, the quality of the product. However, is data quality still an issue? It is clear that with small amounts of data, single outliers or extreme values can affect the algorithms or statistical methods. Can “big data” overcome this problem? In this article we focus on some specific problems in the regression context and show that even if many parameters are measured, poor data quality can severely degrade the prediction performance of the methods.
en
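The abstract's central claim — that a single outlier can noticeably shift the predictions of a classical regression fit, while a robust estimator stays stable — can be illustrated with a minimal simulated sketch. This is not the authors' analysis; the data, the gross-error value, and the choice of the Theil–Sen estimator as the robust comparison are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated process data: one relevant parameter x, response y = 2x + noise
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(0.0, 1.0, 50)

# A single gross error (e.g. a sensor glitch) in the response
y_bad = y.copy()
y_bad[0] = 200.0

# Classical least-squares fits on clean and contaminated data
b1_clean, b0_clean = np.polyfit(x, y, 1)
b1_bad, b0_bad = np.polyfit(x, y_bad, 1)

# The least-squares prediction at the mean of x shifts because of one point
x_new = x.mean()
pred_clean = b1_clean * x_new + b0_clean
pred_bad = b1_bad * x_new + b0_bad
print(f"LS prediction at mean x: clean={pred_clean:.2f}, contaminated={pred_bad:.2f}")

# A robust alternative: the Theil-Sen slope (median of all pairwise slopes)
# is barely affected, since one bad point contaminates only a small
# fraction of the pairwise slopes.
def theil_sen_slope(x, y):
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    return np.median(slopes)

print(f"Theil-Sen slope on contaminated data: {theil_sen_slope(x, y_bad):.2f}")
```

With 50 observations, one contaminated response still moves the least-squares prediction by several units, while the median-of-pairwise-slopes estimate remains close to the true slope of 2.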
dc.language.iso
en
-
dc.subject
Data Management
en
dc.subject
Data Analytics
en
dc.subject
Model Integration
en
dc.subject
Cloud Computing
en
dc.subject
Blockchain
en
dc.title
Massive Data Sets – Is Data Quality Still an Issue?