VADA: an architecture for end user informed data preparation

Konstantinou, Nikolaos; Abel, Edward; Bellomarini, Luigi; Bogatu, Alex Teodor; Civili, Cristina; Irfanie, Endri; Köhler, Martin; Lacramioara, Mazilu; Sallinger, Emanuel; Fernandes, Alvaro A. A.; Gottlob, Georg; Keane, John A.; Paton, Norman W.

doi:10.1186/s40537-019-0237-9

DC Field

Value

Language

dc.contributor.author

Konstantinou, Nikolaos

dc.contributor.author

Abel, Edward

dc.contributor.author

Bellomarini, Luigi

dc.contributor.author

Bogatu, Alex Teodor

dc.contributor.author

Civili, Cristina

dc.contributor.author

Irfanie, Endri

dc.contributor.author

Köhler, Martin

dc.contributor.author

Lacramioara, Mazilu

dc.contributor.author

Sallinger, Emanuel

dc.contributor.author

Fernandes, Alvaro A. A.

dc.contributor.author

Gottlob, Georg

dc.contributor.author

Keane, John A.

dc.contributor.author

Paton, Norman W.

dc.date.accessioned

2023-01-30T15:15:40Z

dc.date.available

2023-01-30T15:15:40Z

dc.date.issued

2019

dc.identifier.citation

Konstantinou, N., Abel, E., Bellomarini, L., Bogatu, A. T., Civili, C., Irfanie, E., Köhler, M., Lacramioara, M., Sallinger, E., Fernandes, A. A. A., Gottlob, G., Keane, J. A., & Paton, N. W. (2019). VADA: an architecture for end user informed data preparation. Journal Of Big Data, 6(74). https://doi.org/10.1186/s40537-019-0237-9

dc.identifier.uri

http://hdl.handle.net/20.500.12708/143242

dc.description.abstract

Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden. Results: This paper presents an architecture in which the data scientist need only describe the intended outcome of the data preparation process, leaving the software to determine how best to bring about the outcome. Key wrangling decisions on matching, mapping generation, mapping selection, format transformation and data repair are taken by the system, and the user need only provide: (i) the schema of the data target; (ii) partial representative instance data aligned with the target; (iii) criteria to be prioritised when populating the target; and (iv) feedback on candidate results. To support this, the proposed architecture dynamically orchestrates a collection of loosely coupled wrangling components, in which the orchestration is declaratively specified and includes self-tuning of component parameters. Conclusion: This paper describes a data preparation architecture that has been designed to reduce the cost of data preparation through the provision of a central role for automation. An empirical evaluation with deep web and open government data investigates the quality and suitability of the wrangling result, the cost-effectiveness of the approach, the impact of self-tuning, and scalability with respect to the number of sources.

dc.relation.ispartof

Journal Of Big Data

dc.subject

Hardware and Architecture

dc.subject

Information Systems

dc.subject

Data integration

dc.subject

Computer Networks and Communications

dc.subject

Data quality

dc.subject

Information Systems and Management

dc.subject

Data preparation

dc.title

VADA: an architecture for end user informed data preparation

dc.type

Artikel

dc.type

Article

dc.contributor.affiliation

University of Manchester, United Kingdom of Great Britain and Northern Ireland (the)

dc.contributor.affiliation

Samsung (United Kingdom), United Kingdom of Great Britain and Northern Ireland (the)

dc.type.category

Original Research Article

tuw.container.volume

tuw.container.issue

tuw.journal.peerreviewed

true

tuw.peerreviewed

true

wb.publication.intCoWork

International Co-publication

tuw.researchTopic.id

tuw.researchTopic.name

Logic and Computation

tuw.researchTopic.value

100

dcterms.isPartOf.title

Journal Of Big Data

tuw.publication.orgunit

E192-02 - Forschungsbereich Databases and Artificial Intelligence

tuw.publisher.doi

10.1186/s40537-019-0237-9

dc.identifier.eissn

2196-1115

dc.description.numberOfPages

tuw.author.orcid

0000-0001-6863-0162

tuw.author.orcid

0000-0003-1604-5097

tuw.author.orcid

0000-0002-4357-3509

wb.sci

false

wb.sciencebranch

Computer Sciences

wb.sciencebranch.oefos

1020

wb.facultyfocus

Logic and Computation (LC)

wb.facultyfocus

Logic and Computation (LC)

wb.facultyfocus.faculty

E180

item.cerifentitytype

Publications

item.fulltext

no Fulltext

item.openairetype

research article

item.openairecristype

http://purl.org/coar/resource_type/c_2df8fbb1

item.grantfulltext

none

crisitem.author.dept

University of Manchester

crisitem.author.dept

Bank of Italy

crisitem.author.dept

Samsung (United Kingdom)

crisitem.author.dept

E192-02 - Forschungsbereich Databases and Artificial Intelligence

crisitem.author.dept

E192-02 - Forschungsbereich Databases and Artificial Intelligence

crisitem.author.orcid

0000-0001-6863-0162

crisitem.author.orcid

0000-0003-1604-5097

crisitem.author.orcid

0000-0002-4357-3509

crisitem.author.parentorg

E192 - Institut für Logic and Computation

crisitem.author.parentorg

E192 - Institut für Logic and Computation

Appears in Collections:

Article
