The paper presents a methodology to fully automate an end-to-end data wrangling process by incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Data context, i.e., instance-based evidence, together with data profiling, informs automation in several steps of the wrangling process, specifically matching, mapping validation, value format transformation, and data repair.
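To give a flavour of how instance-based evidence can inform matching, here is a minimal sketch in which a source column is matched to a target attribute by the overlap of its values with reference instance data. The function names, the example data, and the threshold are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of instance-based matching: score how well a source
# column fits a target attribute by value overlap with the data context.

def instance_overlap_score(source_values, context_values):
    """Jaccard overlap between a source column and reference instances."""
    source = {v.strip().lower() for v in source_values}
    context = {v.strip().lower() for v in context_values}
    if not source or not context:
        return 0.0
    return len(source & context) / len(source | context)

def match_column(source_columns, context_values, threshold=0.3):
    """Pick the source column whose values best overlap the data context."""
    best_name, best_score = None, 0.0
    for name, values in source_columns.items():
        score = instance_overlap_score(values, context_values)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Example: match a hypothetical target attribute "city" using known city names.
columns = {
    "col_a": ["Manchester", "Leeds", "York"],
    "col_b": ["red", "blue", "green"],
}
known_cities = ["manchester", "london", "leeds"]
print(match_column(columns, known_cities))  # -> col_a
```

The same overlap evidence could, in principle, also support downstream steps such as validating candidate mappings.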
The paper presents a user-driven approach to source selection that seeks to identify the sources that are most fit for purpose. The approach employs a decision support methodology that takes account of a user's context, allowing end users to tune their preferences by specifying the relative importance of different criteria, and thereby to obtain a trade-off solution aligned with those preferences. The paper is available online and open access.
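One simple way to picture preference-driven source selection is a weighted combination of per-criterion scores, where the weights encode the relative importance the user assigns to each criterion. This is only an illustrative sketch; the criteria names, weights, and scoring scheme are assumptions, not the paper's actual methodology.

```python
# Illustrative sketch: rank candidate sources by a weighted sum of
# normalised criteria scores reflecting user preferences.

def weighted_score(criteria_scores, weights):
    """Combine per-criterion scores (each in [0, 1]) using user weights."""
    total = sum(weights.values())
    return sum(weights[c] * criteria_scores[c] for c in weights) / total

def rank_sources(sources, weights):
    """Order sources by fitness for the user's stated preferences."""
    return sorted(sources,
                  key=lambda s: weighted_score(s["scores"], weights),
                  reverse=True)

# A hypothetical user who values completeness twice as much as freshness:
sources = [
    {"name": "src1", "scores": {"completeness": 0.9, "freshness": 0.2}},
    {"name": "src2", "scores": {"completeness": 0.5, "freshness": 0.9}},
]
weights = {"completeness": 2.0, "freshness": 1.0}
ranked = rank_sources(sources, weights)
print([s["name"] for s in ranked])  # -> ['src1', 'src2']
```

Changing the weights changes the trade-off: with freshness weighted more heavily, src2 would rank first instead.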
The initial implementation of the VADA architecture was demonstrated at ACM SIGMOD in Chicago on 17th May 2017. A paper giving an overview of the demonstration appears in the conference proceedings, and a screencast has been produced that gives a flavour of the system.
A paper on a novel architecture for large-scale data analytics pipelines, supporting containers, different programming paradigms, and a range of storage solutions, was presented at BeyondMR @ ACM SIGMOD in Chicago on 19th May 2017. The paper describes the system developed in the RETIDA project.
The meetup group Austrian Big Data Forum is organising a Barcamp event in Vienna, with short lightning talks and extended discussions on topics proposed by the participants. The event is hosted by the Austrian Computer Society (OCG) and takes place on May 30, 2017.
I have started a new position at the School of Computer Science, University of Manchester, to work on the project VADA: Value Added Data Systems - Principles and Architectures. My main responsibilities will be research on data integration and wrangling at scale.