I am a research associate in the School of Computer Science at the University of Manchester, where I work with Norman Paton and Alvaro Fernandes. I received my PhD from the University of Vienna under the guidance of Siegfried Benkner. My main research interests include data integration and data management at scale, autonomic computing, big data, and cloud computing. I work on the EPSRC project VADA (Value Added Data Systems - Principles and Architectures), developing new methods for data wrangling and integration.
PhD in Computer Science, with distinction, 2012
University of Vienna
MSc in Economics and Computer Science, with distinction, 2006
Technical University of Vienna
BSc in Economics and Computer Science, 2004
Technical University of Vienna
The paper demonstrates SOURCERY, a system supporting interactive, multi-criteria, user-driven source selection. The system will be demonstrated in Turin, Italy, in October 2018.
The paper presents a user-driven approach to source selection that seeks to identify the sources that are most fit for purpose. The approach employs a decision-support methodology that takes account of a user's context: end users tune their preferences by specifying the relative importance of different criteria, and the system finds a trade-off solution aligned with those preferences. The paper is available online and open access.
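The relative-importance idea lends itself to a simple sketch. The following is a minimal, hypothetical weighted-sum ranking of sources; it is not the SOURCERY implementation, and the criteria, weights, and scores are invented for illustration:

```python
# A minimal sketch of weighted multi-criteria source ranking, in the
# spirit of (but not taken from) SOURCERY; criteria names, scores, and
# the simple weighted-sum model are illustrative assumptions only.

def rank_sources(sources, weights):
    """Rank candidate sources by a weighted sum of per-criterion scores.

    sources -- dict: source name -> {criterion: score in [0, 1]}
    weights -- dict: criterion -> relative importance (should sum to 1)
    """
    def utility(scores):
        return sum(w * scores.get(c, 0.0) for c, w in weights.items())

    return sorted(sources.items(), key=lambda kv: utility(kv[1]), reverse=True)

# A user who values accuracy twice as much as freshness or completeness.
sources = {
    "source_a": {"accuracy": 0.9, "freshness": 0.4, "completeness": 0.7},
    "source_b": {"accuracy": 0.6, "freshness": 0.9, "completeness": 0.6},
}
weights = {"accuracy": 0.5, "freshness": 0.25, "completeness": 0.25}
for name, scores in rank_sources(sources, weights):
    print(name, scores)  # source_a ranks first (utility 0.725 vs 0.675)
```

Changing the weights re-ranks the sources, which is exactly the kind of preference tuning the demo lets end users explore interactively.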
The paper presents a methodology to fully automate an end-to-end data wrangling process by incorporating data context, which associates portions of a target schema with potentially relevant extensional data of types that are commonly available. Data context, i.e. instance-based evidence, together with data profiling, informs automation in several steps of the wrangling process, specifically matching, mapping validation, value format transformation, and data repair.
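As a concrete (and much simplified) illustration of how instance-based evidence can inform the matching step, consider the following sketch; the overlap measure and all example data are assumptions, not the paper's actual matcher:

```python
# A minimal sketch of how data context can inform schema matching:
# source columns whose values overlap strongly with reference data are
# likely matches for the corresponding target attribute. The overlap
# measure and all data below are invented for illustration.

def value_overlap(column_values, context_values):
    """Jaccard overlap between a source column and context (reference) data."""
    col, ctx = set(column_values), set(context_values)
    return len(col & ctx) / len(col | ctx) if col | ctx else 0.0

# Reference data associated with a target attribute, e.g. known postcodes.
context = {"M13 9PL", "OX1 2JD", "CB2 1TN", "EH8 9YL"}

candidate_columns = {
    "col_1": ["M13 9PL", "CB2 1TN", "SW1A 1AA"],   # looks like postcodes
    "col_2": ["red", "green", "blue"],             # clearly something else
}

# Pick the source column with the strongest instance-level evidence.
best = max(candidate_columns,
           key=lambda c: value_overlap(candidate_columns[c], context))
print(best)  # -> col_1
```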
The initial implementation of the VADA architecture was demonstrated at ACM SIGMOD in Chicago on 17th May 2017. A paper giving an overview of the demonstration appears in the conference proceedings, and a screencast has been produced that gives a flavour of the system.
A paper on a novel architecture for large-scale data analytics pipelines, supporting containers, different programming paradigms, and diverse storage solutions, was presented at the BeyondMR workshop at ACM SIGMOD in Chicago on 19th May 2017. The paper describes the system developed in the RETIDA project.
Cost-effective data wrangling requires that the individual wrangling steps benefit from automation wherever possible. In this project, I work on the challenge of automating these steps by informing them with data context: data from the domain in which wrangling is taking place.
UK EPSRC project grant, April 2015 - March 2020
Enabling real-time, large-scale analytical workflows for data-intensive science requires the integration of state-of-the-art technologies from Big Data (NoSQL solutions and data-intensive programming, streaming engines), HPC (parallel programming, multi-/many-core architectures, GPUs, clusters), data analytics (analytical models and algorithms), and workflow management (definition and orchestration). As the principal investigator of the project, I established the system architecture and developed a large-scale and massively parallel storage and execution platform for data pipelines.
FFG-Austrian Research Promotion Agency, July 2014 – December 2016
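To make the pipeline model above concrete, here is a minimal sketch of a chain of data-parallel stages in Python. It is an illustrative assumption only; the project's actual platform API is more elaborate and is not reproduced here:

```python
# A loose sketch of a stage-based data pipeline, assuming a simple
# chain of data-parallel stages; not the project's actual platform.
from multiprocessing import Pool

def parse(record):
    """Parse a CSV record into a list of floats."""
    return [float(x) for x in record.split(",")]

def spread(values):
    """Toy analytical step: range of the observed values."""
    return max(values) - min(values)

class Pipeline:
    """A chain of stages; each stage is mapped over the data in parallel."""
    def __init__(self, *stages):
        self.stages = stages

    def run(self, data, workers=4):
        with Pool(workers) as pool:
            for stage in self.stages:
                data = pool.map(stage, data)  # data-parallel per stage
        return data

if __name__ == "__main__":
    records = ["1.0,2.5,0.5", "3.0,3.5,2.0"]
    print(Pipeline(parse, spread).run(records))  # [2.0, 1.5]
```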
VPH-Share is building a safe, online facility in which medical simulation developers can produce workflows (chains of processing tasks) that allow raw medical data to be refined into meaningful diagnostic and therapeutic information. I have been the lead architect and developer of the VPH-Share data infrastructure, a scalable and distributed data mediation engine that transparently integrates data from multiple European hospitals.
European Commission, ICT-2009.5.3-269978, March 2011 - February 2015
A complete IT infrastructure has been developed for the management and processing of the vast amounts of heterogeneous data acquired during diagnosis. I contributed to the design and development of @neuInfo, the @neurIST data infrastructure.
European Commission, IST-2004-027703, January 2006 – March 2010
I have also contributed to other H2020, FP7, FP6, ERA-NET, and Austrian research projects.
I am teaching the following courses at the University of Vienna:
Past teaching activities:
Thu, Mar 22, 2018, VADA project summit
Fri, Sep 15, 2017, VADA project summit
Fri, Sep 8, 2017, EDBT summer school
Thu, May 26, 2016, VADA project summit
Thu, Sep 17, 2015, IDC SEE Forum 2015
Tue, Jun 16, 2015, Big Data for Business Summer School
Wed, Jun 10, 2015, OCG Symposium 2015, Workshop Cloud & Big Data
Thu, May 21, 2015, IDC Datahub 2015
Thu, Nov 20, 2014, Salzburger Data Science Symposium
Wed, Nov 19, 2014, Urban Future
The core of the RETIDA system has been developed at the AIT mobility department to support the analysis of different types of trajectory data. I extended the data analytics pipeline framework with support for massively parallel programming paradigms and Big Data storage systems, and evolved the architecture towards real-time analytics of very large data sets. The framework has been successfully used to conduct data analyses for several clients and research projects.
VCE is a Cloud framework for exposing virtual appliances as Web services. VCE follows the Software as a Service (SaaS) model and relies on the concept of virtual appliances to provide a common set of generic interfaces to the user while hiding the details of the underlying software and hardware infrastructure. VCE is the software project underlying my PhD and has proven to provide a stable core codebase for several European research projects, including CPAMMS and VPH-Share. Its core technology has been deployed and in service at more than 10 sites in Europe for several years.
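To illustrate the idea of a common set of generic interfaces over heterogeneous appliances, here is a hypothetical sketch in Python; the class and method names are invented and do not reflect VCE's actual API:

```python
# A hypothetical sketch of the "generic interface" idea: every virtual
# appliance is exposed through the same service contract, whatever
# software it wraps. All names below are invented, not VCE's API.
from abc import ABC, abstractmethod

class ApplianceService(ABC):
    """Uniform service facade over a virtual appliance."""

    @abstractmethod
    def start(self) -> str:
        """Provision and boot the appliance; return its endpoint."""

    @abstractmethod
    def invoke(self, job: dict) -> dict:
        """Run an application inside the appliance."""

    @abstractmethod
    def stop(self) -> None:
        """Release the underlying compute resources."""

class BlastAppliance(ApplianceService):
    """Example appliance wrapping a bioinformatics tool."""
    def start(self):
        return "https://cloud.example.org/blast"
    def invoke(self, job):
        return {"status": "done", "job": job}
    def stop(self):
        pass

appliance = BlastAppliance()
print(appliance.start())
print(appliance.invoke({"sequence": "ACGT"}))
appliance.stop()
```

The point of such a contract is that clients only ever program against the facade, so the underlying software and hardware can change without breaking them.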
The DMP facilitates data management in the VPH-Share project. It supports data providers with tools to select the data to be exposed, semantically annotate it, and securely provide it to the VPH community by hosting it in the VPH-Share Cloud environment and exposing it via Web service, REST, and Linked Data interfaces. The DMP has been successfully used over many years to manage and access data from multiple hospitals.
@neuInfo enables access to clinical and epidemiological data distributed across public and project-specific protected databases. It provides complex data querying and mediation functionality for the @neurIST Grid infrastructure. Different data sources can be searched through a query user interface that provides a semantically unified view over an abstract data model linked to the actual physical data sources, allowing direct navigation through application-specific knowledge domains such as risk assessment, epidemiology, and case history. @neuInfo is based on Web service, semantic, and data integration technologies and has been used intensively over several years to connect multiple hospitals in Europe.
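To give a flavour of the mediation idea, here is a toy global-as-view sketch in Python; the unified attributes, mappings, and data are invented for illustration and do not reflect @neuInfo's actual model:

```python
# A toy global-as-view mediator, loosely in the spirit of @neuInfo:
# attributes of a unified view are mapped onto each source's local
# schema, and a query is answered by translating it per source and
# merging the results. All names, mappings, and data are invented.

MAPPINGS = {
    "hospital_a": {"patient_age": "age", "risk_score": "risk"},
    "hospital_b": {"patient_age": "AgeYears", "risk_score": "RiskIndex"},
}

SOURCES = {
    "hospital_a": [{"age": 54, "risk": 0.7}],
    "hospital_b": [{"AgeYears": 61, "RiskIndex": 0.9}],
}

def query_unified(attributes):
    """Answer a query over the unified view: rewrite the requested
    attributes into each source's schema and merge the rows."""
    results = []
    for source, mapping in MAPPINGS.items():
        for row in SOURCES[source]:
            results.append({a: row[mapping[a]] for a in attributes})
    return results

print(query_unified(["patient_age", "risk_score"]))
# -> [{'patient_age': 54, 'risk_score': 0.7},
#     {'patient_age': 61, 'risk_score': 0.9}]
```

The user only ever sees the unified attribute names; the per-source translation is what lets the same query run transparently against differently structured hospital databases.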