Recent Posts

The paper describes the SOURCERY system, which supports interactive, multi-criteria, user-driven source selection. The system will be demonstrated in Turin, Italy, in October 2018.

The paper presents a user-driven approach to source selection that seeks to identify the sources that are most fit for purpose. The approach employs a decision-support methodology to take account of a user's context, allowing end users to tune their preferences by specifying the relative importance of different criteria and seeking a trade-off solution aligned with those preferences. The paper is available online and open access.

The paper presents a methodology to fully automate an end-to-end data wrangling process by incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Data context, i.e., instance-based evidence, together with data profiling, paves the way to inform automation in several steps within the wrangling process, specifically matching, mapping validation, value format transformation, and data repair.

The initial implementation of the VADA architecture was demonstrated at ACM SIGMOD in Chicago on 17th May 2017. The paper that gives an overview of the demonstration is in the proceedings of the conference, and a screencast has been produced that gives a flavour of the system.

A paper on a novel architecture for large-scale data analytics pipelines, supporting containers, different programming paradigms, and storage solutions, was presented at BeyondMR @ ACM SIGMOD in Chicago on 19th May 2017. The paper describes the system developed in the RETIDA project.

Selected Publications

Data scientists are usually interested in a subset of sources with properties that are most aligned to intended data use. The SOURCERY system supports interactive, multi-criteria, user-driven source selection. SOURCERY allows a user to identify the criteria they consider important and to indicate their relative importance, and seeks a source selection result aligned with the user-supplied criteria preferences. The user is given an overview of the properties of the selected sources, along with visual analyses contextualizing the result in relation to what is theoretically possible and what is possible given the set of available sources. The system also enables a user to perform iterative fine-tuning interactively, to explore how changes to preferences may impact results.
In Proceedings of CIKM, 2018
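The idea can be pictured with a minimal sketch, assuming a simple weighted-sum scoring scheme; the criterion names, weights, and scoring method below are illustrative assumptions, not the actual SOURCERY implementation.

```python
# A minimal sketch of user-weighted multi-criteria source scoring.
# Criterion names, weights, and the scoring scheme are illustrative.
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    criteria: dict  # criterion name -> raw score, higher is better

def score(source: Source, weights: dict) -> float:
    """Weighted sum of criterion values; weights encode user preferences."""
    return sum(weights.get(c, 0.0) * v for c, v in source.criteria.items())

sources = [
    Source("A", {"accuracy": 0.9, "freshness": 0.4, "coverage": 0.7}),
    Source("B", {"accuracy": 0.6, "freshness": 0.9, "coverage": 0.8}),
]

# The user states relative importance; re-running with new weights mirrors
# the iterative fine-tuning described above.
weights = {"accuracy": 0.5, "freshness": 0.2, "coverage": 0.3}
ranked = sorted(sources, key=lambda s: score(s, weights), reverse=True)
print([(s.name, round(score(s, weights), 2)) for s in ranked])
```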

Source selection is the problem of identifying a subset of available data sources that best meet a user's needs. In this paper we propose a user-driven approach to source selection that seeks to identify the sources that are most fit for purpose. The approach employs a decision-support methodology to take account of a user's context, allowing end users to tune their preferences by specifying the relative importance of different criteria and seeking a trade-off solution aligned with those preferences. The approach is extensible to incorporate diverse criteria, not drawn from a fixed set, and solutions can use a subset of the data from each selected source, rather than requiring that sources are used in their entirety or not at all.
In Information Sciences, Elsevier, 2018
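To illustrate the last point, here is a toy sketch of selection at sub-source granularity, assuming a per-tuple utility score; the greedy strategy and all names are illustrative assumptions, not the algorithm from the paper.

```python
# Hypothetical sketch: greedily take the best-scoring tuples across sources
# until a target volume is reached, so a solution may use only part of a source.
def select(sources, target_size):
    """sources: {name: [(tuple_id, utility), ...]}; returns chosen tuples."""
    pool = [(u, name, t) for name, tuples in sources.items() for t, u in tuples]
    pool.sort(reverse=True)            # highest utility first
    return {(name, t) for _, name, t in pool[:target_size]}

sources = {
    "A": [("a1", 0.9), ("a2", 0.3)],
    "B": [("b1", 0.7), ("b2", 0.6)],
}
print(select(sources, target_size=3))  # uses only part of source A
```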

In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence together with data profiling paves the way to inform automation in several steps within the wrangling process, specifically, matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data showing substantial improvements in the results of automated wrangling.
In IEEE International Conference on Big Data (IEEE Big Data 2017), 2017
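As a rough sketch of how instance-based evidence can drive matching, the following compares a source column against reference values held as data context; the attribute names, values, and threshold are illustrative assumptions rather than the paper's actual components.

```python
# Instance-based matching sketch: a source column is matched to a target
# attribute when its values overlap with reference values (data context).
def overlap(a, b):
    """Jaccard overlap between two value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Data context: extensional evidence for target attributes, e.g. from open data.
context = {"city": {"Manchester", "Leeds", "York"},
           "postcode": {"M1 1AA", "LS1 4AP"}}

source_column = ["Manchester", "York", "Sheffield"]

matches = {attr: overlap(source_column, vals) for attr, vals in context.items()}
best = max(matches, key=matches.get)
if matches[best] > 0.2:  # illustrative threshold
    print(f"column matches target attribute '{best}' ({matches[best]:.2f})")
```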

In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user’s priorities, and supports data scientists with diverse skill sets.
In SIGMOD’17
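One way to picture the feedback-driven refinement mentioned above is a simple confidence update over candidate mappings; the update rule below is an illustrative assumption, not the VADA implementation.

```python
# Illustrative sketch: user verdicts on a few results adjust the confidence
# of the candidate mappings that produced them.
candidates = {"m1": 0.8, "m2": 0.6}       # mapping -> confidence
feedback = [("m1", False), ("m2", True)]  # user marks results wrong/right

for mapping, ok in feedback:
    # Nudge confidence toward the verdict; 0.3 is an illustrative learning rate.
    target = 1.0 if ok else 0.0
    candidates[mapping] += 0.3 * (target - candidates[mapping])

print(candidates)  # m2 now outranks m1, so its results are preferred
```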

The centerpiece is a framework of containerized execution units, and the management thereof, for satisfying the diverse requirements of data analytics pipelines and their stages. Containers not only ease the distribution and deployment of applications but, more importantly, enable an efficient synthesis of different stage implementation variants aimed at exploiting heterogeneous computing resources. Consequently, this approach allows the infrastructure to utilize mainstream data- and compute-intensive techniques and paradigms to achieve the goal of efficient pipeline execution.
In SIGMOD’17 Workshops
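The variant-selection idea can be sketched as follows, with each stage offering several containerized implementations and a scheduler choosing one that fits the node's resources; all class and field names here are hypothetical.

```python
# Sketch: pick a containerized stage implementation matching node resources.
from dataclasses import dataclass

@dataclass
class Variant:
    image: str      # container image implementing this stage
    needs_gpu: bool

@dataclass
class Node:
    has_gpu: bool

def pick_variant(variants, node):
    """Prefer a variant matching the node's capabilities, else fall back to CPU."""
    for v in variants:
        if v.needs_gpu == node.has_gpu:
            return v
    return next(v for v in variants if not v.needs_gpu)

variants = [Variant("stage:gpu", True), Variant("stage:cpu", False)]
print(pick_variant(variants, Node(has_gpu=False)).image)  # -> stage:cpu
```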

This paper presents a self-configuring adaptive framework that optimizes resource utilization for scientific applications on top of Cloud technologies. The proposed approach relies on the concept of utility, i.e., a measure of usefulness, and leverages a well-established principle from autonomic computing, namely the MAPE-K loop, to adaptively configure scientific applications. The proposed framework self-configures the layers by evaluating monitored resources, analyzing their state, and generating an execution plan on a per-job basis.
SpringerOpen Journal of Cloud Computing, 2014
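For readers unfamiliar with MAPE-K, the sketch below shows one iteration of the loop, with Monitor, Analyze, Plan, and Execute phases sharing a knowledge base; the utility formula and thresholds are illustrative assumptions, not the paper's concrete model.

```python
# Compact sketch of the MAPE-K pattern: Monitor, Analyze, Plan, Execute
# phases over a shared knowledge base. All values are illustrative.
knowledge = {"workers": 2, "target_utility": 0.8}

def monitor():
    # Stand-in for real resource monitoring.
    return {"cpu_load": 0.95, "queue_length": 40}

def analyze(metrics):
    # Utility: usefulness of the current configuration (illustrative formula).
    utility = 1.0 - metrics["cpu_load"] * min(1.0, metrics["queue_length"] / 50)
    return utility < knowledge["target_utility"]

def plan():
    return {"workers": knowledge["workers"] + 1}  # scale out one worker

def execute(change):
    knowledge.update(change)
    print(f"reconfigured: {knowledge['workers']} workers")

# One iteration of the loop, run on a per-job basis as described above.
if analyze(monitor()):
    execute(plan())
```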

This study analyses the innovative potential of Big Data technologies for the Austrian market, ranging from managing the data deluge to semantic and cognitive systems. Moreover, the study identifies emerging opportunities arising from the utilization of publicly available data, such as Open Government Data, and company-internal data, covering multiple domains.
Open access technical report for the Austrian Government, 2014

In this paper we present the VPH-Share data management platform, which enables the sharing of VPH-relevant datasets within the community on the basis of Cloud technologies. The platform aims to support the full data management life-cycle, starting from already available data.
In IEEE SKG’12

In this article we describe the Vienna Cloud Environment (VCE), a service-oriented Cloud infrastructure based on standard virtualization and Web service technologies for the provisioning of scientific applications, data sources, and scientific workflows as virtual appliances that hide the details of the underlying software and hardware infrastructure.
In IEEE ICAS’11

All Publications

A full list of my publications can be found here:

Projects

  • VADA: Value added data systems – principles and architecture

    Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this project, I work on the challenge of automating the steps in data wrangling by informing them with data context: data from the domain in which wrangling is taking place.
UK EPSRC project grant, April 2015 - March 2020

  • Retida - Real-Time Data Analytics in the Mobility Domain

    Enabling real-time, large-scale analytical workflows for data-intensive science requires the integration of state-of-the-art technologies from Big Data (NoSQL solutions and data-intensive programming, streaming engines), HPC (parallel programming, multi-/many-core architectures, GPUs, clusters), data analytics (analytical models and algorithms), and workflow management (definition and orchestration). As the principal investigator of the project, I established the system architecture and developed a large-scale and massively parallel storage and execution platform for data pipelines.
    FFG-Austrian Research Promotion Agency, July 2014 – December 2016

  • VPH-Share - Sharing for Healthcare

VPH-Share is building a safe, online facility in which medical simulation developers can produce workflows - chains of processing tasks - to allow raw medical data to be refined into meaningful diagnostic and therapeutic information. I have been the lead architect and developer of the VPH-Share data infrastructure, a scalable and distributed data mediation engine that transparently integrates data from multiple European hospitals.
    European Commission, ICT-2009.5.3-269978, March 2011 - February 2015

  • @neurIST

A complete IT infrastructure has been developed for the management and processing of the vast amount of heterogeneous data acquired during diagnosis. I contributed to the design and development of @neuInfo, the @neurIST data infrastructure.
    European Commission, IST-2004-027703, January 2006 – March 2010

  • Other projects

    Other H2020, FP7, FP6, ERA-NET and Austrian research projects I contributed to.

Teaching

I am teaching the following courses at the University of Vienna:

Past teaching activities:

Recent & Upcoming Talks

Software

  • Retida - Scalable and massively parallel data pipeline

The core of the Retida system has been developed at the AIT mobility department to support the analysis of different types of trajectory data. I extended the data analytics pipeline framework with support for massively parallel programming paradigms and Big Data storage systems, and developed the architecture towards real-time analytics of huge data sets. The framework has been successfully used to conduct data analyses for several clients and research projects.

  • VCE - The Vienna Cloud Environment

VCE is a Cloud framework for exposing virtual appliances as Web services. The VCE follows the Software as a Service (SaaS) model and relies on the concept of virtual appliances to provide a common set of generic interfaces to the user while hiding the details of the underlying software and hardware infrastructure. VCE is the software project that led to my PhD and has proven to provide a stable core codebase for several European research projects, including CPAMMS and VPH-Share. Its core technology has been deployed and in service at more than 10 sites in Europe for several years.

  • VPH-Share data management platform (DMP)

    The DMP facilitates data management in the VPH-Share project. The DMP supports the data providers with tools to select data to be exposed, semantically annotate the data and securely provide the data to the VPH community by hosting it in the VPH-Share Cloud environment and exposing it via web service, REST and Linked Data interfaces. The DMP has been successfully used over many years to manage and access data of multiple hospitals.

  • @neuInfo: accessing heterogeneous data through semantic annotation

@neuInfo enables access to clinical and epidemiological data distributed across public and project-specific protected databases. @neuInfo provides complex data querying and mediation functionality for the @neurIST Grid infrastructure. Different data sources can be searched using a query user interface that provides a semantically unified view on an abstract data model linked to the actual physical data sources, allowing direct navigation through application-specific knowledge domains such as risk assessment, epidemiology, and case history. @neuInfo is based on Web service, semantic, and data integration technologies and has been used intensively over several years to connect multiple hospitals in Europe. A toy sketch of the mediation idea follows this list.
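As promised above, here is a toy sketch of the mediation idea: a query over an abstract attribute is rewritten into one physical query per registered source and the results are unified; all source, table, and column names are hypothetical.

```python
# Toy sketch of semantic query mediation: abstract model attributes map to
# physical (source, table, column) triples; queries are rewritten per source.
mappings = {
    "patient_age": [("hospital_a", "patients", "age_years"),
                    ("hospital_b", "cases", "pat_age")],
}

def rewrite(attribute, predicate):
    """Produce one physical SQL query per source for an abstract attribute."""
    return [(src, f"SELECT {col} FROM {table} WHERE {col} {predicate}")
            for src, table, col in mappings[attribute]]

for source, sql in rewrite("patient_age", "> 60"):
    print(source, "->", sql)
```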

Contact