Martin Koehler

Research Associate

School of Computer Science, University of Manchester

Biography

I am a research associate in the School of Computer Science at the University of Manchester where I work with Norman Paton and Alvaro Fernandes. I received my PhD from University of Vienna under the guidance of Siegfried Benkner. My main research interests include data integration and data management at scale, autonomic computing, big data and cloud computing. I am working in the EPSRC project VADA (Value added data systems - principles and architectures) on new methods for data wrangling and integration.

Interests

Data Management and Integration at Scale
(Distributed) Database Systems
Big Data and Cloud Computing
Autonomic Computing

Education

PhD in Computer Science, with distinction, 2012

University of Vienna
MSc in Economics and Computer Science, with distinction, 2006

Technical University of Vienna
BSc in Economics and Computer Science, 2004

Technical University of Vienna

SOURCERY: User Driven Multi-Criteria Source Selection

Data scientists are usually interested in a subset of sources with properties that are most aligned to intended data use. The SOURCERY system supports interactive multi-criteria user-driven source selection. SOURCERY allows a user to identify criteria they consider of importance and indicate their relative importance, and seeks a source selection result aligned to the user-supplied criteria preferences. The user is given an overview of the properties of the sources that are selected along with visual analyses contextualizing the result in relation to what is theoretically possible and what is possible given the set of available sources. The system also enables a user to interactively perform iterative fine-tuning to explore how changes to preferences may impact results.

E. Abel, J. Keane, N. W. Paton, A. A. A. Fernandes, M. Koehler, N. Konstantinou, N. A. Azuan, S. M. Embury

In Proceedings of CIKM, 2018

Details

User driven multi-criteria source selection

Source selection is the problem of identifying a subset of available data sources that best meet a user’s needs. In this paper we propose a user-driven approach to source selection that seeks to identify sources that are most fit for purpose. The approach employs a decision support methodology to take account of a user’s context, to allow end users to tune their preferences by specifying the relative importance between different criteria, looking to find a trade-off solution aligned with his/her preferences. The approach is extensible to incorporate diverse criteria, not drawn from a fixed set, and solutions can use a subset of the data from each selected source, rather than require that sources are used in their entirety or not at all.

E. Abel, J. Keane, N. W. Paton, A. A. A. Fernandes, M. Koehler, N. Konstantinou, J. C. Cortes Rios, N. A. Azuan, S. M. Embury

In Information Sciences, Elsevier, 2018

Details PDF

Data Context Informed Data Wrangling

In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence together with data profiling paves the way to inform automation in several steps within the wrangling process, specifically, matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data showing substantial improvements in the results of automated wrangling.

M. Koehler, A. Bogatu, C. Civili, N. Konstantinou, E. Abel, A. A. A. Fernandes, J. Keane, L. Libkin, N. W. Paton

In IEEE International Conference on Big Data (IEEE Big Data 2017) , 2017

Details PDF

The VADA Architecture for Cost-Effective Data Wrangling

In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user’s priorities, and supports data scientists with diverse skill sets.

N. Konstantinou, M. Koehler, E. Abel, C. Civili, B. Neumayr, E. Sallinger, A.A.A. Fernandes, G. Gottlob, J.A. Keane, L. Libkin, N.W. Paton

In SIGMOD’17

Details PDF Video Project

A containerized analytics framework for data and compute-intensive pipeline applications

The centerpiece is a framework of containerized execution units and management thereof for satisfying the diverse requirements of data analytics pipelines and its stages. Containers not only ease distribution and deployment of applications, but, more importantly enable an efficient synthesis of different stage implementation variants aimed towards exploiting heterogeneous computing resources. Consequently, this approach allows the infrastructure to utilize mainstream data and compute-intensive techniques and paradigms to achieve the goal of efficient pipeline execution.

Y. Kaniovskyi, M. Koehler, S. Benkner

In SIGMOD’17 Workshops

Details PDF Project

An adaptive framework for utility-based optimization of scientific applications in the cloud

This paper presents a self-configuring adaptive framework optimizing resource utilization for scientific applications on top of Cloud technologies. The proposed approach relies on the concept of utility, i.e., measuring the usefulness, and leverages the well-established principle from autonomic computing, namely the MAPE-K loop, in order to adaptively configure scientific applications. The proposed framework self-configures the layers by evaluating monitored resources, analyzing their state, and generating an execution plan on a per job basis.

M. Koehler

SpringerOpen Journal of Cloud Computing, 2014

Details PDF

#Big Data in #Austria

This study analyses the extraordinary, innovative potential of Big Data technologies for the Austrian market ranging from managing the data deluge to semantic and cognitive systems. Moreover, the study identifies emerging opportunities arising from the utilization of publicly available data, such as Open Government Data, and company internal data by covering multiple domains.

M. Koehler, M. Meir-Huber

Open access technical report for the Austrian Government, 2014

Details PDF

The VPH-Share Data Management Platform: Enabling Collaborative Data Management for the Virtual Physiological Human Community

In this paper we present the VPH-Share data management platform enabling sharing VPH-relevant datasets within the community on the basis of Cloud technologies. The data management platform aims at supporting the data management life-cycle but starting from already available data.

M. Koehler, R. Knight, S. Benkner, Y. Kaniovskyi, S. Wood

In IEEE SKG’12

Details PDF Project

VCE - A Versatile Cloud Environment for Scientific Applications

In this article we describe the Vienna Cloud Environment (VCE), a service-oriented Cloud infrastructure based on standard virtualization and Web service technologies for the provisioning of scientific applications, data sources, and scientific workflows as virtual appliances that hide the details of the underlying software and hardware infrastructure.

M. Koehler, S. Benkner

In IEEE ICAS’11

Details PDF Project

All Publications

A full list of my publications can be found here:

Projects

VADA: Value added data systems – principles and architecture
Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this project, I work on the challenge of automating the steps in data wrangling by informing them with data context: data from the domain in which wrangling is taking place.
UK EPRSC project grant, April 2015 - March 2020
Retida - Real-Time Data Analytics in the Mobility Domain
Enabling real-time, large-scale analytical workflows for data-intensive science requires the integration of state-of-the-art technologies from Big Data (NoSQL solutions and data-intensive programming, streaming engines), HPC (parallel programming, multi-/many-core architectures, GPUs, clusters), data analytics (analytical models and algorithms), and workflow management (definition and orchestration). As the principal investigator of the project, I established the system architecture and developed a large-scale and massively parallel storage and execution platform for data pipelines.
FFG-Austrian Research Promotion Agency, July 2014 – December 2016
VPH-Share - Sharing for Healthcare
VPH-Share is building a safe, online facility in which medical simulation developers can produce workflows - chains of processing tasks - to allow raw medical data to be refined into meaningful diagnostic and therapeutic information. I have been the lead architect and developer of the VPH-Share data infostructure, a scalable and distributed data mediation engine transparently integrating data from multiple European hospitals.
European Commission, ICT-2009.5.3-269978, March 2011 - February 2015
@neurIST
A complete IT infrastructure has been developeded for the management and processing of the vast amount of heterogeneous data acquired during diagnosis. I contributed to the design and development of @neuInfo, the @neurIST data infrastructure.
European Commission, IST-2004-027703, January 2006 – March 2010
Other projects
Other H2020, FP7, FP6, ERA-NET and Austrian research projects I contributed to.

Teaching

I am a teaching the following courses at University of Vienna:

VU Cloud Computing, 050127 VU 4.0 SWS, WS17/18, Topic: Scalable storage systems and data-intensive programming. Please contact me via mail or via the online forum.

Past teaching activities:

University of Vienna
University of Applied Sciences Technikum Wien

Big and Linked Data, , 2014, 2015.
Cloud Application Development, 2012 – 2015.
Cloud Infrastructure, 2012 – 2013.

University of Applied Sciences Wiener Neustadt

Cloud Computing and Software as a Service, 2011 – 2015.

University of Applied Sciences St. Poelten

Grid and Cloud Computing, 2012 – 2015.

Master and bachelor theses supervision in the areas of Big Data and Cloud Computing, University of Applied Sciences Technikum, 2015, 2014.

Recent & Upcoming Talks

More Talks

The VADA system: architecture and use cases
Thu, Mar 22, 2018, VADA project summit
Data context informed data wrangling
Fri, Sep 15, 2017, VADA project summit
Demonstration of the VADA system
Fri, Sep 8, 2017, EDBT summer school
Demonstration of the VADA data integration system
Thu, May 26, 2016, VADA project summit
Panel Discussion on Big Data
Thu, Sep 17, 2015, IDC SEE Forum 2015
Big Data: Storage and Analysis
Tue, Jun 16, 2015, Big Data for Business Summer School
RETIDA – Real Time data analytics in the mobility domain,
Wed, Jun 10, 2015, OCG Symposium 2015, Workshop Cloud & Big Data
The Hadoop ecosystem and its application in the mobility domain
Thu, May 21, 2015, IDC Datahub 2015
M. Koehler: Data Science and Big Data: Research Landscape and Impact on the Mobility Domain
Thu, Nov 20, 2014, Salzburger Data Science Symposium
Potential of Big Data for Smart Cities and Urban Transportation
Wed, Nov 19, 2014, Urban Future

Software

Retida - Scalable and massively parallel data pipeline
The core of the Retida system has been developed at the AIT mobility department to support the analysis of different types of trajectory data. I extended the data analytics pipeline framework with support for massively parallel programming paradigms and Big data storage systems and developed the architecture towards real-time analytics of huge data sets. The framework has been successfully used to conduct data analysis for several clients and research projects.
VCE - The Vienna Cloud Environment
VCE is a Cloud framework for exposing virtual appliances as Web services. The VCE follows the Software as a Service (SaaS) model and relies on the concept of virtual appliances to provide a common set of generic interfaces to the user while hiding the details of the underlying software and hardware infrastructure. VCE is the software project leading to my PhD and has been proven to provide a stable core codebase for several European research projects including CPAMMS and VPH-Share. It’s core technology has been deployed and been in service at more than 10 sites in Europe for several years.
VPH-Share data management platform (DMP)
The DMP facilitates data management in the VPH-Share project. The DMP supports the data providers with tools to select data to be exposed, semantically annotate the data and securely provide the data to the VPH community by hosting it in the VPH-Share Cloud environment and exposing it via web service, REST and Linked Data interfaces. The DMP has been successfully used over many years to manage and access data of multiple hospitals.
@neuInfo: accessing heterogeneous data through semantic annotation
@neuInfo enables access to clinical and epidemiological data distributed in public and project-specific protected databases. @neuInfo provides complex data querying and mediation functionality for the @neurIST Grid infrastructure. Different data sources can be searched using a query user interface providing a semantically unified view on an abstract data model that is linked to the actual physical data sources, allowing direct navigation through application specific knowledge domains like risk assessment, epidemiology and case history. @neuInfo is based on Web service, semantic, and data integration technologies and has been intensively used over several years to connect multiple hospitals in Europe.

Contact

koehler.martin@gmail.com
martin.koehler82
Room 2.102, Kilburn Building, University of Manchester, M13 9PL, UK
email for appointment

Martin Koehler

Research Associate

School of Computer Science, University of Manchester

Biography

Interests

Education

Recent Posts

A demonstration of user driven and multi-criteria source selection will be presented at the ACM CIKM conference

Our paper on user driven multi-criteria source selection has been accepted in the Information Sciences Journal, Elsevier

Our paper on data context informed data wrangling has been accepted at IEEE Big Data'17

VADA demo paper accepted at SIGMOD'17

RETIDA architecture paper accepted at BeyondMR@SIGMOD'17

Selected Publications

SOURCERY: User Driven Multi-Criteria Source Selection

User driven multi-criteria source selection

Data Context Informed Data Wrangling

The VADA Architecture for Cost-Effective Data Wrangling

A containerized analytics framework for data and compute-intensive pipeline applications

An adaptive framework for utility-based optimization of scientific applications in the cloud

#Big Data in #Austria

The VPH-Share Data Management Platform: Enabling Collaborative Data Management for the Virtual Physiological Human Community

VCE - A Versatile Cloud Environment for Scientific Applications

All Publications

Projects

Teaching

Recent & Upcoming Talks

Software

Contact