A containerized analytics framework for data and compute-intensive pipeline applications


The joint effort of scientific collaborations and the expanding data market creates demand for high-performance, data-intensive analytics infrastructures that can exploit the potential of heterogeneous multi-core architectures through dynamic and scalable execution environments. Contemporary approaches focus on developing efficient parallel application models, but lack the flexibility to integrate and use native or accelerator-based code efficiently. In this work, we present a novel approach to mending this shortcoming that offers seamless application integration into a highly versatile execution infrastructure. The centerpiece is a framework of containerized execution units, together with their management, that satisfies the diverse requirements of data analytics pipelines and their stages. Containers not only ease the distribution and deployment of applications but, more importantly, enable an efficient synthesis of different stage implementation variants aimed at exploiting heterogeneous computing resources. Consequently, this approach allows the infrastructure to use mainstream data- and compute-intensive techniques and paradigms to achieve efficient pipeline execution. We present our approach in the form of a requirement analysis, a multi-tier architecture description, and deployment scenarios based on our current prototype implementation.

In Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, ACM.