To improve the parallel processing of big data, NWO has funded the project ‘Evolutionary changes in Distributed Analysis’ (ECiDA) within its Commit2Data programme. The research partners in this new project, CWI among them, will develop a platform to support ‘evolving data-intensive application pipelines’, making big data analysis safer, better, faster and adaptable in real time. Commit2Data is a national public-private research and innovation programme focusing on data science and technology relevant to the Dutch top sectors; it aims to maintain and strengthen the Dutch top-5 position in Big Data.
Distributed server clusters are widely used to analyse voluminous collections of data. These clusters substantially speed up large-scale data analysis by dividing data collections among the available machines, where they can be processed in parallel. Processing of these data typically takes place in pipelines. A pipeline consists of a collection of smaller, concrete analysis steps, each implemented as an independent piece of software; the steps execute sequentially or in parallel and produce and consume each other’s partial analysis results. The distributed data processing platform Spark, for instance, has become a de facto standard in the world of large-scale data processing. The data processing pipelines for such platforms are composed at design time and then submitted to a central “master” component, which distributes the code among several worker nodes.
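The idea of a pipeline whose steps run over partitions of the data in parallel can be illustrated with a minimal sketch in plain Python. It does not use Spark or any ECiDA code: the step names (`tokenize`, `count_words`, `merge_counts`), the `run_pipeline` helper and the sample data are all hypothetical, and a thread pool merely stands in for the worker nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def tokenize(line):
    """Step 1: split a line of text into lowercase words."""
    return line.lower().split()


def count_words(words):
    """Step 2: consume the words produced by step 1 and count them."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Step 3: combine two partial results into one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def run_pipeline(partitions, workers=2):
    """Run steps 1 and 2 on each partition in parallel, then merge (step 3).

    The thread pool is only a stand-in for a cluster's worker nodes.
    """
    def process_partition(lines):
        words = [w for line in lines for w in tokenize(line)]
        return count_words(words)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_partition, partitions))
    return reduce(merge_counts, partials, {})


# Hypothetical data, already divided into two partitions.
partitions = [["big data big"], ["data pipelines"]]
result = run_pipeline(partitions)
print(result)
```

Note that the pipeline is fixed once `run_pipeline` is called: changing a step means editing the code and running everything again, which is the design-time composition described above.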
In many practical situations, the analysis application is not static but evolves over time: developers add new processing steps, data scientists adjust the parameters of their algorithms, and quality assurance discovers new bugs. Currently, an update of a pipeline proceeds as follows: the developers patch their code, re-submit the updated version, and finally restart the entire pipeline. However, restarting a processing pipeline safely is difficult: the intermediate state is lost and needs to be re-computed, some data need to be reprocessed, and the cost of restarting may not be trivial, especially for real-time streaming components that require 24/7 availability.
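Why a restart is costly for a stateful step can be shown with a small, purely illustrative sketch. The `RunningMean` step, the `run` helper and the readings are hypothetical and are not taken from Spark or from the ECiDA project; they only demonstrate that a freshly started step has lost its intermediate state and must reprocess earlier data to recover it.

```python
class RunningMean:
    """A stateful streaming step: the mean of all readings seen so far."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def process(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


def run(step, readings):
    """Feed a sequence of readings through a step and collect its outputs."""
    return [step.process(r) for r in readings]


history = [10.0, 20.0, 30.0]  # readings processed before the update

# The pipeline has been running and has accumulated state.
step = RunningMean()
run(step, history)

# Restart after a patch: a new instance starts with empty state,
# so the next reading is averaged over nothing but itself.
restarted = RunningMean()
mean_after_restart = restarted.process(40.0)

# Recovering the correct answer requires reprocessing the old data first.
recovered = RunningMean()
run(recovered, history)
mean_after_replay = recovered.process(40.0)

print(mean_after_restart, mean_after_replay)
```

In this toy case the replay is three readings; for a long-running streaming component the history to be reprocessed can be large, and the component is unavailable while it catches up.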
Alexander Lazovik of the Rijksuniversiteit Groningen is the PI of the ECiDA project and Farhad Arbab of CWI and Leiden University is the Co-PI, with TNO Groningen, Vitens N.V., and Anchormen Datascience & AI B.V. as private partners. They will develop a platform to support evolving data-intensive application pipelines without the need to restart them when the requirements change, e.g., as new data sources or algorithms become available, or when partial analysis results trigger modification of the processing pipelines. The industrial partners in this project represent three top sectors: water treatment, life sciences, and HTSM/Smart Industry. They provide different industrial case studies in which the ECiDA partners will apply the tools and techniques developed in this project and evaluate their effectiveness.