Integrating data science and relational systems

Data scientists have largely overlooked relational database systems, even though these could greatly help their research. The systems did not gain traction because combining them with analytical tools proved to be slow and cumbersome. Now CWI researcher Mark Raasveldt bridges that gap. In his thesis, he proposes several novel techniques that make database management systems easier to use and more efficient.

Publication date: 10-06-2020

The database research community has made tremendous strides in developing powerful database engines that allow for efficient analytical query processing. However, these powerful systems have gone largely unused by data scientists. This poor adoption is caused primarily by the state of database-client integration: current methods of combining databases with analytical tools are slow and cumbersome.

Instead, data scientists have opted to re-invent database systems by developing a zoo of data management alternatives. These systems perform similar tasks to classical database management systems, but have many of the problems that were solved in the database field decades ago.

Bridging the gap
CWI researcher Mark Raasveldt investigated how to bridge this gap by making database management systems easier to use and more efficient for these workloads. The aim was to facilitate an efficient and smooth integration of analytical tools and relational database management systems. Raasveldt’s research focused on the three primary methods for database-client integration: client-server connections, in-database processing, and embedding the database inside the client application.

Novel techniques
Raasveldt proposes several novel techniques that improve upon the state of the art. In his thesis, he demonstrates a new client-server protocol that is optimized for bulk transfer of large data sets. This allows for more efficient large-scale data analysis when using remote servers. He also showcases so-called vectorized user-defined functions, which improve in-database processing efficiency through vectorized execution.

Furthermore, Raasveldt describes MonetDBLite, an embedded version of the MonetDB database system that was also developed at CWI. MonetDBLite was developed for efficient integration with Python and R. “The techniques that we propose have all been integrated and tested in real database systems,” says Raasveldt. “This has demonstrated that our solutions are not just theoretical, but practically applicable as well.”
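
To give a feel for the idea behind vectorized user-defined functions mentioned above, here is a minimal sketch in plain Python. It is not the implementation from the thesis or from MonetDB; all names (scalar_udf, vectorized_udf, run_row_at_a_time, run_vectorized) and the chunk size of 1024 are assumptions chosen for illustration. It only shows the general principle: a row-at-a-time function is invoked once per value, while a vectorized function is invoked once per chunk of a column, so the call overhead is paid far less often.

```python
from typing import Callable, List, Tuple

VAT_RATE = 1.21  # hypothetical constant, used only to have something to compute


def scalar_udf(value: float) -> float:
    """Row-at-a-time function: handles a single value per call."""
    return value * VAT_RATE


def vectorized_udf(chunk: List[float]) -> List[float]:
    """Vectorized function: handles a whole chunk of a column per call."""
    return [value * VAT_RATE for value in chunk]


def run_row_at_a_time(
    column: List[float], udf: Callable[[float], float]
) -> Tuple[List[float], int]:
    """Apply udf to every value separately and count the invocations."""
    out: List[float] = []
    calls = 0
    for value in column:
        out.append(udf(value))
        calls += 1
    return out, calls


def run_vectorized(
    column: List[float],
    udf: Callable[[List[float]], List[float]],
    chunk_size: int = 1024,  # assumed chunk size, for illustration only
) -> Tuple[List[float], int]:
    """Apply udf to the column chunk by chunk and count the invocations."""
    out: List[float] = []
    calls = 0
    for start in range(0, len(column), chunk_size):
        out.extend(udf(column[start:start + chunk_size]))
        calls += 1
    return out, calls


prices = [float(i) for i in range(5000)]
row_result, row_calls = run_row_at_a_time(prices, scalar_udf)
vec_result, vec_calls = run_vectorized(prices, vectorized_udf)
```

Both variants produce the same output column, but the vectorized variant crosses the boundary between the engine and the user's function only a handful of times instead of once per row. In a real system the chunk would typically be handed over as an array rather than a Python list.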

DuckDB
In his thesis, Raasveldt also showcases DuckDB, a new data management system purpose-built for efficient and painless integration with Python and R (and other analytical tools). Raasveldt: “DuckDB incorporates all the lessons that we have learned investigating database-client integrations and creates an easy-to-use and highly efficient embedded database.”
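
What "embedded" means in practice can be sketched with a short example. Note the substitution: the snippet below does not use DuckDB. It uses SQLite through Python's standard sqlite3 module as a stand-in, purely because that module ships with Python. The table name, column names and values are made up for illustration. The point it shows is that an embedded database runs inside the client process, so there is no server to start and no network connection to set up.

```python
import sqlite3

# Open an in-memory database inside this Python process.
# SQLite is a stand-in here; it is not the system described in the article.
con = sqlite3.connect(":memory:")

# Hypothetical table and data, for illustration only.
con.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Run an analytical query and pull the result straight into Python objects.
rows = con.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor ORDER BY sensor"
).fetchall()
averages = dict(rows)

con.close()
```

The query result arrives as ordinary Python objects in the same process that issued the query, which is the kind of integration with analytical tools that the article describes as the goal of an embedded design.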

Raasveldt defended his PhD thesis Integrating Analytics with Relational Databases at Leiden University. He performed his research in CWI’s Database Architectures group, supervised by Stefan Manegold and Hannes Mühleisen. Raasveldt will continue his career at CWI as a postdoc, working on the further development of DuckDB.

In 2018, Raasveldt won the SIGMOD Student Research Competition for his work on embedded databases.