There is a close collaboration between the Database Architectures research group and MotherDuck: MotherDuck’s Amsterdam office started at CWI. Below you will find information about MotherDuck and a list of MSc projects.

MotherDuck is a startup with offices in the USA and Amsterdam, building a cloud service around the embedded analytics database system DuckDB. Because the database system is embedded, every client application has a local DuckDB, and MotherDuck enhances this by starting DuckDB instances in the cloud as well. The service aims to be the first personal Cloud Warehouse, allowing analysts to make the most of the Cloud while enjoying the interactivity of their local laptop. The novel concept is “hybrid query execution”, where queries can be executed partly on the client and partly in the cloud.

You will be working at MotherDuck Amsterdam, with Boaz Leskes as direct advisor and Peter Boncz as VU advisor. With all these topics you can expect to gain insights into how modern SaaS platforms are built, get hands-on experience with complicated query execution plans, and explore how to effectively test such a complex environment.

MSc projects

DuckLake is the new DuckDB data lake format that stores table data in blob storage while keeping the metadata in an ACID-compliant database rather than scattering metadata files across blob storage like traditional lake formats. This architectural choice gives DuckLake transactional guarantees and makes metadata operations much faster.

Currently, DuckLake supports table partitioning on user-specified columns, which allows DuckDB to prune entire files during query execution when the partition keys don't match the query predicates. The problem is that users have to manually choose these partition columns, and making the wrong choice means missing out on significant performance improvements.

The goal of this thesis is to make DuckLake's storage layout adaptive by mining historical query workloads to automatically discover effective partitioning and clustering strategies. The idea is to analyze which predicates appear frequently in queries - which columns get filtered on, which combinations of columns appear together in WHERE clauses, what the selectivity patterns look like - and use this information to automatically apply better physical data organization. This creates a feedback loop where the storage layer gets smarter over time as it learns from actual access patterns rather than relying on upfront human intuition.
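As a rough illustration of the mining step (a toy sketch, not DuckLake’s actual implementation: the query log, the regex-based predicate extraction, and the function name `suggest_partition_columns` are all invented for this example), one could count how often each column appears in WHERE-clause predicates and propose the most frequently filtered columns as partition keys:

```python
import re
from collections import Counter

def suggest_partition_columns(query_log, top_k=2):
    """Count columns used in filter predicates across a query log and
    return the top_k most frequently filtered columns."""
    # Toy extraction: grab everything after WHERE, then match patterns
    # like "col = ...", "col >= ...", "col BETWEEN ...", "col IN ...".
    where_clause = re.compile(r"\bWHERE\b(.*)", re.IGNORECASE | re.DOTALL)
    predicate_col = re.compile(
        r"([A-Za-z_]\w*)\s*(?:=|<|>|\bBETWEEN\b|\bIN\b)", re.IGNORECASE)
    counts = Counter()
    for sql in query_log:
        m = where_clause.search(sql)
        if m:
            counts.update(c.lower() for c in predicate_col.findall(m.group(1)))
    return [col for col, _ in counts.most_common(top_k)]

log = [
    "SELECT * FROM t WHERE region = 'EU' AND day >= DATE '2024-01-01'",
    "SELECT count(*) FROM t WHERE region = 'US'",
    "SELECT * FROM t WHERE day BETWEEN DATE '2024-01-01' AND DATE '2024-02-01'",
]
print(suggest_partition_columns(log))  # ['region', 'day']
```

A real system would of course use the parser’s own predicate representation and weigh in selectivity, not just frequency, but the feedback loop has this shape: observe predicates, rank columns, rewrite the physical layout.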

DuckDB has made its name as a super-fast analytics compute engine. DuckDB can read many formats, both locally and from S3, but its native storage format is optimized for local, fast SSDs. Modern Cloud Data Warehouses, MotherDuck among them, separate their storage layer from their compute layer. This separation allows scaling storage independently of compute and storing greater volumes of data on cheaper media such as S3. S3 is cheaper, but it also has significantly different read properties from local SSDs: reads and writes are slower, and small IO operations are extremely inefficient.

The goal of this thesis is to change DuckDB’s default storage format to make it more friendly to slower but more scalable storage. For example, one could make sure that all metadata blocks are grouped together so they can be fetched in a single read operation instead of many. Metadata is often read before the actual data, which allows for quick pruning. DuckDB currently spreads these metadata blocks around its file, since small local reads are fast, but fetching them from S3 can be expensive. As part of this thesis you will become familiar with the details of how DuckDB works and its pluggable storage subsystem, as well as with S3, the foundation of pretty much any cloud storage system.
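To make the IO argument concrete, here is a small sketch (the function name, the gap threshold, and the assumption that block ranges are known up front are all invented for illustration) of the classic remedy: coalescing many small metadata reads into a few large range requests, trading a little over-read for far fewer round trips to S3:

```python
def coalesce_ranges(ranges, max_gap=4096):
    """Merge byte ranges (offset, length) whose gap is at most max_gap,
    so many small reads become a few large ranged GET requests."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request (over-reading the gap bytes).
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, offset + length - prev_off))
        else:
            merged.append((offset, length))
    return merged

# Ten scattered 64-byte metadata blocks collapse into a single ranged read:
blocks = [(i * 1024, 64) for i in range(10)]
print(coalesce_ranges(blocks))  # [(0, 9280)]
```

On a local SSD the ten small reads are cheap; against S3, where each GET carries tens of milliseconds of latency, the single request wins even though it reads gap bytes it does not need.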

When executing SQL queries in MotherDuck, these queries get dynamically split between the local DuckDB instance and the MotherDuck server. When a query references a UDF that has been registered in a Python script on the client side, all data has to be transferred to the client to run this UDF, because MotherDuck currently does not support server-side UDF execution. This significantly limits the performance benefits of server-side execution.

This thesis aims to develop a UDF system that works seamlessly across both execution environments. Users should be able to define a UDF locally once and have it execute securely wherever the query optimizer determines it should run. This requires addressing both the sandboxing requirements for safe cloud execution and the usability requirements for a developer experience that allows users to declare their UDFs in Python just as they would in vanilla DuckDB.
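A minimal sketch of the shipping-and-sandboxing idea (everything here is hypothetical: the function names, the source-as-string transport, and the builtin allowlist are illustration only, and a production sandbox would additionally need process isolation, CPU and memory limits, and import restrictions):

```python
# A tiny allowlist of builtins the remote UDF is permitted to use.
ALLOWED_BUILTINS = {"abs": abs, "len": len, "min": min, "max": max, "round": round}

def run_udf(source, name, args):
    """'Server side': define the shipped UDF in a namespace whose builtins
    are restricted to the allowlist, then call it on the given arguments."""
    namespace = {"__builtins__": ALLOWED_BUILTINS}
    exec(source, namespace)          # defines the function in the sandbox
    return namespace[name](*args)

# 'Client side': the UDF travels to the server as source text.
udf_source = """
def normalize(x, lo, hi):
    return (x - lo) / (hi - lo)
"""

print(run_udf(udf_source, "normalize", (5.0, 0.0, 10.0)))  # 0.5
```

A UDF that reaches for a disallowed builtin such as `open` fails with a `NameError` inside the sandbox, which hints at the harder research question: drawing the line between a usable Python surface and a safe one.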

DuckDB's design as an embedded database means it returns memory to the operating system after query processing completes. This behavior is essential for embedded use cases, as applications that embed DuckDB need to reclaim memory for their own operations, and holding onto unused memory would be antisocial for a library component.

However, MotherDuck uses DuckDB differently. On MotherDuck servers, DuckDB runs as the primary process with dedicated resources rather than sharing memory with a host application. In this context, the embedded memory management strategy becomes counterproductive. Repeatedly releasing memory to the OS after each query and then immediately reallocating it for the next query creates significant overhead from unnecessary system calls and page table manipulation.

The goal of this project is to develop a DuckDB extension that adapts the memory allocator for server environments. Instead of returning memory to the operating system between queries, the allocator should maintain a memory pool that can be reused across query executions, reducing allocation overhead.
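The pooling idea can be sketched in a few lines (a Python stand-in for what would really be an allocator-level change in DuckDB’s C++ code; the class name, buffer size, and counters are invented for illustration): buffers released by a finished query go onto a free list instead of back to the OS, and the next query reuses them.

```python
class BufferPool:
    """Reuse fixed-size buffers across queries instead of freeing them."""

    def __init__(self, buffer_size=256 * 1024):
        self.buffer_size = buffer_size
        self.free = []    # buffers handed back by finished queries
        self.reused = 0   # allocations the pool avoided

    def acquire(self):
        if self.free:
            self.reused += 1
            return self.free.pop()            # reuse a pooled buffer
        return bytearray(self.buffer_size)    # otherwise allocate fresh

    def release(self, buf):
        self.free.append(buf)  # keep for the next query, don't free

pool = BufferPool()
for _query in range(3):        # three queries, each needing two buffers
    bufs = [pool.acquire() for _ in range(2)]
    for b in bufs:
        pool.release(b)
print(pool.reused)  # 4: only the first query's two buffers were freshly allocated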