MSc Projects at CWI Amsterdam
We offer internal MSc projects with the Database Architectures group at CWI, located at Amsterdam Science Park.
Graph-related (DuckPGQ - Daniel)
Parsing is the first step in executing a query. Since every query must be parsed, this step needs to be efficient. DuckDB is in the process of switching from the previously adopted (and extended) PostgreSQL parser to its own custom, run-time extensible parser [1].
This new parser is a PEG-style parser, using recursive descent and backtracking to construct a concrete syntax tree which is then transformed into DuckDB’s own logical query plan. The initial phase of this research will involve profiling the existing parser to identify which specific SQL constructs cause the most backtracking. This analysis will provide a solid baseline and pinpoint the areas that would benefit most from optimization.
Due to backtracking, the PEG parser can take exponential time in the worst case. Converting it to a packrat parser would guarantee linear time through memoization: the intermediate result of each grammar rule at each input position is stored so that it is never recomputed. This introduces a fundamental trade-off between execution time and memory usage.
The goal of this thesis is to adapt the PEG parser into a packrat parser. The investigation will focus on quantifying the trade-off between improved parsing time and increased memory consumption across various workloads, with the aim of minimizing the performance gap with the legacy parser while managing memory overhead. This leads to several key research questions: Can a policy of selective memoization, guided by the initial profiling data, yield most of the performance benefits for a fraction of the memory cost? For memory-intensive workloads, can cache eviction strategies be applied to the memoization table to maintain performance while respecting a memory budget?
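As a minimal illustration of the packrat idea (names and structure here are invented, not DuckDB’s actual parser internals), memoization boils down to caching the outcome of applying a grammar rule at a given input position, so that backtracking never re-parses the same (rule, position) pair:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Packrat memoization sketch: cache the result of each (rule, input position)
// so backtracking reuses it in O(1) instead of re-running the rule.
struct MemoEntry {
    bool success;    // did the rule match at this position?
    size_t end_pos;  // input position after the match (valid only if success)
    // a real parser would also store the (partial) syntax tree node here
};

class MemoTable {
public:
    std::optional<MemoEntry> Lookup(uint32_t rule_id, size_t pos) const {
        auto it = cache.find(Key(rule_id, pos));
        if (it == cache.end()) {
            return std::nullopt;
        }
        return it->second;
    }
    void Store(uint32_t rule_id, size_t pos, MemoEntry entry) {
        cache[Key(rule_id, pos)] = entry;
    }
private:
    // assumes fewer than 2^16 grammar rules; pack (position, rule) into one key
    static uint64_t Key(uint32_t rule_id, size_t pos) {
        return (static_cast<uint64_t>(pos) << 16) | rule_id;
    }
    std::unordered_map<uint64_t, MemoEntry> cache;
};

// Wrapping a rule invocation: consult the cache first, otherwise parse and store.
template <class ParseFn>
MemoEntry ApplyRule(MemoTable &memo, uint32_t rule_id, size_t pos, ParseFn parse) {
    if (auto hit = memo.Lookup(rule_id, pos)) {
        return *hit;
    }
    MemoEntry result = parse(pos);  // recursive-descent parse of this rule
    memo.Store(rule_id, pos, result);
    return result;
}
```

Selective memoization and cache eviction, as raised in the research questions above, then amount to deciding which (rule, position) pairs are worth storing and which entries to drop under a memory budget.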
This project will be advised by Peter Boncz (boncz@cwi.nl) & Daniel ten Wolde (dljtw@cwi.nl) at CWI.
SQL/PGQ, part of the SQL:2023 standard, introduced a visual graph syntax to enable easier pattern matching and path-finding on property graph data stored in relational databases.
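To give a feel for this syntax, the sketch below runs a small SQL/PGQ pattern-matching query through DuckDB’s embedded C++ API. The schema, data, and property-graph definition are invented for illustration, and the exact DuckPGQ syntax (e.g. labels and key clauses) may differ slightly from what is shown here:

```cpp
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);  // in-memory database
    duckdb::Connection con(db);

    // Assumes the DuckPGQ community extension can be installed and loaded.
    con.Query("INSTALL duckpgq FROM community;");
    con.Query("LOAD duckpgq;");

    // Illustrative relational tables holding the property graph data.
    con.Query("CREATE TABLE Person (id BIGINT, name VARCHAR);");
    con.Query("CREATE TABLE Knows (src BIGINT, dst BIGINT);");
    con.Query("INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob');");
    con.Query("INSERT INTO Knows VALUES (1, 2);");

    // Register a property graph on top of the existing tables.
    con.Query("CREATE PROPERTY GRAPH social "
              "VERTEX TABLES (Person) "
              "EDGE TABLES (Knows SOURCE KEY (src) REFERENCES Person (id) "
              "                   DESTINATION KEY (dst) REFERENCES Person (id));");

    // SQL/PGQ 'visual' pattern matching: find the names of Alice's friends.
    auto result = con.Query(
        "SELECT g.friend "
        "FROM GRAPH_TABLE (social "
        "     MATCH (a:Person)-[k:Knows]->(b:Person) "
        "     COLUMNS (a.name AS person, b.name AS friend)) g "
        "WHERE g.person = 'Alice';");
    result->Print();
    return 0;
}
```

All of this syntax currently has to be wired into DuckDB’s parser at compile time, which is exactly the limitation this project removes.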
DuckPGQ, a popular DuckDB community extension, was one of the first implementations of the SQL/PGQ syntax. However, the current implementation carries significant technical debt because it directly modifies DuckDB’s core PostgreSQL-derived parser. This approach is brittle and creates a high maintenance burden, as upstream changes to the parser can break the extension. Furthermore, since the parser is modified at compile time, it prevents modularity and makes it impossible for other extensions to add their own custom syntax.
DuckDB’s new PEG parser solves this by being run-time extensible, allowing extensions to register their own grammar rules cleanly and independently.
The goal of this thesis is to migrate the DuckPGQ extension to this modern parsing framework. This involves three key objectives:
- Implement the complete SQL/PGQ grammar using the new PEG-style syntax.
- Rework the DuckPGQ extension to remove its dependency on the legacy parser and use the new run-time extensible parsing API.
- Evaluate the new implementation against the original by:
  - Analyzing parsing performance to quantify any speed differences.
  - Comparing the quality of syntax error messages to improve user experience.
Incremental view maintenance (OpenIVM - Ilaria)
Incremental View Maintenance (IVM) is about updating materialised views efficiently by applying just the new changes, rather than recomputing everything. DuckDB has an IVM extension (OpenIVM) that currently handles only basic cases (simple aggregations and some support for inner joins).
This project will extend DuckDB’s IVM capabilities to more complex SQL operations, making it more powerful and practical. You will implement support for features such as outer joins, window functions, DISTINCT aggregations, and perhaps even recursive queries in the OpenIVM module. This means when base table data changes, the view can be updated correctly for these advanced features without full recomputation. In tandem with adding capabilities, you will build a robust testing framework to validate that the incremental updates always match the results of a fresh query. This involves creating numerous test scenarios (e.g. different types of data updates, edge-case queries) and possibly automated test tools to ensure the IVM extension behaves correctly and efficiently. You may draw inspiration from recent research like DBSP (a framework for incremental computation) to guide your design, bringing DuckDB’s IVM in line with state-of-the-art approaches. The outcome will be an extension that significantly broadens DuckDB’s support for incremental view maintenance, making it a more attractive engine for real-world applications that rely on materialised views.
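As a concrete illustration of the core idea (independent of OpenIVM’s actual implementation), the sketch below maintains a COUNT/SUM-per-group view incrementally from a batch of base-table changes, where each change carries a +1/-1 multiplicity instead of triggering a full recomputation:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Materialised view: SELECT key, COUNT(*), SUM(val) FROM base GROUP BY key.
// Illustrative only - not OpenIVM code.
struct Aggregates {
    int64_t count = 0;
    int64_t sum = 0;
};

struct Delta {
    std::string key;
    int64_t val;
    int multiplicity;  // +1 for an inserted base row, -1 for a deleted one
};

void ApplyDeltas(std::unordered_map<std::string, Aggregates> &view,
                 const std::vector<Delta> &deltas) {
    for (const auto &d : deltas) {
        auto &agg = view[d.key];
        agg.count += d.multiplicity;
        agg.sum += d.multiplicity * d.val;
        if (agg.count == 0) {
            view.erase(d.key);  // the group disappeared from the base table
        }
    }
}
```

Outer joins, DISTINCT aggregations and window functions are harder precisely because such simple per-group arithmetic no longer suffices, which is where this project comes in.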
This project is implementation-heavy. It will involve a substantial amount of programming in C++ as you enhance the DuckDB engine. You’ll be crafting new mechanisms within the database optimizer to handle the incremental logic for the new operators. Careful attention to performance is important: your implementation should minimise overhead so that updating a view is much faster than recomputing it. You will also design and implement comprehensive tests, which is an excellent opportunity to learn good software engineering practices in a database context (ensuring reliability and correctness of complex systems). By completing this project, you will gain expertise in database internals, particularly in incremental processing techniques and benchmarking. It’s a hands-on systems project that will strengthen your coding skills and your understanding of how query processors can be made dynamic and efficient.
Expected Outcome: A DuckDB extension that supports advanced IVM operators (e.g. outer joins, window functions) and includes a comprehensive testing framework.
This project will be co-advised by Peter Boncz and Ilaria Battiston at CWI.
This project explores adaptive optimisation techniques to make IVM even smarter. The key idea is to allow DuckDB to adaptively decide how it maintains a materialised view based on context, such as the size of updates or system workload. For example, sometimes it might be faster to recompute a view from scratch (if a huge portion of the data changed) rather than apply a very large incremental difference. Or perhaps only some intermediate results should be materialised and reused, while others are recalculated each time. In this project, you will investigate strategies for adaptive materialisation: deciding when and what to cache or materialise during view maintenance. You might develop a cost model that the system uses at runtime to choose between incremental update vs full recompute, or between different ways of partial recomputation. You will prototype enhancements to DuckDB’s IVM module that implement these decisions automatically. This could involve tracking statistics about updates, evaluating different plan options, or even learning from past runs. The outcome will be a smarter IVM system that is able to adaptively switch materialization techniques and dynamically compile different incremental instructions based on the behaviour of the system.
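A minimal sketch of the kind of runtime decision such a cost model could make (the cost terms and the decision rule are invented for illustration):

```cpp
#include <cstddef>

// Crude adaptive choice between applying a delta incrementally and recomputing
// the view from scratch; all cost parameters are illustrative placeholders.
enum class MaintenancePlan { INCREMENTAL, FULL_RECOMPUTE };

MaintenancePlan ChoosePlan(size_t delta_rows, size_t base_rows,
                           double cost_per_delta_row, double cost_per_base_row) {
    double incremental_cost = delta_rows * cost_per_delta_row;
    double recompute_cost = base_rows * cost_per_base_row;
    // If a large fraction of the base data changed, full recomputation wins.
    return incremental_cost < recompute_cost ? MaintenancePlan::INCREMENTAL
                                             : MaintenancePlan::FULL_RECOMPUTE;
}
```

The interesting part is what feeds such a decision: statistics about the updates, alternative plans for partial recomputation, and possibly feedback from past runs, as described above.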
This project is research-heavy. You will be investigating open questions in the realm of adaptive query processing. The work will involve reading academic papers on incremental view maintenance and adaptive query optimisation, formulating hypotheses, and possibly inventing new methods for DuckDB to make optimisation decisions on the fly. You’ll conduct experiments to compare strategies (e.g. measure when incremental vs full recomputation is beneficial, or how much time is saved by adaptive materialisation of intermediate results). Success in this project means you’ll not only implement a proof-of-concept solution, but also derive insights that could be of interest to the broader database community. By the end, you’ll have deepened your understanding of database optimisers and gained experience in research methodology.
Expected Outcome: A prototype implementation of adaptive materialisation strategies in DuckDB’s IVM engine, with experimental evaluation. The project may result in a workshop paper on cost-based adaptive view maintenance.
Co-advised by Peter Boncz and Ilaria Battiston at CWI.
Modern query engines like DuckDB use internal logical plans - tree-like structures of operators such as joins, aggregates, and projections - to represent queries. This project tackles the non-trivial challenge of converting a logical plan back into an equivalent SQL query. You will explore how DuckDB’s query optimiser reshapes queries (for example, by unnesting subqueries or reordering joins) and develop techniques to generate SQL strings that faithfully represent even the most complex optimised plans, in a human-readable way. This will involve working with DuckDB’s internal logical operator trees and transforming them into Abstract Syntax Trees (ASTs), providing you with hands-on experience in query compiler design. The goal is to build a DuckDB extension in C++ that, given any logical plan (including tricky cases like subqueries, window functions, or recursive queries), produces a correct and human-readable SQL query. A successful outcome will not only be a useful tool for the database community (for tasks such as incremental view maintenance or debugging the query optimiser) but will also give you deep insight into the inner workings of database systems.
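To show the shape of the problem, here is a toy plan-to-SQL traversal with invented node types; DuckDB’s actual logical operator trees and expression representation are far richer than this:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy logical-plan nodes and a recursive "decompiler" back to SQL text.
struct PlanNode {
    virtual ~PlanNode() = default;
    virtual std::string ToSQL() const = 0;
};

struct TableScan : PlanNode {
    std::string table;
    explicit TableScan(std::string t) : table(std::move(t)) {}
    std::string ToSQL() const override {
        return table;
    }
};

struct Filter : PlanNode {
    std::unique_ptr<PlanNode> child;
    std::string predicate;
    std::string ToSQL() const override {
        // Wrap the child in a derived table so the output stays composable.
        return "(SELECT * FROM " + child->ToSQL() + " WHERE " + predicate + ") AS f";
    }
};

struct Projection : PlanNode {
    std::unique_ptr<PlanNode> child;
    std::vector<std::string> exprs;
    std::string ToSQL() const override {
        std::string cols;
        for (std::size_t i = 0; i < exprs.size(); i++) {
            cols += (i ? ", " : "") + exprs[i];
        }
        return "SELECT " + cols + " FROM " + child->ToSQL();
    }
};
```

The thesis work lies in everything this sketch glosses over: expression printing, aliasing, and correctly reconstructing constructs like subqueries, window functions, and recursive queries after the optimiser has reshaped them.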
This project is implementation-heavy. You will be expected to design and write high-quality C++ code, integrating it with DuckDB’s codebase. The work includes creating new algorithms to traverse and translate AST nodes into SQL text, handling edge cases in SQL syntax, and thoroughly testing your extension on a wide variety of queries to ensure correctness. Through this process, you’ll gain strong compiler development skills and a deep understanding of query optimisers. By the end of the project, you will have become proficient in database internals and compiler techniques.
Expected Outcome: A DuckDB extension that reconstructs SQL queries from logical plans through an AST, including support for complex constructs like subqueries and window functions.
Co-advised by Peter Boncz and Ilaria Battiston at CWI.
Materialised views are a powerful optimisation tool in databases, allowing precomputed query results to be reused for faster execution. However, their full potential is only realised when the system can automatically detect when a query can be answered using an existing view, a process known as view matching. This project explores how to implement view matching in DuckDB, either as a standalone extension or as an enhancement to the existing IVM module. The goal is to develop a mechanism that analyses incoming queries and determines whether they can be rewritten to use a materialised view, even if the query is phrased differently or includes additional filters or projections. This involves reasoning about query equivalence, containment, and rewriting strategies. You will investigate how to represent and compare query plans, how to match subqueries to views, and how to rewrite queries to use views without changing semantics. The project will also explore how to integrate view matching with DuckDB’s optimiser, so that the system can choose between using a view or computing the result from scratch based on cost. This work will make DuckDB smarter and more efficient, especially in workloads with repeated or overlapping queries.
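As a tiny illustration of the reasoning involved (real view matching must also handle joins, projections, grouping, and far richer predicates), consider a containment check for simple per-column range predicates:

```cpp
#include <map>
#include <string>

// A query can be answered from a materialised view if, for every column the
// view filters on, the query's range is at least as tight - then the query can
// be rewritten to scan the view and re-apply its own (tighter) filters.
// This is a deliberately simplistic sketch of query containment.
struct Range {
    double lo = -1e308;  // effectively unbounded
    double hi = 1e308;
};

using Predicates = std::map<std::string, Range>;  // column -> allowed range

bool QueryContainedInView(const Predicates &query, const Predicates &view) {
    for (const auto &[column, view_range] : view) {
        auto it = query.find(column);
        if (it == query.end()) {
            return false;  // the query does not restrict a column the view filters on
        }
        if (it->second.lo < view_range.lo || it->second.hi > view_range.hi) {
            return false;  // the query may need rows the view has already discarded
        }
    }
    return true;
}
```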
This project is research-heavy. You will be exploring theoretical concepts such as query containment, equivalence, and rewriting, and applying them in a practical system. The challenge lies in modelling these ideas within DuckDB’s architecture and designing algorithms that can perform view matching efficiently. You’ll prototype your ideas, test them on realistic workloads, and evaluate how much performance improvement they offer. By the end of the project, you will have gained deep knowledge of query rewriting, optimisation strategies, and the semantics of SQL, positioning you well for further research or advanced engineering roles in database systems.
Expected Outcome: A DuckDB extension that performs view matching and query rewriting based on materialised views. The project may lead to a workshop paper.
Co-advised by Peter Boncz and Ilaria Battiston at CWI.
Secure & Private Data Management (Lotte / Ilaria)
Data privacy is a critical concern today, and Differential Privacy (DP) offers a principled way to protect individual information in statistical results. This project involves adding differential privacy features to DuckDB through a new extension. In practical terms, you will enable DuckDB to answer SQL queries with a guarantee that individual contributions are masked by (user-defined) noise. For example, a query counting people in a dataset might have a tiny random offset added to the result, ensuring that no single record’s presence or absence can be detected. You will design a system where users can specify a privacy budget (an epsilon value) for their queries or a session, and DuckDB will enforce this by injecting noise and tracking usage of the budget. Accomplishing this requires modifying DuckDB’s parser, planner, and optimiser: the parser might accept new syntax or configuration for privacy, while the planner/optimiser will need to insert the appropriate noise-generation operators into the query plan and ensure that combining multiple queries still respects the overall privacy budget. Importantly, you won’t be inventing new noise algorithms from scratch (you can leverage existing DP libraries for the mathematical parts). The challenge is in the integration: making these mechanisms work within a relational database engine smoothly and efficiently. Through this project, DuckDB would become one of the few databases with out-of-the-box support for differential privacy, which is a cutting-edge feature in industry.
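A minimal sketch of the mechanism such an extension would wrap is shown below: Laplace noise for a COUNT query (sensitivity 1) plus a naive privacy accountant based on sequential composition. The class names are illustrative; the real extension would place this logic in the planner/optimiser as a plan operator rather than post-process results:

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <stdexcept>

// Naive privacy accountant: sequential composition simply subtracts the epsilon
// spent by each query from the session's total budget.
class PrivacyBudget {
public:
    explicit PrivacyBudget(double total_epsilon) : remaining(total_epsilon) {}
    void Spend(double epsilon) {
        if (epsilon > remaining) {
            throw std::runtime_error("privacy budget exhausted");
        }
        remaining -= epsilon;
    }
private:
    double remaining;
};

// Sample Laplace(0, scale) noise via inverse-CDF sampling.
double LaplaceNoise(double scale, std::mt19937_64 &rng) {
    std::uniform_real_distribution<double> uniform(-0.5, 0.5);
    double u = uniform(rng);
    return -scale * std::copysign(1.0, u) * std::log(1.0 - 2.0 * std::fabs(u));
}

// Differentially private COUNT: true_count + Laplace(sensitivity / epsilon).
double PrivateCount(int64_t true_count, double epsilon, PrivacyBudget &budget,
                    std::mt19937_64 &rng) {
    budget.Spend(epsilon);
    const double sensitivity = 1.0;  // adding/removing one row changes COUNT by at most 1
    return static_cast<double>(true_count) + LaplaceNoise(sensitivity / epsilon, rng);
}
```

In practice one would lean on an existing DP library for these primitives, as noted above; the thesis work is in tracking the budget across queries and wiring the noise operators into DuckDB’s plans.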
This project is research-heavy in the sense of design and analysis. The emphasis is on understanding how to model and enforce privacy in a database context. You will explore questions like how different query types propagate privacy loss, how to integrate the semantics of differential privacy in a relational engine, and how to balance accuracy with privacy by tuning parameters. The work involves a mix of implementation and careful reasoning: you’ll implement the extension (hooking into DuckDB’s engine to add noise and track privacy budgets), and then study the behaviour of this system. How much error does the noise introduce for various queries? Is the privacy guarantee truly maintained in complex query sequences? Answering these will deepen your understanding of both privacy theory and database systems. The skills you gain include working with privacy frameworks and modifying a database engine’s core component.
Expected Outcome: A DuckDB extension that integrates differential privacy mechanisms into the parser, planner, and optimiser.
Co-advised by Peter Boncz and Ilaria Battiston at CWI.
SIDRA is a decentralised, streaming architecture that enables privacy-preserving analytics by maintaining materialised views across Personal Data Stores (PDSs), a staging area, and central servers. Because data arrives incrementally and asynchronously from many clients, SIDRA behaves like a stream processing system, where updates are processed continuously and must respect privacy constraints before being materialised centrally. In this context, join operations become particularly challenging. Traditional join algorithms assume that both inputs are available in full, but in streaming systems, data arrives over time. Symmetric hash joins are a class of algorithms designed for such settings: they maintain hash tables for both inputs and process tuples as they arrive, emitting join results incrementally. This project investigates how to implement symmetric joins in DuckDB to support SIDRA’s streaming and decentralised model. You will explore how to buffer and match tuples from different PDSs, handle late arrivals and TTL expiry, and ensure that join results meet SIDRA’s privacy constraints (e.g. minimum aggregation). The goal is to build a join operator that works efficiently in a stream-like, privacy-aware environment.
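A symmetric hash join can be sketched as follows (illustrative only; SIDRA’s version additionally needs TTL expiry, late-arrival handling, and privacy checks before emitting results): both inputs keep a hash table, and every arriving tuple first probes the other side’s table and is then inserted into its own, so join results are emitted incrementally:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple {
    int64_t key;
    std::string payload;
};

class SymmetricHashJoin {
public:
    // Called when a tuple arrives on the left (resp. right) input stream.
    std::vector<std::pair<Tuple, Tuple>> OnLeft(const Tuple &t) {
        return ProbeAndInsert(t, right_table, left_table, /*left_arrived=*/true);
    }
    std::vector<std::pair<Tuple, Tuple>> OnRight(const Tuple &t) {
        return ProbeAndInsert(t, left_table, right_table, /*left_arrived=*/false);
    }
private:
    using Table = std::unordered_multimap<int64_t, Tuple>;

    std::vector<std::pair<Tuple, Tuple>> ProbeAndInsert(const Tuple &t, Table &probe,
                                                        Table &build, bool left_arrived) {
        std::vector<std::pair<Tuple, Tuple>> results;
        auto range = probe.equal_range(t.key);
        for (auto it = range.first; it != range.second; ++it) {
            // Emit matches in (left, right) order regardless of arrival side.
            results.emplace_back(left_arrived ? t : it->second,
                                 left_arrived ? it->second : t);
        }
        build.emplace(t.key, t);  // remember this tuple for future arrivals
        return results;
    }

    Table left_table, right_table;
};
```

The hard parts of the project are what this sketch omits: bounding the buffered state (TTL expiry), dealing with tuples that arrive late, and ensuring results satisfy SIDRA’s minimum-aggregation constraints before they are materialised centrally.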
This project is implementation-heavy. You will design and implement new join strategies within DuckDB, focusing on incremental and symmetric processing. This includes writing efficient C++ code, managing memory and buffering, and integrating with SIDRA’s metadata and privacy enforcement mechanisms. Through this work, you will gain expertise in join algorithms, streaming query execution, and privacy-aware data processing.
Expected Outcome: A DuckDB extension implementing symmetric join algorithms for streaming, decentralised execution in SIDRA. The project may result in a workshop paper.
Co-advised by Peter Boncz and Ilaria Battiston at CWI.
SIDRA is a decentralised data architecture that enables privacy-preserving analytics by keeping raw data in Personal Data Stores (PDSs) and only transmitting aggregated, privacy-compliant results to central servers. It uses a declarative SQL-based framework to define data flows and constraints, and compiles these into incremental view maintenance (IVM) pipelines across clients, staging areas, and central views.
This project explores how to enhance the SIDRA compiler with “smart compilation” techniques that optimise query execution by reducing unnecessary data transmission. For example, in Top-K queries, the compiler could generate logic that only transmits results from a PDS if they are likely to be in the global Top-K. Similarly, for queries with WHERE clauses, the compiler could push down predicates to filter data locally. Other strategies include window-aware pruning (ignoring updates outside the current query window) and privacy-aware rewriting (modifying queries to avoid sensitive columns or enforce minimum aggregation constraints). The goal is to make SIDRA’s compilation process query-aware and privacy-conscious, improving performance and reducing central load.
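As a flavour of the Top-K case (protocol and names invented for illustration), a PDS only needs to transmit local results that could still enter the global Top-K, given a threshold pushed down by the central server:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Result {
    double score;
    // ... payload columns ...
};

// Keep at most k local results whose score beats the current global k-th best;
// everything else provably cannot enter the global Top-K and need not be sent.
std::vector<Result> LocalTopKCandidates(std::vector<Result> local, std::size_t k,
                                        double global_kth_score) {
    local.erase(std::remove_if(local.begin(), local.end(),
                               [&](const Result &r) { return r.score <= global_kth_score; }),
                local.end());
    std::partial_sort(local.begin(), local.begin() + std::min(k, local.size()), local.end(),
                      [](const Result &a, const Result &b) { return a.score > b.score; });
    if (local.size() > k) {
        local.resize(k);
    }
    return local;
}
```

Predicate pushdown, window-aware pruning and privacy-aware rewriting follow the same pattern: generate per-PDS SQL that ships less data, and nothing sensitive, to the central server.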
This project is implementation-heavy. You will extend the SIDRA compiler to analyse query semantics and generate optimised SQL scripts tailored to each query type. This involves designing rule-based or heuristic transformations, modifying the compiler pipeline, and validating the improvements through benchmarks. By the end of the project, you will have gained deep expertise in query analysis, compiler design, and privacy-aware optimisation, making you well-equipped for advanced work in database systems and secure data processing.
Expected Outcome: A SIDRA compiler extension that generates query-aware, privacy-conscious SQL scripts optimised for decentralised execution. The project may result in a workshop paper.
This project will be co-advised by Peter Boncz and Ilaria Battiston at CWI.
FastLanes is a novel columnar file format invented at CWI and currently under active development. FastLanes does not yet incorporate any security features, but we are planning to implement vectorized encryption in FastLanes. However, since Machine Learning (ML) workloads are increasingly important, there is an interest in file formats that work seamlessly on both CPUs and GPUs. More specifically, this means that preprocessing steps such as decryption and decompression are pushed onto the GPU. FastLanes is currently working on making its file format more GPU-compatible, e.g. by keeping data in cache after decompression to operate on it directly, avoiding unnecessary passes to global (main) memory. We want to propose a similar mechanism for encryption: first decrypt the data and then directly decompress it, while keeping it in registers or shared memory. However, due to e.g. the limited memory capacity of GPUs and the required 128-bit block size of AES, this is not straightforward. In this project, we aim to answer the following questions:
- How can encryption algorithms (e.g. AES-GCM, AES-CTR, or AES-OCB) be implemented while leveraging the SIMT parallelism of the GPU?
- Can we leverage the synergy between decryption and decompression on GPUs to improve GPU data processing, and how can this be integrated into the FastLanes file format?
As a first step, we want to implement and benchmark different encryption algorithms, such as AES counter (CTR) mode and possibly AES offset codebook (OCB) mode, on the GPU. Subsequently, we are interested in implementing en/decryption on top of FastLanes’ Bitpacking, Delta, RLE and ALP encodings, to test the synergy of decrypting and decompressing in one go to speed up GPU query processing.
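To see why counter (CTR) mode fits massively parallel hardware, the sketch below shows its structure on the CPU: the keystream for each 16-byte block depends only on a nonce and the block index, so blocks can be decrypted independently (on a GPU, one thread or lane per block). The block cipher here is a dummy placeholder so the sketch is self-contained; a real implementation would run AES rounds or use a crypto library:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::array<uint8_t, 16>;  // AES block size: 128 bits

// Placeholder "block cipher": NOT cryptographically secure; stands in for E_k(x).
Block EncryptBlock(const Block &in) {
    Block out;
    for (int i = 0; i < 16; i++) {
        out[i] = static_cast<uint8_t>((in[i] ^ 0xA5) + i);
    }
    return out;
}

// CTR mode: ciphertext block i is plaintext block i XOR E_k(nonce || i).
// Every block only needs the counter value i, so all blocks are independent -
// exactly the shape SIMD/SIMT execution wants.
void DecryptCTR(const std::vector<uint8_t> &ciphertext, uint64_t nonce,
                std::vector<uint8_t> &plaintext) {
    plaintext.resize(ciphertext.size());
    size_t num_blocks = (ciphertext.size() + 15) / 16;
    for (size_t i = 0; i < num_blocks; i++) {  // on a GPU: one thread per block
        Block ctr = {};  // nonce in the first 8 bytes, block index in the last 8
        for (int b = 0; b < 8; b++) {
            ctr[b] = static_cast<uint8_t>(nonce >> (8 * b));
            ctr[8 + b] = static_cast<uint8_t>(static_cast<uint64_t>(i) >> (8 * b));
        }
        Block keystream = EncryptBlock(ctr);
        size_t offset = i * 16;
        size_t len = std::min<size_t>(16, ciphertext.size() - offset);
        for (size_t j = 0; j < len; j++) {
            plaintext[offset + j] = ciphertext[offset + j] ^ keystream[j];
        }
    }
}
```

Modes such as GCM and OCB add authentication on top of this basic structure, which is part of what makes the design space on GPUs (and the interaction with decompression) interesting.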
Objectives
- Benchmark encryption algorithms on the GPU, and compare the performance against CPU-based encryption
- Determine the optimal granularity of decrypting and decompressing in one pass on the GPU (e.g. small vectors, large vectors or even blocks)
- Investigate how to best integrate vectorized encryption on GPUs in the FastLanes file format
The expected results of this project are:
An efficient implementation of one or multiple encryption algorithms that leverages the single-instruction, multiple-threads (SIMT) parallelism of the GPU, and the integration of vectorized GPU encryption in the FastLanes file format.
This project will be advised by Peter Boncz (boncz@cwi.nl) and Lotte Felius (felius@cwi.nl) at CWI.
Trusted Execution Environments (TEEs) like Intel SGX, AMD SEV, ARM TrustZone, and others offer secure enclaves for processing sensitive data. While many benchmarks exist for individual TEEs, there is no comprehensive evaluation that compares them across a range of database workloads. This project aims to fill that gap by conducting a systematic benchmarking study of TEE technologies, focusing on hybrid workloads. You will port DuckDB - or potentially another database engine - to run inside multiple TEE platforms, adapting the system as needed to support transactional operations. Then, you will design and execute a suite of benchmarks that measure performance, scalability, and security overhead across different TEEs. This includes evaluating latency, throughput, memory usage, and security guarantees. The goal is to produce a clear comparison of which TEE performs best under which conditions, and what trade-offs are involved. Your results will help guide future development of secure database systems and inform researchers about the strengths and limitations of each TEE.
This project is implementation-heavy. You will be working at the intersection of database engineering and hardware security, adapting database code to run inside various TEEs and tuning it for performance. This involves understanding the constraints of each platform (e.g. memory limits, encryption overhead, syscall restrictions) and modifying the database engine accordingly. You’ll also design and run rigorous benchmarks, analyse the results, and document your findings. By the end of the project, you will have gained hands-on experience with multiple TEE technologies, database internals, and performance analysis, making you highly skilled in secure systems deployment.
Expected Outcome: A benchmark suite and comparative analysis of TEE platforms for OLTP workloads, with a ported DuckDB or alternative engine. The project may lead to a workshop paper.
Co-advised by Peter Boncz and Ilaria Battiston at CWI.
When sensitive data is processed on someone else’s machine (for example, in the cloud), one way to keep it safe is to use a Trusted Execution Environment (TEE) like Intel SGX or AMD SEV. A TEE is a secure enclave that isolates the computation and data from the rest of the system, even if the operating system is compromised. However, running database operations inside a TEE is slower and more cumbersome due to the extra security overhead (encryption, memory access restrictions, etc.). In this project, you will develop a DuckDB extension that is optimised for TEE environments, allowing DuckDB to run queries inside SGX/SEV enclaves with better performance. You’ll investigate the performance challenges specific to TEEs: for instance, Intel SGX enclaves often suffer from more cache misses and have to use specialised memory allocators, and certain operations become expensive due to encryption of memory pages. You will then modify DuckDB’s query execution methods to mitigate these issues. This could include techniques like vectorised encryption (encrypting/decrypting data in batches to exploit CPU instructions), manual loop unrolling or other low-level optimisations to reduce branch and transition overhead, and tuning how memory is used to be enclave-friendly. Additionally, you will examine differences between Intel’s and AMD’s TEE implementations - for example, how they handle larger memory or multiple threads – to ensure your DuckDB extension is portable and efficient on both. You’ll validate your optimised DuckDB enclave version with benchmarks (e.g., TPC-H queries) to quantify the improvements and identify any remaining trade-offs. By the end, you will have created a prototype that brings DuckDB a step closer to being a secure-yet-fast analytic database for cloud environments.
This project is implementation-heavy. You will be working at a low level in the system, writing C++ code that interfaces with hardware-specific SDKs (like Intel SGX SDK) and making fine-grained performance improvements. Expect to spend time profiling the system to pinpoint bottlenecks (e.g., encryption overhead, context-switch costs) and then iteratively improving the code. A strong interest in computer architecture and systems programming will be essential, as you’ll be dealing with how the database engine runs under the hood in a constrained environment. The learning curve is steep, but the reward is experience with cutting-edge security technology and optimisation techniques. By successfully completing this project, you’ll become well-versed in hardware security mechanisms and how to tailor software to make the most of them, making DuckDB one of the few analytic databases that can run effectively out-of-the-box in secure enclaves.
Expected Outcome: A DuckDB extension optimised for Intel SGX and AMD SEV, with performance benchmarks and platform-specific tuning.
Co-advised by Peter Boncz, Lotte Felius, and Ilaria Battiston at CWI.
Vector Search (Leonardo, PDX)
Vector Search is one of the hottest topics in AI and computer science. It is a core component of various applications, including retrieval augmented generation (RAG), recommendation systems (e.g., Spotify’s song recommendations), and information retrieval (e.g., web search engines). This has resulted in the development of vector databases: systems to store, manage and query vectors (e.g., Weaviate, Milvus). Some vendors are betting on cloud-based storage (LanceDB, Turbopuffer). However, these are closed-source and use proprietary data formats. We believe the vector search landscape needs a portable, lightweight, open-source, and easy to adopt solution in the cloud.
In this project, we aim to continue the efforts to implement vector search in DuckLake. DuckLake is an open Lakehouse format: it stores metadata in a catalog database (e.g., DuckDB) and data in Parquet files. To make search fast, we want to use PDX: a novel approach developed at CWI that accelerates similarity search by large factors and does not need heavyweight indexing to achieve state-of-the-art (SOTA) search speed.
The expected deliverables are:
- A working implementation of PDX inside DuckLake (local).
- A comparison of the performance of PDX+DuckLake with another similar system.
- Extended goal: a research publication at a top data systems conference/workshop, as this project has a strong research component.
A good candidate will have: experience in C++ programming.
You will be advised by Peter Boncz (boncz@cwi.nl) and Leonardo Kuffo (lxkr@cwi.nl) at CWI.
Vector Search is one of the hottest topics in AI and computer science. It is a core component of various applications, including retrieval augmented generation (RAG), recommendation systems (e.g., Spotify’s song recommendations), and information retrieval (e.g., web search engines). The importance of vector search is such that most database systems have implemented vector-based capabilities. At CWI, we have developed PDX: a novel approach that accelerates similarity search by large factors. PDX is highly compatible with hardware acceleration on CPUs (SIMD), and we believe it will also benefit significantly from GPU processing.
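To give a flavour of why the PDX layout helps (details of the actual format are simplified here), compare a classic vector-at-a-time distance kernel with a PDX-style dimension-at-a-time kernel over a block of vectors: the vertical loop reads contiguous memory, auto-vectorizes trivially, and builds distances up incrementally, which is also what enables pruning:

```cpp
#include <algorithm>
#include <cstddef>

// Horizontal (vector-major) layout: one vector's distance at a time.
void L2Horizontal(const float *vectors, size_t num_vectors, size_t dims,
                  const float *query, float *out) {
    for (size_t v = 0; v < num_vectors; v++) {
        float dist = 0.0f;
        for (size_t d = 0; d < dims; d++) {
            float diff = vectors[v * dims + d] - query[d];
            dist += diff * diff;
        }
        out[v] = dist;
    }
}

// PDX-style vertical (dimension-major) layout within a block: the inner loop
// accumulates one dimension across all vectors - contiguous loads and an easy
// target for SIMD on CPUs or SIMT threads on GPUs.
void L2Vertical(const float *block_dim_major, size_t num_vectors, size_t dims,
                const float *query, float *out) {
    std::fill(out, out + num_vectors, 0.0f);
    for (size_t d = 0; d < dims; d++) {
        const float *dim_values = block_dim_major + d * num_vectors;
        float q = query[d];
        for (size_t v = 0; v < num_vectors; v++) {
            float diff = dim_values[v] - q;
            out[v] += diff * diff;
        }
    }
}
```

Porting this dimension-at-a-time shape to CUDA and Vulkan, and comparing it against existing GPU vector-search algorithms, is the core of the project.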
In this project, we aim to implement PDX on modern GPUs (CUDA and Vulkan) and compare it with other vector search algorithms on GPUs.
The expected deliverables are:
- A working implementation of PDX on at least one GPU architecture.
- A comparison of PDX to other algorithms on GPUs.
- Extended goal: a research publication at a top data systems conference/workshop.
A good candidate will have: experience in C++.
An excellent candidate will have: experience with GPU programming.
You will be advised by Peter Boncz (boncz@cwi.nl) and Leonardo Kuffo (lxkr@cwi.nl) at CWI.
Behind the scenes, Spotify’s Discover Weekly playlist is created as follows: 50M users and 500M songs are represented as vector embeddings. Every week, Spotify runs an Approximate Nearest Neighbor Search (a.k.a. Vector Similarity Search) for every user to find the top 30 most similar songs. This means that every week, Spotify performs 50M individual queries on a huge collection of embeddings. We believe these workloads can be optimized by doing a Similarity Join.
A similarity join is analogous to a database join, but here the condition relating two rows is based on a similarity metric. Unfortunately, the research on similarity joins is not on par with the research on individual similarity queries. We therefore ask whether we can do better.
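For concreteness, a brute-force baseline for a threshold-based similarity join is sketched below (the metric and threshold form are illustrative); the research question is how to beat this at the 50M x 500M scale described above, e.g. by sharing work across queries, pruning, batching, or approximation:

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

float L2Distance(const Vec &a, const Vec &b) {
    float dist = 0.0f;
    for (size_t d = 0; d < a.size(); d++) {
        float diff = a[d] - b[d];
        dist += diff * diff;
    }
    return std::sqrt(dist);
}

// Naive nested-loop similarity join: report all (user, song) pairs closer than
// a threshold. Quadratic in the input sizes - the baseline to improve upon.
std::vector<std::pair<size_t, size_t>> SimilarityJoin(const std::vector<Vec> &users,
                                                      const std::vector<Vec> &songs,
                                                      float threshold) {
    std::vector<std::pair<size_t, size_t>> result;
    for (size_t u = 0; u < users.size(); u++) {
        for (size_t s = 0; s < songs.size(); s++) {
            if (L2Distance(users[u], songs[s]) <= threshold) {
                result.emplace_back(u, s);
            }
        }
    }
    return result;
}
```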
In this project, we want to research and develop a new strategy/framework/algorithm to perform similarity joins at a large scale. This project has a strong research component, in which we expect the candidate to read previous literature and come up with new ideas.
The expected deliverables are:
- Proposal of a new strategy/framework/algorithm to perform Similarity Join in vector search.
- An implementation of the algorithm in C++.
- A comprehensive comparison with other already existing techniques to do Similarity Joins.
- Extended goal: a research publication at a top data systems conference/workshop.
A good candidate will have: motivation for research, experience in C++.
You will be advised by Peter Boncz (boncz@cwi.nl) and Leonardo Kuffo (lxkr@cwi.nl) at CWI.
Core DB topics (Paul & Pedro)
While many optimizations in storage and query processing focus on data types that are internally numbers (decimal, float, but also e.g. datetime), numerous workload studies, such as Redset, SnowSet and Get Real, have shown that over half of the columns in tables are varchar (string) – and given that strings are on average larger than numbers, the majority of database data is string data.
However, string processing is slow: comparing strings requires a for-loop over the characters, whereas comparing numbers is a single CPU instruction. Also, strings are big; this means that the data structures that HashJoin, HashAggregate (GROUP BY) and Sort (ORDER BY) depend on get big, causing more memory usage (scarce!), CPU cache misses, but also disk spilling (slow!). DuckDB supports two forms of string compression: Dictionary encoding (strings become numbers that point into a dictionary) and FSST, which compresses individual strings into shorter individual strings using a symbol table. In both cases, there is an independent dictionary (resp. symbol table) for every row group (100K rows in a column), which complicates query processing.
The main goal is to find a way to also keep FSST strings compressed while inside a hash-table or sort buffer (there are working ideas for how to do this). A stretch goal would be to move DuckDB from FSST to FSST+.
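One property that helps here: for a fixed symbol table, FSST encoding is deterministic, so equal plaintext strings have byte-identical compressed forms, and a hash table can hash and compare the compressed bytes directly without decompressing. The sketch below shows the idea for a group-by, assuming all keys share one symbol table; handling keys compressed with different per-row-group symbol tables is exactly where the open design work lies:

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

struct GroupState {
    int64_t count = 0;
};

// Group-by that keeps FSST-compressed keys as-is: hashing and equality operate
// on the compressed bytes; decompression is deferred until results are produced.
// (Sketch only - assumes a single shared symbol table.)
class CompressedStringGroupBy {
public:
    void Add(std::string_view compressed_key) {
        groups[std::string(compressed_key)].count++;
    }
    const std::unordered_map<std::string, GroupState> &Groups() const {
        return groups;
    }
private:
    std::unordered_map<std::string, GroupState> groups;
};
```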
This project meets a strong need for millions of DuckDB users and will have a lot of visibility.
You will be advised by Peter Boncz (boncz@cwi.nl) and Paul Gross (Paul.Gross@cwi.nl)
DuckDB uses so-called linear-chained hash tables, where hash collisions are handled with linear probing (go to the next bucket) but duplicates with chaining (a next pointer in the records) [1]. We typically will not follow a bucket pointer for a key collision, as the ‘salt’ hash-signature we store in the (unused) highest bits of the pointer would not match.
If a hash table has many duplicates, traversing these hash buckets is wasteful: unless the duplicates were inserted sequentially, following the chain will cause expensive CPU cache misses. Besides the next pointer, every hash-table record also contains (repeats) the hash key(s) – which are known to be equal for all elements in the chain. In those cases, it makes sense to reorganize the records [2] into chunks whose header holds the keys and the number of tuples (and possibly other aggregates), followed by a dense array of records without keys or next pointers. We need a parallel and CPU-efficient algorithm for doing this.
Rather than always using the same approach for hash tables, as in the new chunk-reorganized design of Umbra [2], a vectorized system can more easily use a run-time optimized approach: depending on the key distribution (and value distribution), the hash table layout could differ. So, we should only reorganize the records in case of many duplicated keys. But we can extend the adaptivity in other dimensions as well, for instance the choice of hash function (very quick, or very robust). If there are very few tuples, the load factor can be lower, whereas big hash tables are better off more fully loaded.
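A sketch of the chunked record layout described above, with invented field names, contrasted with the classic chained layout (the project’s algorithmic core is building this layout in parallel and deciding at run time, based on the observed key distribution, whether to build it at all):

```cpp
#include <cstdint>

// Classic chained layout: every record repeats the key and carries a next
// pointer; walking a long duplicate chain jumps across random addresses.
struct ChainedRecord {
    ChainedRecord *next;
    uint64_t key;        // repeated for every duplicate
    int64_t columns[2];  // payload / aggregate columns
};

// Chunked layout for heavily duplicated keys: the key (plus a tuple count and
// possibly pre-aggregates) is stored once in a header, followed contiguously
// by a dense array of payload rows without keys or next pointers.
struct PayloadRow {
    int64_t columns[2];  // payload only - no key, no next pointer
};

struct DuplicateChunkHeader {
    uint64_t key;                      // stored once for the whole chunk
    uint32_t count;                    // payload rows currently in the chunk
    uint32_t capacity;                 // payload slots allocated after the header
    DuplicateChunkHeader *next_chunk;  // only needed when a chunk overflows
    // 'capacity' PayloadRow entries follow in the same allocation:
    PayloadRow *Rows() { return reinterpret_cast<PayloadRow *>(this + 1); }
};
```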
[1] https://www.cwi.nl/~boncz/cidr2025_factorized.pdf
[2] https://db.in.tum.de/~birler/papers/hashtable.pdf
You will be advised by Peter Boncz (boncz@cwi.nl) and Paul Gross (Paul.Gross@cwi.nl)
CSV files are among the most common ways to store data. Although CSV files have a well-defined format, they often do not strictly adhere to it, making it challenging for systems to read them correctly, or even at all.
In this project, you will work on DuckDB's CSV Reader, with the goal of enabling it to self-correct CSV files based on sniffing information collected during binding. You will explore a range of techniques, from heuristics to more advanced data cleaning methods, to further enhance the reader’s robustness while minimizing any impact on performance. Additionally, this project will involve further development of robustness and speed benchmarks for CSV files.
References:
- https://www.youtube.com/watch?v=YrqSp8m7fmk&ab_channel=DSDSD-DutchSeminaronDataSystemsDesign
- https://duckdb.org/docs/data/csv/reading_faulty_csv_files.html
- https://duckdb.org/2023/10/27/csv-sniffer.html
- https://www.vldb.org/pvldb/vol16/p1870-vitagliano.pdf
- https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
This project will be advised by Pedro Holanda (DuckDB Labs) and Peter Boncz (CWI).
Database systems offer various integral types, each occupying a different number of bytes and capable of holding a specific range of values. In DuckDB, these range from 1-byte integers to 16-byte integers. However, certain projects require storing integers larger than 16 bytes without any data loss (i.e., without storing them as doubles), which is crucial for such applications. Inspired by Python and JavaScript, DuckDB recently introduced the foundation for a new data type: the VARINT type. This type allows DuckDB to store integers of up to 2^23 bytes. The binary format in which these values are stored is also designed to facilitate easy comparisons between different VARINTs.
In this project, you will begin with a review of the state-of-the-art, examining the V8 and CPython implementations of variable integers as well as Postgres' strategy for storing very large decimals. After this review, you will adapt these techniques to optimize the (currently quadratic) algorithm for casting strings and doubles to VARINT. Later, you will define and implement arithmetic operations on this new data type.
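To see where the quadratic cost comes from, here is the schoolbook conversion shape (the limb layout is illustrative, not DuckDB’s actual VARINT format): every decimal digit multiplies the whole multi-limb accumulator by 10, so n digits cost n passes over an O(n)-limb number:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Schoolbook decimal-string -> big-integer conversion: O(n^2) for n digits.
// Assumes 'digits' contains only decimal digits (no sign, no validation).
std::vector<uint32_t> DecimalStringToLimbs(const std::string &digits) {
    std::vector<uint32_t> limbs = {0};  // little-endian base-2^32 limbs
    for (char c : digits) {
        uint64_t carry = static_cast<uint64_t>(c - '0');
        for (auto &limb : limbs) {                // one full pass per input digit
            uint64_t value = static_cast<uint64_t>(limb) * 10 + carry;
            limb = static_cast<uint32_t>(value);  // keep the low 32 bits
            carry = value >> 32;                  // carry into the next limb
        }
        if (carry != 0) {
            limbs.push_back(static_cast<uint32_t>(carry));
        }
    }
    return limbs;
}
```

Divide-and-conquer conversion (the approach taken in CPython’s _pylong.py and V8’s fromstring.cc) instead splits the digit string in half, converts both halves, and combines them with a single multiplication by a power of ten, which becomes subquadratic once paired with a fast big-integer multiplication.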
References:
https://github.com/duckdb/duckdb/pull/13015
https://github.com/v8/v8/blob/main/src/bigint/fromstring.cc
https://github.com/python/cpython/issues/90716
https://github.com/python/cpython/blob/19d1e43e43df97d14c5ab415520b6ccd941e1c88/Lib/_pylong.py#L389
https://github.com/v8/v8/blob/main/src/bigint/tostring.cc
This project will be advised by Pedro Holanda (DuckDB Labs) and Peter Boncz (CWI).
The new FastLanes file format (Omid)
FastLanes is a new big-data format that aims to provide much faster decompression and better compression than current data formats (Parquet, ORC). It also aims to be suitable for GPUs. The key idea is to use completely data-parallel encodings (RLE, DICT, DELTA, FOR) to align with the capabilities of SIMD and SIMT processors. These encodings are simple and would not produce a sufficient compression ratio by themselves. But when applying them recursively, i.e. by further encoding the result of encoding 1 with encoding 2 (“cascading encodings”), we can achieve state-of-the-art compression ratios. Note that compression in Parquet/ORC heavily relies on slow, GPU-unfriendly, general-purpose compression algorithms like Snappy, LZ4 and zstd (these algorithms are heavily speed-optimized, but they are still at least 10x slower than simple data-parallel encodings).
C20: Arrow is a popular in-memory columnar format, and we want a FastLanes decoder that produces data in this format. FastLanes advances the state of the art by allowing partial decompression: it can generate data in encoded form by stopping decoding at a particular level of the encoding expression. Arrow supports the REE (Run-End Encoding) layout, and we would want the FastLanes RLE and CROSS_RLE operators to decompress directly into it. Similarly, the DICT operator could decode directly into the Arrow Dictionary layout. We already have a FastLanes DuckDB reader, and this reader could make the same dictionary-decoding optimization, and possibly also an optimization of decoding into FOR-vectors (if these get added to DuckDB; work in progress). A stretch goal is to integrate support for predicate pushdown into the FastLanes C++ decoder. We would want to push down column-vs-constant predicates (and AND/OR/NOT combinations thereof) as well as casts (specifically casts from VARCHAR to other types).
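For C20, the mapping from classic (value, run-length) pairs to Arrow’s Run-End-Encoded layout is essentially a prefix sum over the run lengths; a hedged sketch (ignoring Arrow’s actual builder APIs and FastLanes’ internal representation):

```cpp
#include <cstdint>
#include <vector>

// Classic RLE stores (value, run length) pairs; Arrow's REE layout stores a
// 'run_ends' child (the running sum of lengths, i.e. the exclusive end index of
// each run) next to a 'values' child. Converting is a prefix sum, so an
// RLE-encoded column can be handed to Arrow without fully decompressing it.
struct RleRun {
    int64_t value;
    int32_t length;
};

struct ArrowReeColumn {
    std::vector<int32_t> run_ends;  // run_ends[i] = logical end index of run i
    std::vector<int64_t> values;    // values[i]   = value of run i
};

ArrowReeColumn RleToArrowRee(const std::vector<RleRun> &runs) {
    ArrowReeColumn ree;
    ree.run_ends.reserve(runs.size());
    ree.values.reserve(runs.size());
    int32_t logical_end = 0;
    for (const auto &run : runs) {
        logical_end += run.length;  // prefix sum of run lengths
        ree.run_ends.push_back(logical_end);
        ree.values.push_back(run.value);
    }
    return ree;
}
```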
C21: FastLanes will greatly expand the number of encoding expressions it uses (now ~20) to hundreds or even thousands. Given such a large number of encodings to choose from, the wizard that looks at a row-group sample and has to decide which encoding would yield the highest compression ratio will need to stop using exhaustive search. The task of this project is to learn a model that allows the compressor to quickly identify a small number of promising candidate expressions. A stretch goal of this project is for your software to become part of the official C++ FastLanes implementation. This would also entail an overhaul of the current wizard, which, unlike the FastLanes decoding engine, has not been optimized and would benefit from a strong speed boost.
https://github.com/cwida/fastlanes (all)
https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf (all)
https://www.vldb.org/pvldb/vol18/p4629-afroozeh.pdf (all)
https://arrow.apache.org/docs/format/Columnar.html (C20)
https://homepages.cwi.nl/~boncz/msc/2024-RaufsDunamalijevs.pdf (C20)
https://vldb.org/cidrdb/papers/2020/p4-ghita-cidr20.pdf (C20+C21)
This project will be advised by Omid Afroozeh and Peter Boncz (CWI).