Dutch Seminar on Data Systems Design

The seminar will be held via Zoom. The link will be sent separately in a follow-up email.

24 Nov 2023, 3:30 p.m. to 5:00 p.m. CET (GMT+0100)

I am pleased to announce a new edition of our DSDSD series, featuring two talks. Next Friday, we will host two young, talented engineers coming directly from our CWI pipeline. The event will take place on Friday, November 24th, 2023, from 3:30 p.m. to 5:00 p.m. CET, with talks by Tania Bogatsch and Thomas Glas.

The seminar will be held at CWI and streamed via Zoom. The link will be sent separately in a follow-up email.

Please see below for details on the talks and the speakers.

1st talk
Title: Lambda functions in the duck's nest

Many SQL databases do not focus on efficient LIST-type support. Scalar functions and aggregations on LIST values often require additional unnesting steps or loading normalized data. However, nested input formats such as JSON are widespread in analytics. Efficient operations directly on these input formats can leverage the potential of SQL engines while increasing the system's ease of use. However, using this potential is not trivial, as the LIST type's underlying storage format and operations have to synergize with the relational execution model. DuckDB is a high-performance relational database system for analytics. In this talk, I'll showcase DuckDB's internal design choices to support LISTs efficiently and highlight our support of Python-style list comprehension directly in the SQL dialect.
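To make the storage point above concrete, here is a minimal Python sketch of one common columnar layout for nested lists: a flat child array plus one (offset, length) entry per row. This layout and the helper names `to_columnar`, `list_transform`, and `to_rows` are assumptions chosen for illustration; the announcement does not describe DuckDB's internals, and this is not DuckDB code. The sketch shows why a per-element lambda, the equivalent of the Python comprehension `[x * 2 for x in row]`, can run as a single pass over the child array without unnesting the rows.

```python
# Hypothetical columnar layout for a LIST column: all elements live in one
# flat child array, and each row stores an (offset, length) pair into it.
def to_columnar(rows):
    child, entries = [], []
    for row in rows:
        entries.append((len(child), len(row)))
        child.extend(row)
    return child, entries


def list_transform(child, entries, fn):
    # Apply fn in one pass over the flat child array; the row boundaries
    # (offsets and lengths) are reused unchanged.
    return [fn(x) for x in child], entries


def to_rows(child, entries):
    # Rebuild the nested rows from the flat representation.
    return [child[off:off + n] for off, n in entries]


rows = [[1, 2, 3], [], [4, 5]]
child, entries = to_columnar(rows)
doubled_child, doubled_entries = list_transform(child, entries, lambda x: x * 2)
doubled_rows = to_rows(doubled_child, doubled_entries)
```

Because the transformation touches only the child array, the empty list in the second row costs nothing extra and no row has to be unnested and regrouped.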

I studied Computer Science from 2016 to 2022 in Ilmenau, Germany. After my Bachelor's, I got the opportunity for a four-month internship at the CWI in Amsterdam, where I worked on adaptive expression reordering in DuckDB. In 2022, after finishing my studies, I returned to Amsterdam to work for DuckDB Labs as a software engineer.

2nd talk
Title: C3: Compressing Correlated Columns

Open file formats typically use a set of lightweight compression schemes to compress individual columns, taking advantage of data patterns found within the values of each column. However, by compressing columns separately, we ignore correlations between columns that could allow us to compress more effectively. Real-world datasets exhibit many such column correlations, and we research how they can be exploited for compression.
In this talk, we introduce C3 (Compressing Correlated Columns), a new compression framework which can exploit correlations between columns for compression. We designed C3 on top of typical lightweight compression infrastructure and added six new multi-column compression schemes which exploit correlations. We designed our multi-column compression schemes based on correlations we found in real-world datasets, but new compression schemes exploiting other types of correlations can easily be added. C3 uses a sampling-based algorithm to choose the most effective scheme to compress each column. We evaluated the effectiveness of C3 on the Public BI benchmark, containing real-world datasets, and achieved around 20% higher compression ratios compared to using only typical single-column compression schemes.
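
As a rough illustration of sampling-based scheme selection, the Python sketch below estimates the cost of two schemes on a random sample of rows and picks the cheaper one. Everything in it is an assumption made for this example: the announcement does not specify C3's six schemes or its sampling algorithm, so the `one_to_one` scheme, the cost unit (number of stored entries), and all function names are hypothetical.

```python
import random


def plain_size(target):
    # Baseline single-column scheme: one stored entry per row.
    return len(target)


def one_to_one_size(source, target):
    # Hypothetical multi-column scheme: keep one mapping entry per distinct
    # source value and store explicitly only the rows that break the mapping.
    mapping = {}
    exceptions = 0
    for s, t in zip(source, target):
        if mapping.setdefault(s, t) != t:
            exceptions += 1
    return len(mapping) + exceptions


def choose_scheme(source, target, sample_size=100, seed=0):
    # Estimate each scheme's cost on a random sample of rows and return
    # the name of the cheapest scheme together with all estimated costs.
    rng = random.Random(seed)
    n = len(target)
    idx = rng.sample(range(n), min(sample_size, n))
    s = [source[i] for i in idx]
    t = [target[i] for i in idx]
    costs = {"plain": plain_size(t), "one_to_one": one_to_one_size(s, t)}
    return min(costs, key=costs.get), costs


# Illustrative data: the country column is fully determined by the city column.
country_of = {"Amsterdam": "NL", "Utrecht": "NL", "Munich": "DE", "Ilmenau": "DE"}
cities = ["Amsterdam", "Utrecht", "Munich", "Ilmenau"] * 250
countries = [country_of[c] for c in cities]

scheme, costs = choose_scheme(cities, countries)
```

With the correlated city and country columns, the sample needs at most four mapping entries instead of 100 plain entries, so the multi-column scheme wins. If the source column were a unique row id, the mapping would be as large as the sample and the plain scheme would be kept.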

Thomas Glas is pursuing his master's degree in computer science at the Technical University of Munich. He joined the Database Architectures Group at CWI in May this year to write his master's thesis on columnar data compression.

We look forward to seeing you all at the next session!
