PhD defence Dean De Leo (Database Architectures)

Everyone is invited to attend the public defence of Dean De Leo of his thesis entitled 'Analysing Dynamic Graphs on Sparse Arrays'.

When
26 Sep 2022 from 3:45 p.m. to 26 Sep 2022 4:45 p.m. CEST (GMT+0200)
Web
Add

Everyone is invited to attend the public defence of Dean De Leo of his thesis entitled 'Analysing Dynamic Graphs on Sparse Arrays'.

Promotores: prof.dr. P.A. Boncz (CWI, VU) and dr. Hannes Mühleisen (CWI)

Link to attend the defence: https://vu.nl/en/events/2022/phd-defence-d-de-leo

Efficient graph analytics is challenging due to the irregular structures of graphs, and becomes even more complex as updates happen during the graph analysis. Addressing that scenario is the topic of this thesis.

This research was performed by Dean de Leo in the Database Architectures group of CWI and advised by Peter Boncz and Hannes Mühleisen. It studies the properties of a data structure called Packed Memory Arrays, proposes a new variant called Rewired Memory Arrays and used such data structures in a new open-source transactional graph analytics system called Teseo. Dean de Leo now works for the graph database company Relational.ai

The committee is formed by dr. Semih Salihiglu (associate professor at University of Waterloo), dr. Petra Selmer (team-lead graph query languages at neo4j), prof. George Fletcher (TU/e), prof. Stefan Manegold (Leiden University and CWI) and prof. Henri  Bal (VU) and will be presided over by prof. Jaap Heringa (VU).

Hope to see you there!


== Analysis of Dynamic Graphs on Sparse Arrays - PhD thesis Dean de Leo ==

Frameworks optimised for graph analysis tend to rely on data structures that are write unfriendly, often assuming that the examined graph is static and immutable.

Traditionally, small transactions and fine-grained updates have been primarily served by different DBMSes, explicitly designed for on-line transaction processing (OLTP). At a later stage, data would be extracted from these systems and loaded in a graph framework designed for analytical queries (OLAP). This pipeline, however, is not adequate for time-sensitive scenarios that require to analyse dynamic graphs, where their components (vertices, edges) and properties can be continuously altered in realtime. These scenarios are commonly referred to as hybrid or Hybrid Transactional Analytical Processing (HTAP) workloads and are a general open problem in research: the same techniques and data structures that make systems so effective in analytical queries are often the main culprit that render them inefficient in fine-grained updates.

In the thesis, we investigate a data structure, the Packed Memory Array (PMA), also named sparse array, that promises to bridge the performance of analytical queries, by enabling fast sequential scans, with a potential high throughput in updates. The PMA is an existing data structure, but it has been subject, up to now, to mostly theoretical work. In our first experiments, we discover that the PMA, in practice, suffers a variety of structural problems than what the developed theory implies. We present a new data structure, named the Rewired Memory Array (RMA), derived from the PMA, aimed at fixing its issues by enhancing or inventing a new set of techniques. Eventually, we prove that the RMA is competitive to hand-tuned B+ trees in terms of fine-grained updates and it can reach close performance in (range-) scans to static dense arrays, in both sequential and parallel environments.

From the RMA, we develop a new system explicitly aimed at the context of graph analysis. Computations on graphs tend to feature a number of challenges, caused by frequent random memory jumps, a high memory-to-computation ratio and partitioning issues. The Compressed Sparse Row (CSR) is an established efficient representation to operate on graphs, already adopted by a number of frameworks, but it is read-only: any change to the graph demands rebuilding the whole data structure. Most of the related work, instead, that explicitly targets graph analysis on dynamic graphs, is based on a common design, which physically mimics the same structure of the adjacency list model. We note, however, that these systems exhibit several functional limitations, related to their capabilities and to the consistency and isolation guarantees, and non functional limitations, related to update skew and missed opportunities in the access patterns of scans. We present a new system, named Teseo, to analyse dynamic graphs. Teseo adopts a design based on fat trees, B+ trees with very large leaves, where the leaves are sparse arrays, and offers full transactional
support. We show that Teseo can overcome most of the functional constrains of existing systems to analyse dynamic graphs, while achieving comparable performance, depending on the access patterns, to either the CSR or, in the worst-case scenario, to the other existing systems.

Sparse arrays and fat trees also represent two new building blocks to make generic systems equally capable of tackling OLAP and OLTP workloads. Still, column stores, widely used in analytical systems, exploit additional techniques, notably vertical partitioning and compression, that can further improve their efficiency in scans at the cost of updates. While these techniques can be straightforwardly mapped to the data structures presented in the thesis, they would equally undermine their ability in updates and, therefore, serve as the main points for future work