The performance of modern computers is improving rapidly. Database technology, however, is not yet able to profit fully from these hardware improvements. Marcin Zukowski of the Centrum Wiskunde & Informatica (CWI) in Amsterdam has designed new methods for data processing, implemented in the MonetDB/X100 system. These methods can improve the performance of database systems by a factor of 10 or more. Zukowski will defend his PhD thesis ‘Balancing Vectorized Query Execution with Bandwidth-Optimized Storage’ on September 11, 2009 at the University of Amsterdam. His research forms the basis for the CWI spin-off company VectorWise, where Zukowski is currently employed. This company was recently in the news for its collaboration with Ingres Corporation, a leading open-source database vendor.
Modern database systems are used for tasks such as credit card analysis, mining billions of customer records from large retail chains (data mining), and processing call data at telephone companies. In some areas, however, databases are not deployed because they are too slow, and specialized systems are typically used instead. Marcin Zukowski has now designed a general method that not only benefits existing database applications but also, thanks to its greatly improved performance, makes database technology usable in new areas. One example is searching for keywords in large amounts of unstructured information, as in information retrieval.
Bottles and crates
The core component of the system proposed by Zukowski is a new approach to data processing, called the “vectorized in-cache execution model”. It removes software overhead that conventional systems incur when processing data. "Compare it to buying beers for a party," says the CWI researcher. "Someone can get a bottle of beer in the store, put it in the fridge at home and then go to the store again and again, carrying one bottle each time. What we do can be compared to someone who fetches two crates of beer in one go. That is much more efficient." To achieve this, he proposes modifications to the pipelined operator model found in most databases, so that operators exchange batches of values rather than one value at a time. The benefits are improved scalability and high performance in “bulk data processing”. "Benchmarks have shown that the performance increased by orders of magnitude, often by a factor of more than 100," Zukowski said. Controlling the data explosion is an important theme at CWI, and this research is a good example.
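The bottles-and-crates idea can be illustrated with a minimal Python sketch. This is not code from MonetDB/X100; the function names, the vector size of 1024 and the example query are assumptions chosen only for illustration. It runs the same filter-and-sum query twice, once with operators that pass a single value at a time and once with operators that pass whole vectors, and counts how many times the final operator has to fetch data from the one below it.

```python
VECTOR_SIZE = 1024  # assumed value; small enough for a vector to stay in the CPU cache


def scan_tuples(column):
    """Tuple-at-a-time scan: hands over one value per request (one bottle)."""
    for value in column:
        yield value


def select_tuples(child, threshold):
    """Tuple-at-a-time filter: passes on single values above the threshold."""
    for value in child:
        if value > threshold:
            yield value


def scan_vectors(column, size=VECTOR_SIZE):
    """Vectorized scan: hands over a slice of up to `size` values (a crate)."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def select_vectors(child, threshold):
    """Vectorized filter: processes a whole vector in one tight loop."""
    for vector in child:
        selected = [v for v in vector if v > threshold]
        if selected:  # skip vectors in which nothing qualified
            yield selected


def sum_tuples(child):
    """Sums single values and counts the fetches from the operator below."""
    total, fetches = 0, 0
    for value in child:
        fetches += 1
        total += value
    return total, fetches


def sum_vectors(child):
    """Sums whole vectors and counts the fetches from the operator below."""
    total, fetches = 0, 0
    for vector in child:
        fetches += 1
        total += sum(vector)
    return total, fetches


column = list(range(10_000))  # made-up data for the illustration
tuple_total, tuple_fetches = sum_tuples(select_tuples(scan_tuples(column), 4999))
vector_total, vector_fetches = sum_vectors(select_vectors(scan_vectors(column), 4999))

print(tuple_total == vector_total)    # both plans compute the same answer
print(tuple_fetches, vector_fetches)  # far fewer fetches in the vectorized plan
```

Both plans return the same result, but the vectorized plan makes only a handful of trips between operators, so the fixed cost of each trip is spread over many values. In an interpreted language such as Python the sketch only illustrates the structure; it says nothing about the speedups reported for the real system.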