Better ranking for big data using seriation solution

The recent availability of large data sets leads to the challenge of identifying patterns and structures in them. These patterns can be deployed to improve the user experience

Publication date: 15-12-2016


The recent availability of large data sets leads to the challenge of identifying patterns and structures in them. These patterns can be deployed to improve the user experience, like finding better rankings for ‘the best hotels that are still available in the Christmas holiday near the city centre of Rome’. Researcher Matteo Seminaroti from Centrum Wiskunde & Informatica (CWI) in Amsterdam developed methods to improve sorting and ranking methods by using combinatorial optimization techniques for so-called seriation problems. He defended his PhD thesis, entitled ‘Combinatorial Algorithms for the Seriation Problem’ on 2 December 2016 at Tilburg University. His research can be used for better and faster ranking big data in machine learning and data analysis.  

Seminaroti says: “Ordering results is increasingly important in applications where companies analyze data that do not have a natural order, like user ratings, images, music, movies or nodes in social networks. For large data sets it is often more practical to express preferences as relative comparisons between any two objects. It is, for instance, difficult to decide which movie is number 70 of your top-100 favourite films but it is easy to choose between any two of them. In data analysis, mainly two techniques are used: classification and clustering. I focus on a third technique: seriation, which strives to rank objects according to their relative similarity.”

“For this seriation problem I designed an algorithm, which, for the first time, extends so-called multisweep graph search algorithms to weighted graphs. What strikes me is that, although my new algorithm is extremely simple and easy to implement, it took me a year to prove its correctness! It is called Similarity-First-Search (SFS) and it enables us to analyze larger data sets and networks. It works quite fast, sorting and ranking thousands of objects in less than a minute. I’m currently busy making a library for the statistical program R to implement my new algorithm. Starting from this, we can develop a lot of new concepts.”

The PhD research of Matteo Seminaroti was carried out at CWI in the Networks and Optimization research group, and was funded by the ITN European project Mixed Integer Nonlinear Optimization (MINO). His promotor is Monique Laurent (CWI and Tilburg University) and his copromotor is Renata Sotirov (Tilburg University).

More information:

Thesis

Networks & Optimization group

Data sorting without the new seriation method. Source: CWI.Data sorting after the new seriation method. Source: CWI.Picture: left the original unordered matrix of data, where colours indicate the intensity of similarity of results; right the ordered matrix, where similar objects are ordered close to each other. This makes ranking results better and faster. Source: CWI.