Evolutionary Intelligence Seminars 2020

Overview of the seminars in 2020

Date: 15 December 2020

Arkadiy Dushatskiy
Title: Deep learning for real-world medical image segmentation

Abstract: Automatic organ segmentation is an essential part of automatic radiotherapy treatment planning. We study how state-of-the-art deep learning methods can be successfully applied to real-world medical image segmentation. The research is based on datasets used in clinical practice such as MRI data of radiotherapy treatments at AMC.

Date: 1 December 2020

Timo Deist
Title: Multi-objective inference by hypervolume-based Pareto front generation

Abstract: Multi-objective (MO) decision-making requires trading-off conflicting goals. To assist decision-making when preferred trade-offs are unknown, MO optimization literature describes techniques to approximate Pareto fronts which consist of optimal decisions for each trade-off. In this talk, we present preliminary results on treating MO statistical inference as a machine learning problem using a hypervolume-based loss function. We describe how to train sets of neural networks to produce Pareto front estimates for new problem instances.

Date: 17 November 2020

Solon Pissis
Title: Pattern Masking for Dictionary Matching

Abstract: Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. In this talk, we will investigate the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary D of d strings, each of length L, a query string q of length L, and a positive integer z, and we are asked to compute a smallest set K⊆ {1,...,L}, so that if q[i], for all i∈K, is replaced by a wildcard, then q matches at least z out of d strings from D.

Date: 3 November 2020

Haodi Zhong
Title: Clustering datasets with demographics and diagnosis codes

Abstract: Clustering data derived from Electronic Health Record (EHR) systems is important to discover relationships between the clinical profiles of patients and as a pre-processing step for analysis tasks, such as classification. However, the heterogeneity of these data makes the application of existing clustering methods difficult and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient’s values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprised of over 26,000 and 52,000 records, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.

Date: 20 October 2020

Rosanne Wallin
Title: Applicability of phylogenetic network algorithms to represent the evolutionary history of SARS-CoV-2

The outbreaks of SARS (2003), MERS (2012) and most recently COVID-19 have led to a lot of public and scientific interest into the origins of the coronaviruses that cause these diseases. Traditionally, evolutionary history is inferred and displayed by constructing phylogenetic trees, but these are restricted to displaying vertical evolution. Phylogenetic networks are designed to represent more complex evolutionary relationships and are able to display horizontal evolutionary events, such as recombination in viruses. Several algorithms have been developed to construct such networks, but their suitability for coronavirus data is still unclear. In this presentation, I will shortly explain the concepts of viral recombination and phylogenetic networks and talk about the applicability of five phylogenetic network methods (TriLoNet, TriL2Net, Tree-Child Networks, Temporal Hybridization Number and Maximum-Pseudo Likelihood in PhyloNet) to a genomic data set of SARS-CoV-2 and related coronaviruses. I will show the influence of preprocessing steps (taxon selection, edge contraction and resolving multifurcations) and discuss the limitations of the different methods in terms of input size and consistency of the constructed networks. I will also discuss the biological interpretation of the resulting networks with respect to the evolutionary history of SARS-CoV-2.

Date: 6 October 2020

Monika Grewal
Title: Developing a deep learning method for automatic organ segmentation

Abstract: Automatic organ segmentation can save hours of manual work required by clinicians in the process of radiotherapy treatment planning. Despite the availability of a plethora of deep learning methods for semantic segmentation and their surprisingly good performance on new unseen data, developing a deep learning method for a new problem is still full of challenges. At the bottom of these challenges is the need for a large annotated dataset for supervised learning of a neural network. In the medical imaging domain, the time and efforts required for building a large annotated dataset become even scarcer and costlier due to the underlying requirement of clinical expertise for annotating the data. Therefore, in order to avoid the need for manual annotation, we scraped a big dataset from the clinical database of a hospital along with the clinically available annotations. In this presentation, I will talk about the steps followed by us to develop a considerable size of fully annotated data suitable for supervised learning. I will also discuss the effects of different decisions related to sampling, preprocessing, and cleaning of data on the performance of baseline deep learning method for organ segmentation.

Date: 22 September 2020

Tom den Ottelander
Title: What matters most for Neural Architecture Search: An analysis of search strategies, from simple to complex

Abstract: Computer vision tasks, like supervised image classification, are effectively tackled by convolutional neural networks, provided that the architecture, which defines the structure of the network, is set correctly. Neural Architecture Search (NAS) is a relatively young and increasingly popular field that is concerned with automatically optimizing the architecture of neural networks. Previously known work shows that even though recent works typically develop increasingly complex search strategies for NAS, several do not significantly outperform simple approaches like randomly sampling from the search space on single-objective NAS tasks. Additionally, proper ablation studies are often missing, therefore it is uncertain which mechanisms are fundamental for algorithms to achieve excellent NAS performances. In the first part of this thesis, Local Search (LS) and Uniform Size Random Search (USRS), are proposed for multi-objective (MO) NAS, demonstrating that very simple algorithms can provide searching performances close to state-of-the-art evolutionary algorithms (EAs), while outperforming random search. The second part explores what mechanisms are essential for the Multi-Objective Gene-pool Optimal Mixing Evolutionary Algorithm (MO-GOMEA), a state-of-the-art model-based EA, to achieve excellent performances for searching NAS spaces. The automatic population-sizing scheme of MO-GOMEA offers a welcome anytime-performance, but objective space clustering has only a small beneficial impact. The number of clusters can be set arbitrarily. Extreme clusters can be enabled to the practitioner’s preference, resulting in different search behaviors. The improved performance gained by automatic linkage learning is limited, although it can be helpful on future, more complex search spaces.

Date: 8 September 2020

Speaker: Michelle Sweering
Title: String Sanitization under Edit Distance: Improved and Generalized

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_ED such that:
(i) no string of S occurs in X_ED;
(ii) the order of all other length-k substrings over Σ is the same in W and in X_ED; and
(iii) X_ED has minimal edit distance to W.
When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].

ETFS can be solved in O(n^2k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2−δ}) time, for any δ>0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
(i) an O(n^2log^2k)-time algorithm to solve ETFS; and
(ii) an O(n^2log^2n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths.

Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.