# Life Sciences and Health Seminar

## Upcoming

31/05 Sanne Abeln

## Past

Date: 17 May 2022

Speaker: **Monika Grewal**

Title: Uncertainty Estimation to Learn Effectively from Clinically Available Data for Organs at Risk Segmentation

Abstract: Epistemic uncertainty refers to the lack of knowledge in a model about the underlying data. Epistemic uncertainty is higher for out-of-domain samples or in other words the samples for which making a prediction is difficult (hard examples). In this presentation, I will discuss how this concept can be used to effectively learn a deep learning model for automatic segmentation of organs at risk from a large dataset of Computed Tomography (CT) scans containing class imbalance and missing annotations. I will share the basic idea, preliminary results, and challenges.

Speaker: **Dazhuang Liu**

Title: Probability distribution based constants optimization in symbolic regression

Abstract: Symbolic regression (SR) is a method to discover machine learning models from data that are mathematical equations. In fact, these models are made by composing operators (+, -, * and /), variables (features of the dataset), and constants (0.1, 10, e, Pi). Since SR models are composed of commonly used mathematical operations, they are well suited for human interpretation (important for explainable AI). However, SR is hard because both the structure of the model (i.e., how to compose the operations) and the constant values must be optimized at the same time. Several works concerning this topic have been proposed in the literature, such as combining evolutionary algorithms with gradient descent. However, existing approaches have several limitations. To solve this problem in an efficient and effective way, we are working on a new evolutionary algorithm (EA) to jointly optimize the structure and the constants at the same time. Namely, our approach works by clustering together multiple “structures” that may benefit from similar constants from the population of the EA. Then, for each cluster, a different probability distribution (PD) is modelled and used to sample new promising constant values. Early experiment suggest that the approach is very promising.

Date: 19 April 2022

Speaker: ** Leah Dickhoff**

Title: Automated optimization for cervix brachytherapy requires more then the EMBRACE-II planning aims

Abstract: A bi-objective model was introduced for MO-RV-GOMEA-based optimization for prostate HDR brachytherapy, intuitively capturing the dose trade-off between tumour target coverage and healthy organ sparing. This approach entails direct optimization on dose volume indices specified in a clinical protocol. We studied its immediate extension to cervical cancer brachytherapy by optimizing on planning aims from the officially recommended EMBRACE-II protocol. Results were judged as clinically unsatisfactory by medical specialists because of undesirable properties in the dose distributions, indicating that additional aims are needed. A first approach optimizing on both the EMBRACE-II aims as well as additionally tuned aims has been developed and indicates promising results.

Speaker: **Evi Sijben **

Title: AI for decision support of paraganglioma treatment planning

Abstract: Paraganglioma are tumors that appear in the head and neck area. Although they are usually benign slow-growing tumors, they can cause severe complaints. These complaints are dependent on the growth of the tumor. being able to predict the future volume of the tumor would be helpful to give inevitable treatment early on but to avoid unnecessary and stressful surveillance. In this talk, methods for paraganglioma tumor measurement and knowledge about tumor growth are presented as well as possible ways to further support the decision making process with help of AI.

Date: 22 March 2022

Speaker: **Aleksandr Chebykin**

Title: Searching for architectures and hyperparameters of neural networks

Abstract: What does it actually mean to search for architectures and hyperparameters of neural networks? In this talk I will give a general step-by-step explanation of how one solves problems via neural networks, in order to provide the necessary context to our research; having explained the existing process, I'll describe how automatic architecture & hyperparameter can potentially improve it, and what our current research direction is.

Date: 22 February 2022

Speaker: ** Marco Virgolin**

Title: Counterfactual explanations: A bird-eye view

Abstract: Counterfactual explanations (CEs) are a powerful means for understanding how decisions made by machine learning models can be changed. In particular, a CE prescribes how a user (a point in feature space x with coordinates age_x=32, gender_x=Male, salary_x=...) needs to intervene (x->x') so that the classification made by a model (f(x)="no loan") changes, typically in favor of the user (f(x')="yes loan"). Obtaining "useful" CEs is a search problem that depends on many aspects. To begin with, we must define a distance d such that x'=argmin d(x,x') will reasonably translate to minimal intervention effort in real life. Next, multiple constraints and desiderata must be considered for practical adoption, such as plausibility of intervention (age_x' - age_x > 0), causal relationships (increment in salary->increment in savings), and robustness to perturbations of different nature (e.g., to measurements in x or updates to f). Finally, availability of access to and information on the model is important: f may or may not be re-trained (e.g., to smoothen its decision boundary), expose gradients or provide only labels, be queried an unlimited or limited number of times, and so on. This multitude of important aspects contributes to establishing important links to many research areas, such as optimization (multi-objective, gradient-based or gradient-free, expensive), machine learning, and also areas concerning ethics and fairness.

Speaker: **Esteban Gabory**

Title: On Strings Having the Same Length-k Substrings

Abstract: Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k>0. We study the following basic string problem, called Shortest equivalent string: Given a set S_k of n length-k strings and an integer z>0, find a shortest string T such that Substr_k(T)=S_k. The Shortest equivalent string problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. We answer this problem by showing that it is equivalent to finding a shortest walk that crosses every edge of a graph, and we give an algorithm for the latter problem.

Date: 8 February 2022

Speaker: **Wiktor Zuba**

Title: Stringology - different models and approaches

Abstract: In the talk I am going to present an overview of the results from some of the papers I co-authored with the stringology group of the University of Warsaw. The talk contains an overview of the problem of finding a cover (a smaller text whose occurrences cover all the positions of the input text) in many distinct stringological models and the differences encountered while studying them. A different result shows a use of stringology for solving a graph problem - a compression of a representation of a Hamiltonian cycle of a sigma-tau Cayley graph of permutations. The use of the compressed representation results in finding efficient ranking and unranking algorithms for the cycle. Yet another result shows a conditional hardness of a problem of finding an Abelian square (a word whose two parts contain the same multiset of letters) - that is show, that the problem (almost surely) cannot be solved in strongly subquadratic time (while most stringology papers aim in obtaining linear time solutions to problems).

Date: 25 January 2022

Speaker: **Georgios Andreadis**

Title: Multi-Objective Dual Simplex-Mesh Based Deformable Image Registration for 3D Medical Images

Abstract: Transferring information between images with large anatomical differences is an open challenge in medical image analysis. Existing methods require extensive up-front parameter tuning and have difficulty capturing large deformations and content mismatches. Using a multi-objective optimization approach with a dual-dynamic grid transformation model has previously proven effective at overcoming these issues while also producing a diverse set of high-quality registrations for 2D images. We successfully introduce the first method for multi-objective deformable image registration for 3D images, based on a new 3D dual-dynamic grid transformation model and using the Real-Valued Gene-pool Optimal Mixing Evolutionary Algorithm (RV-GOMEA) parallelized on a GPU. Our proof-of-concept prototype shows promising results on a synthetic and clinical registration problem. This talk is based on a conference paper to be published at the SPIE’22 Medical Imaging - Image Processing conference.

Speaker: **Solon P Pissis**

Title: Differentially Private String Sanitization for Frequency-Based Mining Tasks

Abstract: Strings are used to model genomic, natural language, and web activity data, and are thus often shared broadly. However, string data sharing has raised privacy concerns stemming from the fact that knowledge of length-k substrings of a string and their frequencies (multiplicities) may be sufficient to uniquely reconstruct the string; and from that the inference of such substrings may leak confidential information. We thus introduce the problem of protecting length-k substrings of a single string S by applying Differential Privacy (DP) while maximizing data utility for frequency-based mining tasks. Our theoretical and empirical evidence suggests that classic DP mechanisms are not suitable to address the problem. In response, we employ the order-k de Bruijn graph G of S and propose a sampling-based mechanism for enforcing DP on G. We consider the task of enforcing DP on G using our mechanism while preserving the normalized edge multiplicities in G. We define an optimization problem on integer edge weights that is central to this task and develop an algorithm based on dynamic programming to solve it exactly. We also consider two variants of this problem with real edge weights. By relaxing the constraint of integer edge weights, we are able to develop linear-time exact algorithms for these variants, which we use as stepping stones towards effective heuristics. An extensive experimental evaluation using real-world large-scale strings (in the order of billions of letters) shows that our heuristics are efficient and produce near-optimal solutions which preserve data utility for frequency-based mining tasks.

*This talk is based on a conference paper presented at ICDM 2021 (joint work with Huiping Chen, Changyu Dong, Liyue Fan, Grigorios Loukides, and Leen Stougie).*

Date: 11 January 2022

Speaker: **Arkadiy Dushatskiy**

Title: Data variation-aware medical image segmentation

Abstract: Deep learning algorithms have become the golden standard for segmentation of medical imaging data. In most works, the variability and heterogeneity of real clinical data is acknowledged to still be a problem. One way to automatically overcome this is to capture and exploit this variation explicitly. Here, we propose an approach that improves on our previous work in this area and explain how it potentially can improve clinical acceptance of (semi-)automatic segmentation methods. In experiments with a real clinical dataset of CT scans with prostate segmentations, our approach provides an improvement of several percentage points in terms of Dice and surface Dice coefficients compared to a standard Deep Learning approach. Noticeably, the largest improvement occurs in the upper part of the prostate that is known to be most prone to inter-observer segmentation variation.

Speaker: ** Michelle Sweering**

Title:Making de Bruijn Graphs Eulerian via Near-Optimal Algorithms for Connecting and Balancing

Date: 14 December 2021

Speaker: **Haodi Zhong **(King's College London)

Title: Clustering demographics and sequences of diagnosis codes

Abstract: A Relational-Sequential dataset (or RS-dataset for short) contains records comprised of a patients values in demographic attributes and their sequence of diagnosis codes. The task of clustering an RS-dataset is helpful for analyses ranging from pattern mining to classification. However, existing methods are not appropriate to perform this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. We formalize the task of clustering an RS-dataset as an optimization problem. At the heart of the problem is a distance measure we design to quantify the pairwise similarity between records of an RS-dataset. Our measure uses a tree structure that encodes hierarchical relationships between records, based on their demographics, as well as an edit-distance-like measure that captures both the sequentiality and the semantic similarity of diagnosis codes. We also develop an algorithm which first identifies k representative records (centers), for a given k, and then constructs clusters, each containing one center and the records that are closer to the center compared to other centers. Experiments using two Electronic Health Record datasets demonstrate that our algorithm constructs compact and well-separated clusters, which preserve meaningful relationships between demographics and sequences of diagnosis codes, while being efficient and scalable.

This is a joint work with Grigorios Loukides and Solon P. Pissis; the paper appeared in IEEE Journal of Biomedical and Health Informatics [https://doi.org/10.1109/JBHI.2021.3129461].

Date: 30 November 2021

Speaker: **Arthur Guijt**

Abstract: Linkage Learning is at the core of most of GOMEA's scalability improvements, and is a core component of the GOMEA family of approaches. Yet if Linkage Learning fails to work, performance is reduced drastically. In this presentation I'll briefly discuss the current approach, the implicit assumptions that are made, and the failure modes that may occur in practice. Furthermore, I will talk about using neighborhoods - more specifically Local Neighborhoods, how learning linkage locally resolves some of these common failure modes, and how (much) the definition of locality matters.

Speaker: **Anton Bouter**

Title: Optimal Mixing Evolutionary Algorithms for Large-Scale Real-Valued Optimization

Abstract: It is known that to achieve efficient scalability of an Evolutionary Algorithm (EA), dependencies (also known as linkage) must be properly taken into account during variation. In a Gray-Box Optimization (GBO) setting, exploiting prior knowledge regarding these dependencies can greatly benefit optimization. We specifically consider the setting where partial evaluations are possible, meaning that the partial modification of a solution can be efficiently evaluated. Such problems are potentially very difficult, e.g., non-separable, multi-modal, and multi-objective. The Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) can effectively exploit partial evaluations, leading to a substantial improvement in performance and scalability. Such partial evaluations have been applied to various real-world problems, including the optimization of treatment plans for prostate cancer, and deformable image registration.

Date: 16 November 2021

Speaker: **Joe Harrison**

Title: How GOMEA can be applied to Cartesian Genetic Programming

Abstract: Genetic Programming (GP) has been a popular technique for symbolic regression. In GP expression trees are evolved by means of subtree crossover and point mutation to fit a function. In Cartesian Genetic Programming (CGP) a feed-forward acyclic graph is used rather than a tree and typically the only variation operators are point mutations. An advantage of using CGP is that nodes in the graph can be reused and more complex configurations such as graphs with skip-layers or multiple outputs are possible. The Gene-Optimal Mixing Evolutionary Algorithm (GOMEA) aims to find linkage amongst components (e.g. nodes in a graph). GOMEA was successfully applied to GP. It groups tree nodes with linkage into subsets of nodes and exchanges these subsets as means of variation. In this presentation I will discuss how GOMEA can be applied to Cartesian Genetic Programming, the pros and cons of this application and how this affects node reuse.

Speaker: **Leen Stougie**

Date: 2 November 2021

Spekaer: **Aleksandr Chebykin**

Title: *Efficient Neural Architecture Search for ensemble learning*

Abstract: Neural Architecture Search is the problem of learning a task-specific neural network structure from data. During the search, many candidate architectures need to be considered, but it is computationally infeasible to independently train them all. Weight sharing makes the search more efficient but also reduces the space of possible architectures. The tradeoffs of differently restricted search spaces are not obvious. Our idea is to traverse several spaces simultaneously in order to efficiently find an ensemble that is both small and well-performing.

Spekaer: **Solon Pissis**

Title: * Bidirectional String Anchors: A New String Sampling Mechanism*

Abstract: In a previous LSH seminar, we presented bidirectional string anchors, a new mechanism for sampling strings, which has small density and is computable in linear time. Furthermore, we showed that by using this sampling, we can efficiently construct a small sketch that answers pattern matching queries in near-optimal time. In this seminar, we will present an experimental evaluation, which highlights the applicability of our sketch.

Date: 19 October 2021

Speaker: **Riccardo Guidotti**

Title: *Explaining Explanation Methods*

Abstract: The most effective Artificial Intelligence (AI) systems exploit complex machine learning models to fulfil their tasks due to their high performance. Unfortunately, the most effective machine learning models use for their decision processes a logic not understandable from humans that makes them real black-box models. The lack of transparency on how AI systems make decisions is a clear limitation in their adoption in safety-critical and socially sensitive contexts. Consequently, since the applications in which AI are employed are various, research in eXplainable AI (XAI) has recently caught much attention, with specific distinct requirements for different types of explanations for different users. In this webinar, we briefly present the existing explanation problems, the main strategies adopted to solve them, and the most common types of explanations are illustrated with references to state-of-the-art explanation methods able to retrieve them.

Short bio: Riccardo Guidotti is Assistant Professor at University of Pisa. Riccardo Guidotti was born in 1988 in Pitigliano (GR) Italy. In 2013 and 2010 he graduated cum laude in Computer Science (MS and BS) at University of Pisa. He received the PhD in Computer Science with a thesis on Personal Data Analytics in the same institution. He is currently an Assistant Professor at the Department of Computer Science University of Pisa, Italy and a member of the Knowledge Discovery and Data Mining Laboratory (KDDLab), a joint research group with the Information Science and Technology Institute of the National Research Council in Pisa. He won the IBM fellowship program and has been an intern in IBM Research Dublin, Ireland in 2015. He also won the DSAA New Generation Data Scientist Award 2018. His research interests are in explainable artificial intelligence, interpretable machine learning, quantum computing, fairness and bias detection, data generation and causal models, personal data mining, clustering, analysis of transactional data.

Date: 5 October 2021

Speaker: **Sanne Abeln**

Title: *Bioinformatics: from amyloid formation to oncology*

Abstract: In our research group at the Computer Science Department of the VU we look at topics ranging from molecular dynamics simulations to large scale omics analysis with machine learning, most of it applied to the health domains of neurodegenerative disease and oncology. In this talk I will give an overview of current results on multi-task learning for protein structural features, simulations that show the role of hydrophobicity in amyloid formation, a machine learning based breakpoint analysis for colorectal cancer and a simple model for prediction of proteins in extracellular vesicles.

Speaker: ** Leah Dickhoff**

Title: *Adaptive optimization for bi-objective treatment planning in cervical cancer brachytherapy*

Abstract: The previously developed bi-objective optimization method using the Real-Valued Gene-pool Optimal Mixing Evolutionary Algorithm (RV-GOMEA) for prostate high-dose-rate brachytherapy (BT) has been extended to cervical cancer BT. This model directly optimizes on dose volume indices (DVIs), which describe specific maximum or minimum dose to subvolume relationships for each of the targets and organs at risk. Discussions with medical specialists have revealed that optimizing solely on the DVIs from the ESTRO-recommended protocol EMBRACE II does not suffice to obtain clinically acceptable treatment plans. Therefore, additional (potentially hospital-specific) DVIs need to be added to the objective functions. Because of interpatient variations in organ, target and applicator geometry, patient-specific aspiration values are required for these added DVIs. This entails the need for an adaptive optimization method, possibly incorporating different priorities in the objectives.

Date: 21 September 2021

Speaker: **Giulia Bernardini**

Title: *Incomplete Directed Perfect Phylogeny in Linear Time*

Abstract: Reconstructing the evolutionary history of a set of species is a central task in computational biology. In real data, it is often the case that some information is missing: the Incomplete Directed Perfect Phylogeny problem asks, given a collection of species described by a set of binary characters with some unknown states, to complete the missing states in such a way that the result can be explained with a perfect directed phylogeny. Pe'er et al. proposed a solution that takes quasilinear time. Their algorithm relies on pre-existing dynamic connectivity data structures: a computational study recently conducted by Fernàndez-Baca and Liu showed that, in this context, complex data structures perform worse than simpler ones with worse asymptotic bounds. This gives us the motivation to look into the particular properties of the dynamic connectivity problem in this setting, so as to avoid the use of sophisticated data structures as a blackbox. Not only are we successful in doing so, and give a much simpler quasilinear-time algorithm for the Incomplete Directed Perfect Phylogeny problem; our insights into the specific structure of the problem lead to an asymptotically faster algorithm, that runs in optimal linear time.

Speaker: **Cedric Rodriguez**

Title: * Bio-mechanical modeling of large deformations of the uterus and bladder between External Beam Radiotherapy and Brachytherapy for cervical cancer*

Abstract: The validation of Deformable Image Registration (DIR) algorithms within the radiotherapy field is challenging due to the missing ground truth in medical imaging. Therefore, we are trying to create a virtual phantom of the abdominal region to perform controlled bio-mechanical deformations. Our goal is to simulate the expected tissue deformation of the uterus, cervix, vagina and bladder due to the applicator insertion between External Beam Radiotherapy (EBRT) and Brachytherapy (BT) for cervical cancer. Our approach consists of the generation of volumetric models of the organ at risk (OARs) using the initial diagnostic MR scans. Thereafter, a Finite Element Method will simulate the interaction between the applicator and the OARs to predict the final organ state before BT. This approach will result in ground truth deformation vector fields that could be used to create augmented medical imaging for the validation of state-of-the-art DIR algorithms for radiotherapy.

Date: 7 September 2021

Speaker: **Prof. Marcel van Herk, Chair in Radiotherapy Physics, The University of Manchester **

Title: *On software development for medical applications *

Date: 27 July 2021

Speaker: **Arthur Guijt**

Title: *Improving GOMEA for Neural Architecture Search by ‘Kernelization’*

Abstract: Currently GOMEA for discrete spaces performs its search in a mostly global fashion, learning a single linkage model per population, and visiting each solution exactly once a generation to perform GOM. Within the context of Neural Architecture Search these traits have some notable downsides. The aim is to do get rid of this generational lockstep by ‘kernelization’, letting each solution be its own ‘individual’ that uses similarity & locality to improve itself, for example by learning its own linkage model. I display some initial results obtained by running GOMEA in this fashion, showing that a performance increase can be obtained in certain settings.

Date: 13 July 2021

Speaker: **Damy Ha**

Title: *Hybridizing UHV-GOMEA with UHV-ADAM for real-valued multi objective optimization*

Abstract: In real-valued single objective optimization, algorithms that make use of gradient information have shown to very efficient in finding the optimal solution if the problem has few local optima. Despite evolutionary algorithms being known for their ability to avoid converging to local optima, problems that have many local optima are generally still solved faster by so called hybrid evolutionary algorithms (which employ gradient information) compared to pure evolutionary algorithms. This faster convergence of hybrid evolutionary algorithms has not been shown for multi objective problems yet. Recently a new technique has been created that redefines multi objective problems to single objective problems. This brings back the question if using this new technique allows a hybrid evolutionary algorithm to converge faster to the optimal solution than a pure evolutionary algorithm. This work hybridizes the evolutionary algorithm UHV-GOMEA with the gradient technique UHV-ADAM and investigates its performance.

Date: 29 June 2021

Speaker: **Martijn Bosma**

Title: *The Effect of Performance Estimation on Evolutionary Neural Architecture Search for Medical Image Segmentation*

Abstract: In order to automate MRI/CT scan interpretation, accurate organ segmentation ML is needed. As organs have different characteristics (volume, density, clear boundaries), using Neural Architecture Search to create neural networks tailored to a specific task, can help in obtaining accurate segmentations. In this research, a novel search space is designed and evaluated. The performance of the obtained networks is compared to other SOTA neural networks. Next to these results, an analysis is done on the effect of performance estimation on evolutionary search algorithms in terms of accuracy and speed-up. This analysis demonstrates that performance estimation at different Spearman correlation levels can deteriorate the accuracy of a NAS algorithm, for similar run times.

Date: 15 June 2021

Speaker: **Joost Commandeur**

Title: *Improving the homogeneity (reducing hot spots) of treatment plans for high dose rate brachytherapy as generated by BRIGHT (MO-RV-GOMEA)*

Abstract: BRIGHT (an implementation of Multi Objective Real Valued Gene-Pool Optimal Mixing) has been in use to create brachytherapy treatment plans for patients with prostate cancer since the spring of 2020. A treatment plan consist of so-called 'dwell times', which is the duration that a radioactive source stays in a single 'dwell position' within a catheter (hollow needle) that has been placed inside the body. The goal of such a treatment plan is to deliver the prescribed radioactive dose to the target organs (i.a. prostate), whilst sparing surrounding organs and tissue (i.a. bladder). This multi-objective goal is formulated in a clinical protocol which states the prescribed upper and lower bounds for the different organs involved. In practice it has been noticed that even though the plans as generated by BRIGHT adhere to the clinical protocol, the clinical experts still adjust the resulting plans. In our research we try to tackle one of the aspects that requires adjustment, which is the homogeneity of the delivered radioactive dose by adjusting BRIGHT.

Date: 1 June 2021

Speaker: **Evi Sijben**

Title: *Diverse high quality solutions using multi-tree multi-objective gene pool optimal mixing evolutionary algorithms*

Abstract: Being able to find diverse high quality solutions in machine learning can have many purposes. One such example is to compute uncertainty estimates. Here we assume that the more aligned the different models are the more certain we are about the validity of the output. We search for multiple models, using a multi-tree representation. These multi-tree models are evaluated on multiple objectives, namely diversity between trees and quality of all trees. By doing so, we integrate into the search process a desirable characteristic of human explanatory cognition; exploring alternative hypotheses to explain observations.

Speaker: **Arkadiy Dushatskiy**

Title: *A Novel Surrogate-assisted Evolutionary Algorithm Applied to Partition-based Ensemble Learning*

Abstract: We propose a novel surrogate-assisted Evolutionary Algorithm for solving expensive combinatorial optimization problems. We integrate a surrogate model, which is used for fitness value estimation, into a state-of-the-art P3-like variant of the Gene-Pool Optimal Mixing Evolutionary Algorithm (GOMEA) and adapt the resulting algorithm for solving non-binary combinatorial problems. We test the proposed algorithm on an ensemble learning problem. In our experiments we use five classification datasets from the OpenML-CC18 benchmark and Support-vector Machines as learners in an ensemble. The proposed algorithm demonstrates better performance than alternative approaches, including Bayesian optimization algorithms. It manages to find better solutions using just several thousand fitness function evaluations for an ensemble learning problem with up to 500 variables.

Date: 18 May 2021

Speaker: **Leen Stougie**

Title: *An Approximation Algorithm for a Phylogenetic Network Problem*

Abstract: Often different parts of the genome give rise to different phylogenetic relationships in the form of trees. This is usually not a matter of inexact data, but a true representation of the phenomenon that there may have occurred hybridization between subspecies. This should rather be displayed by a network than by a tree. Therefore rather recently various models for displaying such hyberdization events have been proposed. One of the oldest ones is the Maximum Agreement Forest problem. This is an NP-hard problem and over the past two decades ever improving approximation algorithms have been presented. In this lecture I will explain you the problem and give its raison d'être. Then I will present a high brow view of a 2-approximate algorithm for the problem, which is currently the best in terms of worst-case performance analysis. This is joint work with Neil Olver (London School of Econ.), Frans Schalekamp (Cornell), Anke van Zuylen (Cornell)

Speaker: **Georgios Andreadis**

Title: *Multi-Objective Deformable Image Registration for 3D Images Using the Gene-pool Optimal Mixing Evolutionary Algorithm*

Abstract: Finding the most likely deformation of one image to another image is a problem with important applications in medical imaging, where large deformations and structural changes are common. Approaching this problem of Deformable Image Registration (DIR) from a multi-objective perspective has proven successful at producing a set of realistic registrations for 2D images for the user to choose from. This multi-objective optimization process is driven by the Real-Valued Gene-pool Optimal Mixing Evolutionary Algorithm (RV-GOMEA). Recent parallelization efforts with this algorithm on the Graphics Processing Unit (GPU) have delivered large speed-ups for 2D images by exploiting the conditional independence of local regions. This work builds on these advances and adapts the approach to support 3D images, introducing a new spatial model and a refined deformation energy model while still supporting annotated guidance information and multi-resolution schemes. We find that our proof-of-concept prototype can successfully tackle synthetic registration problems and are currently conducting experiments on patient scans to validate our approach on real-world problems.

Date: 4 May 2021

Speaker: **Alexandr Chebykin**

Title: *Improving Neural Architecture Search by encouraging diversity*

Abstract: Neural Architecture Search (NAS) is concerned with learning structure of a neural network from data. During the search, many candidate architectures need to be considered, but computational constraints make it impossible to independently train all candidates until convergence (or even evaluate them all). Various techniques make the search more efficient, while at the same time reducing the space of architectures that can be found. In this talk, I will describe our approach to improving upon a SOTA algorithm by optimizing an additional objective function that measures novelty of newly discovered solutions. Limitations of the approach, as well as of the current NAS paradigm in general, will also be discussed.

Speaker:** Joe Harrison**

Title: *Differentiable Cartesian genetic programming for small interpretable symbolic expressions*

Abstract: Interpretability is a desirable quality when it comes to machine learning. In genetic programming typically (binary) trees are used to evolve symbolic expressions that are human-readable. The leaf nodes of the tree can either be input data or ephemeral random constants (ERC) and the other nodes are operators (x, +, -, %) that take the output of its child nodes as input. In contrast, Cartesian genetic programming (CGP) uses a directed acyclic graph as a genotype. This allows for skip connections and reuse of salient patterns within the graph and possibly leads to more expressive formulas compared to a tree-based GP algorithm. Another issue with standard GP is that the ERC leaf nodes are instantiated once and do not change over time. By making the expression differentiable the 'constant' leaf nodes can be updated using backpropagation. This possibly takes away part of the burden of the GP algorithm having to evolve specific values using operator nodes. GP trees often suffer from bloat, a significant increase in solution size with insignificant improvement in terms of performance, which makes resulting expressions long and unreadable. CGP naturally uses a constrained template which prevents bloat. This template makes CGP an ideal candidate for Gene Optimal Mixing (GOM), a method where subsets of nodes are exchanged among solutions. GOM ensures improvement and offers the possibility of exploiting linkage information. With these adjustments, more expressive and shorter formulas are possible.

Date: 20 April 2021

Speaker: **Leah Dickhoff**

Title: Cervical cancer case modelling for bi-objective treatment planning in brachytherapy

Abstract: The previously developed bi-objective optimization method using the Real-Valued Gene-pool Optimal Mixing Evolutionary Algorithm (RV-GOMEA) for prostate high-dose-rate brachytherapy (BT) is extended to cervical cancer patients. The clinical objectives in terms of dose and dose-point indices, describing minimum doses to targets and maximum doses to organs at risk, are calculated directly from the EMBRACE-II protocol. They are then converted into a bi-objective formulation in which the trade-off is characterized by the Least Coverage Index versus Least Sparing Index. In order to generate clinically acceptable treatment plans, additional constraints on dwell time modulation and needle dose contribution are deemed necessary.

Speaker: **Timo Deist**

Title: Multi-objective learning for asymmetric Pareto fronts

Abstract: We will revisit the multi-objective (MO) learning problem to predict Pareto fronts presented in December. This time, I will highlight the issue of asymmetric Pareto fronts (asymmetry along the line L_1(x)=L_2(x)) that can occur when the competing loss functions are different in scale and curvature. Our proposed MO approach uses HV maximization and does not require specifying trade-offs before training. It is expected to outperform existing Multi-Task Learning (MTL) methods that require trade-offs to be known a priori which is a difficult requirement when Pareto front characteristics are unknown. Results from three experiments will be used to illustrate differences between our HV maximization-based approach and existing MTL algorithms.

Date: 6 April 2021

Speaker: **Solon Pissis**

Title: Bidirectional string anchors: A new mechanism for string sampling

Abstract: We will present a new mechanism for sampling strings, which has small density and is computable in linear time. Furthermore, we will show that by using this sampling, we can efficiently construct a small sketch that answers pattern matching queries in near-optimal time. This is a joint work (in progress) with Grigorios Loukides from King's College London.

Speaker:** Thomas Uriot**

Title: Interpretable dimensionality reduction using multi-objective Genetic Programming

Abstract: In this talk, we will discuss some of the ways to perform interpretable dimensionality reduction using Genetic Programming. These include using a teacher model, such as a neural-based autoencoder, a fully GP-based autoencoder and a manifold learning approach using GP as the function mapping from high to low dimensions. We also introduce variants of GP representation (vanilla, multi-tree, shared multi-tree). We compare all these approaches on several datasets by looking at their predictive power for downstream classification tasks.

Date: 23 March 2021

Speaker: **Giulia Bernardini**

Title: Constructing Strings Avoiding Forbidden Substrings

Abstract: We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set S of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no element of S occurs in x. Our first result is an O(|u|+|v|+k|S|)-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no element of S occurs in x. Our second result is an O(|u|+|v|+||S||*|Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S.

Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths. Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set S of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no element of S occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an O(nk|S|*|Σ|)-time algorithm to solve this problem.

Speaker: **Monika Grewal**

Title: Multi-Objective learning for Deformable Image Registration

Abstract: Deformable Image Registration (DIR) is an optimization task, wherein a Deformation Vector Field (DVF) is optimized to deform the source image such that it aligns with the target image. DIR is inherently multi-objective, requiring maximization or minimization of multiple conflicting objectives e.g., maximize image similarity and minimize deformation magnitude. Further, the computationally expensive nature of the DVF optimization gives a strong motivation to develop learning-based algorithms. In this talk, I will discuss a potential method for multi-objective learning of the DIR task.

Date: 9 March 2021

Speaker: Anton Bouter

Title: GPU-Accelerated Parallel Gene-pool Optimal Mixing applied to Multi-Objective Deformable Image Registration

Abstract: The Real-Valued Gene-pool Optimal Mixing Evolutionary Algorithm (RV-GOMEA) has previously been successfully used to achieve highly scalable optimization of various real-world problems in a gray-box optimization setting. Deformable Image Registration (DIR) is a multi-objective problem, aimed at finding the most likely non-rigid deformation of a given source image so that it matches a given target image. We specifically consider the case where the deformation model allows for finite-element-type modeling of tissue properties. This optimization problem is non-smooth, necessitating techniques like EAs to get good results. Though the objectives of DIR are non-separable, non-neighboring regions of the deformation grid are conditionally independent. We show that GOMEA allows to exploit such knowledge through the large-scale parallel application of variation steps, where each is only accepted when leading to an improvement, on a Graphics Processing Unit (GPU). On various 2-dimensional DIR problems, we find that this way, similar results can be achieved as when sequential processing is performed, while allowing for substantial speed-ups (up to a factor of 111) for the highest-dimensional problems (i.e., the highest deformation-grid resolution). This work opens the door to the extension of this type of DIR to larger (3-dimensional) deformation grids, and its application to other real-world problems.

Date: 23 February 2021

Speaker: Michelle Sweering

Date: 9 February 2021

next set of 8 presentations

Date: 26 January 2021

set of 8 presentations

Date: 15 December 2020

Speaker: Arkadiy Dushatskiy

Title: Deep learning for real-world medical image segmentation

Abstract: Automatic organ segmentation is an essential part of automatic radiotherapy treatment planning. We study how state-of-the-art deep learning methods can be successfully applied to real-world medical image segmentation. The research is based on datasets used in clinical practice such as MRI data of radiotherapy treatments at AMC.

Date: 1 December 2020

Speaker: Timo Deist

Title: Multi-objective inference by hypervolume-based Pareto front generation

Abstract: Multi-objective (MO) decision-making requires trading-off conflicting goals. To assist decision-making when preferred trade-offs are unknown, MO optimization literature describes techniques to approximate Pareto fronts which consist of optimal decisions for each trade-off. In this talk, we present preliminary results on treating MO statistical inference as a machine learning problem using a hypervolume-based loss function. We describe how to train sets of neural networks to produce Pareto front estimates for new problem instances.

Date: 17 November 2020

Speaker: Solon Pissis

Title: Pattern Masking for Dictionary Matching

Abstract: Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. In this talk, we will investigate the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary D of d strings, each of length L, a query string q of length L, and a positive integer z, and we are asked to compute a smallest set K⊆ {1,...,L}, so that if q[i], for all i∈K, is replaced by a wildcard, then q matches at least z out of d strings from D.

Date: 3 November 2020

Speaker: Haodi Zhong Title: Clustering datasets with demographics and diagnosis codes

Abstract: Clustering data derived from Electronic Health Record (EHR) systems is important to discover relationships between the clinical profiles of patients and as a pre-processing step for analysis tasks, such as classification. However, the heterogeneity of these data makes the application of existing clustering methods difficult and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient’s values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprised of over 26,000 and 52,000 records, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.

Date: 20 October 2020

Speaker: Rosanne Wallin

Title: Applicability of phylogenetic network algorithms to represent the evolutionary history of SARS-CoV-2

The outbreaks of SARS (2003), MERS (2012) and most recently COVID-19 have led to a lot of public and scientific interest into the origins of the coronaviruses that cause these diseases. Traditionally, evolutionary history is inferred and displayed by constructing phylogenetic trees, but these are restricted to displaying vertical evolution. Phylogenetic networks are designed to represent more complex evolutionary relationships and are able to display horizontal evolutionary events, such as recombination in viruses. Several algorithms have been developed to construct such networks, but their suitability for coronavirus data is still unclear. In this presentation, I will shortly explain the concepts of viral recombination and phylogenetic networks and talk about the applicability of five phylogenetic network methods (TriLoNet, TriL2Net, Tree-Child Networks, Temporal Hybridization Number and Maximum-Pseudo Likelihood in PhyloNet) to a genomic data set of SARS-CoV-2 and related coronaviruses. I will show the influence of preprocessing steps (taxon selection, edge contraction and resolving multifurcations) and discuss the limitations of the different methods in terms of input size and consistency of the constructed networks. I will also discuss the biological interpretation of the resulting networks with respect to the evolutionary history of SARS-CoV-2.

Date: 6 October 2020

Speaker: Monika Grewal

Title: Developing a deep learning method for automatic organ segmentation

Abstract: Automatic organ segmentation can save hours of manual work required by clinicians in the process of radiotherapy treatment planning. Despite the availability of a plethora of deep learning methods for semantic segmentation and their surprisingly good performance on new unseen data, developing a deep learning method for a new problem is still full of challenges. At the bottom of these challenges is the need for a large annotated dataset for supervised learning of a neural network. In the medical imaging domain, the time and efforts required for building a large annotated dataset become even scarcer and costlier due to the underlying requirement of clinical expertise for annotating the data. Therefore, in order to avoid the need for manual annotation, we scraped a big dataset from the clinical database of a hospital along with the clinically available annotations. In this presentation, I will talk about the steps followed by us to develop a considerable size of fully annotated data suitable for supervised learning. I will also discuss the effects of different decisions related to sampling, preprocessing, and cleaning of data on the performance of baseline deep learning method for organ segmentation.

Date: 22 September 2020

Speaker: Tom den Ottelander

Title: What matters most for Neural Architecture Search: An analysis of search strategies, from simple to complex

Abstract: Computer vision tasks, like supervised image classification, are effectively tackled by convolutional neural networks, provided that the architecture, which defines the structure of the network, is set correctly. Neural Architecture Search (NAS) is a relatively young and increasingly popular field that is concerned with automatically optimizing the architecture of neural networks. Previously known work shows that even though recent works typically develop increasingly complex search strategies for NAS, several do not significantly outperform simple approaches like randomly sampling from the search space on single-objective NAS tasks. Additionally, proper ablation studies are often missing, therefore it is uncertain which mechanisms are fundamental for algorithms to achieve excellent NAS performances. In the first part of this thesis, Local Search (LS) and Uniform Size Random Search (USRS), are proposed for multi-objective (MO) NAS, demonstrating that very simple algorithms can provide searching performances close to state-of-the-art evolutionary algorithms (EAs), while outperforming random search. The second part explores what mechanisms are essential for the Multi-Objective Gene-pool Optimal Mixing Evolutionary Algorithm (MO-GOMEA), a state-of-the-art model-based EA, to achieve excellent performances for searching NAS spaces. The automatic population-sizing scheme of MO-GOMEA offers a welcome anytime-performance, but objective space clustering has only a small beneficial impact. The number of clusters can be set arbitrarily. Extreme clusters can be enabled to the practitioner’s preference, resulting in different search behaviors. The improved performance gained by automatic linkage learning is limited, although it can be helpful on future, more complex search spaces.

Date: 8 September 2020

Speaker: Michelle Sweering

Title: String Sanitization under Edit Distance: Improved and Generalized

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_ED such that: (i) no string of S occurs in X_ED; (ii) the order of all other length-k substrings over Σ is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].

ETFS can be solved in O(n^2k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2−δ