Supervisors: Ben Bals and Solon P. Pissis
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Your Contribution: The core of this project will consist of you implementing our improved algorithm as well as previous algorithms. The goal is to test the theoretical advantages of the novel algorithm in practice and compare it to previous approaches. Additionally, we see seek to better understand how real-world data (i.e., real genomes) behave (as opposed to a theoretical worst-case analysis).
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Recently, our research group has developed a theoretically faster algorithm for string assembly based on Eulerian trails in a special class of graphs (so called de Bruijn graphs) [1]. In string assembly, we receive fragments (i.e.,strings of length π). Our goal now is to find a larger string such that each fragment maps to a unique substring of that larger string. In the variation of this classical problem that we consider, we also have access to additional information on where each fragment can appear in the assembled string. These kinds of algorithms lay at the heart of genome assembly and also have applications in differential privacy.
You should have solid skills in efficient programming (e.g., good practical grasp of standard data structures and memory management). Prior knowledge of C++ or Rust is advantageous. A good knowledge of common string data structures and algorithms may help, but is not required. Prior knowledge in bioinformatics or graph theory is not required.
Setup
Research Questions
If the code is designed correctly, this should be possible by making relatively few modifications to our algorithm.
[1] Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, and Hilde Verbeek. When is string reconstructionusing de Bruijn graphs hard? Private communication, 2025.
[2] Amir Ben-Dor, Itsik Peβer, Ron Shamir, and Roded Sharan. On the complexity of positional sequencing by hybridization. J. Comput. Biol., 8(4):361β371, 2002.doi:10.1089/106652701752236188.
[3] Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Giulia Punzi. Beyond the BEST theorem: Fast assessment of Eulerian trails. In Evripidis Bampis and Aris Pagourtzis, editors, Fundamentals of Computation Theory - 23rd International Symposium, FCT 2021, Athens, Greece, September 12-15, 2021, Proceedings, volume 12867 of Lecture Notes in Computer Science, pages 162β175. Springer, 2021. doi:10.1007/978-3-030-86593-1\_11.
[4] National Center for Biotechnology Information. Escherichia coli. In National Library of Medicine. URL: https://www.ncbi.nlm.nih.gov/datasets/taxonomy/562/.
[5] National Center for Biotechnology Information. Homo sapiens. In National Library of Medicine. URL: https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/.