Algorithms and Data Structures for Sequence Analysis

In this subgroup of the CWI Life Sciences group, led by Dr. Solon Pissis, we are working on algorithms and data structures for sequence analysis.

In this subgroup of the CWI Life Sciences group, led by Dr. Solon Pissis, we are working on algorithms and data structures for sequence analysis.

Research topics we are working on include algorithms and data structures on strings for pattern matching, indexing, comparison, and finding regularities. These topics are collectively known as combinatorial pattern matching. Combinatorial pattern matching methods are typically the workhorse of  computational molecular biology applications, where the raw data to be analysed are DNA, RNA or protein sequences. We are interested in such applications but also in other applications processing sequential data, such as data mining, data compression, and information retrieval.

Let us start by providing a brief overview of the combinatorial pattern matching topics.

In pattern matching, we are given a string x to pre-process in time that is proportional to the length of x, so that when we are given another longer string y, we can find all occurrences of x in y in time that is proportional to the length of y. In the following example we see an approximate occurrence of string x in string y.

In indexing, we are given a string y to pre-process into a data structure in time that is proportional to the length of y, so that when we are given another shorter string x, we can find occurrences of x in y in time that is proportional to the length of x. In the following example we see an indexing data structure for string y = ababbb.

In sequence comparison, we are given two (or more) strings and the task is to compare them in order to infer or visualize their similarities. In the following example we see an alignment for two strings: ACGA and ATGCTA.

In finding regularities, we are given a string x and the task is to find certain types of patterns repeating in x. In the following example we see a few different patterns repeating in string x = abaabbabbaaabbabbba.

Let us now mention a few applications of combinatorial pattern matching in computational molecular biology:

  • The first step of genome assembly via read mapping is to construct an indexing data structure over the reference genome. The first step of de novo genome assembly is to find common substrings between the reads to construct a de Bruijn or an overlap graph.

  • The first step of inferring evolutionary relationships between sequences is the pairwise or multiple comparison of these sequences. The output of this step is used as the input to methods that reconstruct phylogenetic trees to depict these relationships.

  • Inferring functional relationships between sequences is performed via comparing a sequence to a sequence database to detect regions of significant similarity or via extracting patterns (motifs), which are significantly overrepresented in a set of sequences.

Some of our recent related research outputs are:

  1. Lorraine A. K. Ayad, Solon P. Pissis, Dimitris Polychronopoulos: CNEFinder: finding conserved non-coding elements in genomes. Bioinformatics 34(17): i743-i747 (2018)
  2. Lorraine A. K. Ayad, Solon P. Pissis: MARS: improving multiple circular sequence alignment using refined sequences. BMC Genomics 18(1), 86 (2018)
  3. Alice Héliou, Solon P. Pissis, Simon J. Puglisi: emMAW: computing minimal absent words in external memory. Bioinformatics 33(17): 2746-2749 (2017)

Combinatorial pattern matching methods are also applied in data mining, when the data to be analysed are textual (sequential). Let us mention a few  such applications:

  • In frequent pattern mining the task is to extract the most frequent and relevant patterns from large textual datasets.

  • In data sanitization the task is to disguise confidential patterns in large textual datasets while preserving the utility of these datasets.
  • In document clustering the task is to compute term frequencies to reflect how important a term is to a document in a large collection.

Researchers at CWI Life Sciences and Health research group