Algorithms and Data Structures for Sequence Analysis

In this subgroup of the CWI Life Sciences and Health group, led by Dr. Solon Pissis, we are working on algorithms and data structures for sequence analysis.

In this subgroup of the CWI Life Sciences and Health group, led by Dr. Solon Pissis, we are working on algorithms and data structures for sequence analysis.

Research topics we are working on include algorithms and data structures on strings for pattern matching, indexing, comparison, and finding regularities. These topics are collectively known as combinatorial pattern matching. Combinatorial pattern matching methods are typically the workhorse of  computational molecular biology applications, where the raw data to be analysed are DNA, RNA or protein sequences. We are interested in such applications but also in other applications processing sequential data, such as data mining, data compression, and information retrieval.

Combinatorial pattern matching

Let us start by providing a brief overview of the combinatorial pattern matching topics.

In pattern matching, we are given a string x to pre-process in time that is proportional to the length of x, so that when we are given another longer string y, we can find all occurrences of x in y in time that is proportional to the length of y. In the following example we see an approximate occurrence of string x in string y.

In indexing, we are given a string y to pre-process into a data structure in time that is proportional to the length of y, so that when we are given another shorter string x, we can find occurrences of x in y in time that is proportional to the length of x. In the following example we see an indexing data structure for string y = ababbb.

In sequence comparison, we are given two (or more) strings and the task is to compare them in order to infer or visualize their similarities. In the following example we see an alignment for two strings: ACGA and ATGCTA.

In finding regularities, we are given a string x and the task is to find certain types of patterns repeating in x. In the following example we see a few different patterns repeating in string x = abaabbabbaaabbabbba.

Computational biology

Let us now mention a few applications of combinatorial pattern matching in computational molecular biology:

  • The first step of genome assembly via read mapping is to construct an indexing data structure over the reference genome. The first step of de novo genome assembly is to find common substrings between the reads to construct a de Bruijn or an overlap graph.

  • The first step of inferring evolutionary relationships between sequences is the pairwise or multiple comparison of these sequences. The output of this step is used as the input to methods that reconstruct phylogenetic trees to depict these relationships.

  • Inferring functional relationships between sequences is performed via comparing a sequence to a sequence database to detect regions of significant similarity or via extracting patterns (motifs), which are significantly overrepresented in a set of sequences.

Data mining

Combinatorial pattern matching methods are also applied in data mining, when the data to be analysed are textual (sequential). For example, a string can represent the movement history of an individual, with each letter corresponding to a visited location; or an individual's purchasing history in a retailer, with each letter corresponding to a purchased product. Let us mention a few  such applications:

  • In pattern mining the task is to extract actionable patterns from large textual datasets; e.g. to extract the most frequent or the most significant patterns.

  • In data sanitization the task is to disguise confidential patterns in large textual datasets while preserving the utility of these datasets.
  • In document clustering the task is to compute term frequencies to reflect how important a term is to a document in a large collection of documents.

Researchers at CWI Life Sciences and Health research group

Recent related research outputs