Multiple structural alignment of RNA and proteins
Research group: Life Sciences
- March 2007 - August 2008 (FU Berlin)
- September 2008 - August 2012 (CWI Amsterdam)
- Funding: March 2007 - September 2011: Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) DFG KL 1390/2-1; September 2011 - August 2012: Centrum Wiskunde & Informatica
RNA Structural Alignment
During the last years, the number of genes known for producing non-coding functional RNA has increased significantly, and it is assumed that many more of these ncRNA genes are still undiscovered. Yet, many functional classes of RNA show little sequence conservation, but rather a conserved secondary structure. A promising avenue to detecting new functional RNAs is thus to look for common structural features shared by a set of related sequences. Within this project we develop algorithmic theory to compute reliable multiple structural alignments of potentially long RNA molecules under a reasonable consumption of computational resources. We employ methods from mathematical programming such as Lagrangian relaxation and solve the problem as an integer linear program resulting from a graph-theoretical reformulation. Our results for the case of pairwise structural alignment show that our current software prototype is among the top programs in terms of speed and alignment quality.
Due to the similiarities of RNA secondary structures and contact map representations of proteins and based on the successful application of the above techniques in the field of RNA structural biology, we have extended the project to the area of protein structure research.
Protein Structural Alignment
Proteins are the molecular machines of the cell, carrying out many diverse functions. A protein’s particular function usually follows from its 3-dimensional structure, a relationship known as the structure-function paradigm.
A protein is a folded chain of amino acid residues connected by peptide bonds. Its structure has a regular backbone and different side chains which depend on the respective amino acid type. There are four levels of protein structure: primary, secondary, tertiary and quarternary structure. Primary structure is the sequence of amino acid residues. Secondary structure comprises regular elements such as α-helices and β-sheets. Tertiary structure is determined by the precise location of every atom in the structure. Finally, quarternary structure is an assembly of several amino acid chains.
Structural similarities between proteins are extremely common. First, there is only a limited number of overall structural arrangements, or so-called protein folds. Then, evolutionary entities, so-called domains, re-appear in many different proteins, because they are deleted, inserted, swapped, mutated, etc. Further, there is a high evolutionary pressure on conserving protein function and hence on conserving protein structure during evolution. Proteins which share a common ancestor, i.e., homologs, thus often have a similar structure. Finally, similar function may lead to similar, convergent structures.
Because of this structure-function relationship, protein similarities help us to learn about the evolution and biological tasks of proteins. Therefore we would like to reliably detect such structural similarities. For this, we resort to experimental data which provides the 3D location of every atom in the protein structure. Two proteins are considered structurally similar if they have a similar protein backbone conformation. We thus select only one representative atom of every amino acid residue. The resulting two chains of representative atoms are then compared. From such a comparison we obtain a sequential one-to-one mapping between structurally equivalent residues in the two proteins. This is called a structure alignment. Further, we can obtain a similarity score for a pair of proteins. Given an alignment we can also superpose the two structures in 3-dimensional space (see illustration).
Using optimization, we aim to detect the best structure alignment. Two steps are important. First, we need to define a scoring scheme according to which biologically correct alignments are top-scoring. Then, we use an algorithm that finds the score-optimal alignment. For reasonable scoring schemes, finding the optimal structure alignment is difficult; it is an NP- hard problem. As a result, almost all of the many existing structure alignment algorithms are heuristics. Even worse, each heuristic uses its own scoring scheme. Currently, there is no consensus which algorithm or scoring scheme is best.
Our contribution to the structure alignment problem is two-fold. First, we cast the problem in mathematical models and design exact algorithms to solve it. To this end we formulate general integer linear programs for inter-residue distance matrix-based structure alignment. These models cast many existing structure alignment approaches into a common framework. Finally, we solve these models to optimality by designing exact algorithms.
An inter-residue distance matrix alignment is a sequential assignment of a subset of distance matrix rows and columns from one matrix to a subset of distance matrix rows and columns from the other matrix. This assignment should maximize the overall score for paired inter- residue distances. An exact algorithm for this problem will either return a mapping of maximum score or, if not found within time limit, bounds on this maximum score. We apply techniques from combinatorial optimization for our exact algorithms: integer linear programming, Lagrangian relaxation, branch-and-bound and branch-and-cut.
Our second contribution to the structure alignment problem is the application of our algorithms to problems that can only be tackled by the use of exact algorithms. For example, we compute provably better alignments, benchmark heuristics, obtain alignment quality guarantees, assess protein similarity, or rigorously compare scoring schemes for protein structure alignment. Finally, we provide all these services to structural biologists via our web server CSA.
- I Wohlers. Exact algorithms for pairwise protein structure alignment. PhD thesis, Vrije universiteit Amsterdam, Netherlands, 2012
- I Wohlers, R Andonov, and GW Klau. Optimal DALI protein structure alignment. Rapport de recherche RR-7915, INRIA, 2012. Submitted.
- I Wohlers, N Malod-Dognin, R Andonov, and GW Klau. CSA: Comprehensive comparison of pairwise protein structure alignments. Nucleic Acids Research 2012; doi: 10.1093/nar/gks362.
- A Mucherino, I Wohlers, GW Klau, R Andonov. Sparsifying Distance Matrices for Protein-Protein Structure Alignments. Extended abstract, to be presented at 10th Cologne-Twente Workshop on Graphs and Combinatorial Optimization. Frascati, Italy, 14-16 June 2011.
- I Wohlers, R Andonov, and GW Klau. Algorithm Engineering for Optimal Alignment of Protein Structure Distance Matrices, Optimization Letters 5(3):421-433, 2011
- I Wohlers, FS Domingues, GW Klau. Towards optimal alignment of protein structure distance matrices. Bioinformatics 26(18):2273-2280, 2010.
- I Wohlers, L Petzold, FS Domingues, GW Klau. PAUL: protein structural alignment using integer linear programming and Lagrangian relaxation. Selected abstract. BMC Bioinformatics 2009, 10(Suppl 13):P2
- I Wohlers, L Petzold, FS Domingues, GW Klau. Aligning protein structures using distance matrices and combinatorial optimization. In Proc. of GCB 2009 (German Conference on Bioinformatics, Halle (Saale), Germany, 28-30 Sep 2009), I. Grosse et al. (eds.), Lecture Notes in Informatics, 2009
- I Wohlers. Multiple Alignment of Protein Distance Matrices with Mathematical Programming, MSc thesis, Freie Universität Berlin, 2008.
- M Bauer. A Combinatorial Approach to RNA Sequence-Structure Alignments. PhD thesis, Freie Universität Berlin, Germany, 2008.
- M Bauer, GW Klau, and K Reinert. An exact mathematical programming approach to multiple RNA sequence-structure alignment. Algorithmic Operations Research, Vol. 3, No. 2, 2008. Special Issue on Biology, Medicine, and Health Care.
- M Bauer, GW Klau, and K. Reinert. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8(271), 2007.
- K Reinert, M Bauer, A Döring, GW Klau, and AL Halpern. A general paradigm for fast, adaptive clustering of biological sequences. In C. Falter et al., editors, Proc. of the German Conference on Bioinformatics (GCB 2007), Potsdam, Germany, volume 115 of GI-Edition, Lecture Notes in Informatics, pages 15-29. Gesellschaft für Informatik, 2007.
- Algorithmic Bioinformatics group, Freie Universität Berlin. (Knut Reinert)
- INRIA Rennes, team SYMBIOSE (Rumen Andonov, Noël Malod-Dognin, Mathilde Le Boudic-Jamin)