Abstract Alexander Schönhuth

CLEVER & SMART: Clique-Enumerating Variant Finder & Split-Read Mixture-based Ambiguity Resolving Tool



Next-generation sequencing techniques have facilitated large-scale analysis of human genetic variation. Despite advances in sequencing speed, and although many approaches have already been presented, the computational discovery of structural variants is not yet standard. Although next-generation sequencing data exhibit many systematic biases and errors, unifying statistical frameworks that address them have rarely been presented.


We address this issue and provide two approaches, each grounded in a statistical framework, for predicting structural variants in human genomes from next-generation sequencing data.

CLEVER harnesses the (well-understood) statistics inherent to the length of the sequenced genomic fragments (reads). It organizes all reads into a read alignment graph. In this graph, max-cliques represent maximal contradiction-free groups of fragments, that is, groups of genomic fragments that stem from the same region. A specifically engineered algorithm finds all max-cliques and evaluates each of them for its potential to reflect a structural variation, based on the fragment length statistics.
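The clique idea can be illustrated with a small sketch. This is not CLEVER's engineered algorithm: it uses a textbook Bron-Kerbosch enumeration with pivoting, a simplified edge rule (two alignments are connected if their intervals overlap and their lengths differ by at most a threshold), and a plain z-score against an assumed null fragment length distribution. The function names, the `(start, end)` interval representation, and the parameters `max_len_diff`, `mu`, and `sigma` are all hypothetical.

```python
from math import sqrt
from statistics import mean


def build_graph(alignments, max_len_diff):
    """Connect alignments whose intervals overlap and whose lengths agree.

    alignments: list of (start, end) intervals (hypothetical representation).
    max_len_diff: assumed tolerance on the length difference of two neighbors.
    """
    adj = {i: set() for i in range(len(alignments))}
    for i, (s1, e1) in enumerate(alignments):
        for j in range(i + 1, len(alignments)):
            s2, e2 = alignments[j]
            overlap = max(s1, s2) <= min(e1, e2)
            similar = abs((e1 - s1) - (e2 - s2)) <= max_len_diff
            if overlap and similar:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def z_score(clique, alignments, mu, sigma):
    """Deviation of the clique's mean length from the null mean mu,
    in units of the standard error sigma / sqrt(clique size)."""
    lengths = [alignments[i][1] - alignments[i][0] for i in clique]
    return (mean(lengths) - mu) / (sigma / sqrt(len(lengths)))


alignments = [(100, 300), (120, 320), (110, 310),
              (150, 650), (160, 660), (5000, 5200)]
graph = build_graph(alignments, max_len_diff=50)
cliques = maximal_cliques(graph)
scores = [z_score(c, alignments, mu=200, sigma=20) for c in cliques]
```

In this toy input, the clique whose members are much longer than the assumed null mean of 200 receives a large positive score, which is the kind of signal a fragment-length-based method would flag for closer inspection.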

SMART is a so-called split-read approach that aims to resolve the issue of ambiguously mapped reads (multi-reads). It computes so-called split-read alignments of reads, which reveal variant breakpoints in the genome under investigation. It resolves multi-reads by means of a mixture-model-based expectation-maximization (EM) procedure. The resulting maximum likelihood estimate encodes the most likely assignment of multi-reads to genomic regions.
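How an EM procedure can assign multi-reads is sketched below with a generic mixture model. This is not SMART's actual model: each read is assumed to come with a likelihood for every candidate locus, the mixture weights `pi` stand for the share of reads originating from each locus, and the function name `em_assign` and its input format are hypothetical.

```python
def em_assign(likelihoods, n_loci, iterations=50):
    """Generic mixture-model EM for ambiguously mapped reads.

    likelihoods: one dict per read, {locus index: alignment likelihood}
                 (hypothetical input format).
    Returns the mixture weights and the per-read responsibilities.
    """
    pi = [1.0 / n_loci] * n_loci  # start from uniform locus weights
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each candidate locus per read.
        resp = []
        for candidates in likelihoods:
            weights = {k: pi[k] * lik for k, lik in candidates.items()}
            total = sum(weights.values())
            resp.append({k: w / total for k, w in weights.items()})
        # M-step: each weight becomes the average responsibility of its locus.
        pi = [sum(r.get(k, 0.0) for r in resp) / len(likelihoods)
              for k in range(n_loci)]
    return pi, resp


# Three reads map uniquely to locus 0; one multi-read fits loci 0 and 1 equally.
reads = [{0: 1.0}, {0: 1.0}, {0: 1.0}, {0: 0.5, 1: 0.5}]
pi, resp = em_assign(reads, n_loci=2)
best_locus = max(resp[3], key=resp[3].get)
```

In this toy input, the uniquely mapped reads pull the weight of locus 0 upward in each iteration, so the multi-read, which alignment scores alone cannot place, ends up assigned to locus 0.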


We compare a wide range of state-of-the-art approaches on a fully annotated genome (Craig Venter's) and present various relevant performance statistics. Among the insert-size-based approaches, CLEVER achieves superior performance, in particular on indels of 20–100 bp, which have been identified as a current major challenge in the structural variant (SV) discovery literature and for which insert-size-based approaches were thought to have limitations. SMART yields further improvements in both recall and precision, and its predictions are ultra-precise in terms of breakpoint accuracy, even for long indels (100–50000 bp), thereby outperforming all current state-of-the-art approaches to SV discovery.