CWI RESEARCHERS DEVELOP NEW TECHNIQUES FOR GENERAL TOP-DOWN PARSING OF PROGRAMMING LANGUAGES

Afroozeh and Izmaylova, PhD students at CWI's Software Analysis and Transformation group, present new techniques for declarative parsing of programming languages in their thesis "Practical Genereal Top-down Parsers."

Publication date
7 Jun 2019

Despite the long history of research in parsing, constructing parsers for real programming languages remains a difficult and painful task. In the last decades, different parser generators emerged to allow the construction of parsers from a BNF-like specification. However, still today, many parsers are handwritten, or are only partly generated, and include various hacks to deal with different peculiarities in programming languages. In their thesis, Ali Afroozeh and Anastiasia Izmaylova, PhD students at the Software Analysis and Transformation (SWAT) group of Centrum Wiskunde & Informatica (CWI), present new techniques for declarative parsing of programming languages. They will defend their PhD thesis "Practical General Top-down Parsers" at the University of Amsterdam on June 11, 2019.

Top-down parsing, usually in the form of recursive-descent parsing, is efficient, easy to understand and is widely used to manually write parsers for real programming languages. However, due to the difficulty of dealing with left recursion, top-down parsers always suffered from expressiveness problems compared to their bottom-up counterparts.

This thesis presents the design and implementation of practical general top-down parsers. The work in this thesis has been inspired by the Generalized LL (GLL) parsing algorithm and Johnson's CPS recognizers, which both generalize recursive-descent parsing to support all context-free grammars, including left-recursive ones. In fact, an important contribution of this thesis is the merging and unifying of GLL and Johnson's CPS recognizers.

The main research focus has been to realize general top-down parsers that are practical for parsing real programming languages. Since a general parser without disambiguation is not useful for parsing real programming languages, which almost always should be unambiguous, the authors have spent considerable time on disambiguation, especially operator precedence disambiguation. The contributions of this thesis can be summarized as follows:

- Algorithmic and implementation optimizations of the GLL parsing algorithm.

- Extension of the GLL parsing algorithm to support data-dependent grammars, to deal with languages that are not context-free and to provide a generic framework for implementing disambiguation constructs.

- Semantics and an efficient implementation for operator precedence disambiguation.

- General parser combinators, providing a viable alternative for users who would like to have a general parser directly embedded in a programming language such as Scala.

The authors have developed Iguana, a data-dependent parsing framework based on GLL. The results of the performance evaluation show that Iguana is practical for parsing real programming languages. The authors have also developed Meerkat, a general parser combinator library in Scala.

The research was supervised by Prof. Paul Klint and Prof. Jurgen Vinju.

More information about the defense can be found on the UvA website (in Dutch).

Link to thesis