Many major, modern compilers use hand-written recursive descent parsers. While parser generators like Yacc and Bison were popular in the past, a hand-written approach has become the preferred method for many production-level compilers.
Examples of major compilers using recursive descent
- GCC: The GNU Compiler Collection once used a parser generated by Bison but switched all its front-ends, including those for C and C++, to hand-written recursive descent in the mid-2000s.
- Clang: The Clang compiler, which uses LLVM, has a hand-written recursive descent parser for C, C++, and Objective-C.
- Rust: The Rust compiler’s parser is hand-written.
- V8: The JavaScript engine in Chrome, V8, uses a hand-written recursive descent parser.
- Roslyn: The .NET Compiler Platform (Roslyn) for C# uses a hand-written recursive descent parser.
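The core idea behind all of these parsers is the same: each grammar rule becomes a function, and the functions call each other recursively. A minimal sketch in Python for a toy arithmetic grammar (the grammar and class here are illustrative, not taken from any of the compilers above):

```python
# Minimal hand-written recursive descent parser/evaluator.
# Grammar (illustrative):
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[()+\-*/]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.eat() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.eat() == "*":
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

print(Parser("2+3*(4-1)").expr())  # 11
```

A production parser would build AST nodes instead of evaluating in place, but the one-function-per-rule shape is the same.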
Reasons for preferring hand-written parsers
- Superior error reporting: Hand-written parsers offer fine-grained control over error messages, allowing for more specific and helpful feedback to the user. This is a critical feature for a good developer experience.
- Handling complex or ambiguous grammars: Modern languages, especially C++, can have highly complex and ambiguous grammars that traditional LALR parser generators such as Yacc and Bison struggle with. A hand-written parser can use ad-hoc methods to handle these tricky cases.
- Greater flexibility: Hand-written parsers are more adaptable and can be easily integrated with other compiler phases, like the symbol table, to handle context-dependent aspects of a language. This flexibility is important for language evolution and for supporting advanced features like incremental compilation.
- Bootstrap simplicity: Hand-written parsers do not introduce an external build-time dependency on a separate tool. For many language projects, it is simpler to write the parser directly in the language being compiled as part of the bootstrapping process.
- Competitive performance: While generated parsers can be fast, a well-implemented recursive descent parser can offer comparable or better performance. The performance difference is often negligible compared to the rest of the compilation process.
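The symbol-table point is best seen in C's classic typedef ambiguity: `T * x;` is a pointer declaration if `T` names a type, and a multiplication expression otherwise. A hand-written parser can simply consult the symbol table at the decision point. A hedged sketch (the function and token shapes are illustrative, not any real compiler's API):

```python
# Illustrative sketch of context-dependent parsing: the parse decision
# for "T * x ;" depends on whether T is a known typedef name.
def classify(tokens, typedef_names):
    """Classify a four-token statement like ["T", "*", "x", ";"]."""
    if len(tokens) == 4 and tokens[1] == "*" and tokens[3] == ";":
        if tokens[0] in typedef_names:
            return "declaration"  # pointer declaration: T *x;
        return "expression"       # multiplication: T * x;
    return "other"

symbols = {"size_t"}
print(classify(["size_t", "*", "p", ";"], symbols))  # declaration
print(classify(["a", "*", "b", ";"], symbols))       # expression
```

A table-driven LALR parser has no natural place for this lookup, which is one reason C and C++ front-ends abandoned generated parsers.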
Potential drawbacks
- Increased development effort: Hand-writing a parser can be more complex and time-consuming than using a generator for a simple grammar.
- Risk of implementation errors: Since the parser code is written manually, it can contain bugs that could be avoided by using a generator with a formally verified grammar.
For most modern production-level compilers, the trade-off of greater developer control and better diagnostics is worth the extra development effort of writing a recursive descent parser by hand.
Parser generators
Despite the industry trend toward hand-written parsers, parser generators remain a powerful and fascinating area of study. My experience began with classic tools such as yacc, bison, lex, and flex, and has since expanded to more modern generators like antlr, tatsu, egg, rust-peg, pigeon, and pegen. This exploration covers a spectrum of design philosophies: some generators enforce a separation of concerns by deferring logic to a post-parsing AST walker, while others allow for semantic actions to be embedded directly within the grammar specification (EBNF). Furthermore, these tools vary in their approach to AST construction; some automatically generate an AST from the grammar, others require manual construction via embedded logic, and a few utilize formalisms like the Abstract Syntax Definition Language (ASDL).
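The "separation of concerns" philosophy can be sketched concretely: the parser emits a generic tree, and a separate walker dispatches on node kinds to produce the result or AST. The classes below are illustrative, mimicking the visitor pattern used by tools like Parsimonious or ANTLR's generated listeners, not any specific tool's API:

```python
# Illustrative post-parse tree walker: parsing and semantics are separated.
class Node:
    def __init__(self, kind, children=(), value=None):
        self.kind = kind
        self.children = list(children)
        self.value = value

class Walker:
    """Dispatch to visit_<kind> methods, with a generic fallback."""
    def walk(self, node):
        method = getattr(self, f"visit_{node.kind}", self.generic)
        return method(node)

    def generic(self, node):
        return [self.walk(child) for child in node.children]

class Evaluator(Walker):
    def visit_num(self, node):
        return node.value

    def visit_add(self, node):
        left, right = (self.walk(c) for c in node.children)
        return left + right

tree = Node("add", [Node("num", value=2), Node("num", value=3)])
print(Evaluator().walk(tree))  # 5
```

The embedded-action style collapses the two phases: the equivalent of `visit_add` runs inline, inside the grammar rule, as the parse succeeds.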
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator that generates a language recognizer from a grammar you provide. While both ANTLR 3 and ANTLR 4 fulfill this purpose, they feature fundamental differences in their parsing strategy, how they handle grammar rules, and how they separate grammar logic from application code.
ANTLR 3
- Parsing algorithm: Uses a static, greedy LL(*) algorithm that determines its lookahead depth during parser generation. This required developers to manually rework certain grammar constructs, such as left recursion, to make them palatable to the tool. It also supported auto-backtracking as an option for grammars too complex for the static LL(*) analysis.
- Grammar development: Allowed embedding custom code, known as "actions," directly inside grammar rules. This could mix parsing logic with business logic and was useful for building Abstract Syntax Trees (ASTs).
- AST construction: Supported building custom ASTs with tree rewrite rules (`->`) and AST operators (`^`, `!`) within the grammar. This gave developers direct control over the structure of the output tree.
- Tree grammars: Included dedicated tree grammars for processing the custom ASTs created by the parser.
- IDE support: Was bundled with ANTLRWorks, a graphical development environment that included an editor, an interpreter for rapid prototyping, and a debugger.

ANTLR 4
...
The Abstract Syntax Definition Language (ASDL) is a domain-specific language designed to describe the tree-like data structures that represent a program’s abstract syntax, known as Abstract Syntax Trees (ASTs). It was created to solve the problem of separating the logical structure of a program (the AST) from the concrete syntax of the programming language. ASDL was introduced as part of the Zephyr compiler infrastructure project at Princeton University and the University of Wisconsin in 1997. ASDL specifications are fed into a generator that automatically produces the necessary data types for a given programming language, such as C, C++, Java, or Python. ...
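A small ASDL fragment makes the idea concrete. The module below is illustrative, written in the style of CPython's `Python.asdl`, and does not come from any real compiler; a generator would emit matching node types in the target language:

```
-- Illustrative ASDL module for a tiny expression language.
-- Sum types ('|') become node class hierarchies; fields become attributes.
module Tiny
{
    expr = BinOp(expr left, operator op, expr right)
         | Num(int value)
         | Name(identifier id)

    operator = Add | Sub | Mul | Div
}
```

From this one specification, the Zephyr-style generator could emit equivalent tree definitions for C, Java, or Python, keeping every tool that touches the AST in agreement.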
A variety of PEG (Parsing Expression Grammar) parser generators are available for Python, Rust, and C++. They differ primarily in how they handle actions, AST generation, and error handling.

Python
- Pegen
  - Embedded code: Supports embedding Python code directly into the grammar file, known as "grammar actions," to build the AST. The current CPython interpreter's parser is built using Pegen.
  - AST/CST: Directly generates AST objects based on the grammar actions.
  - Semantic predicates: The grammar actions, which can execute arbitrary Python expressions, effectively function as semantic predicates, allowing context-sensitive parsing.
- Parsimonious
  - Embedded code: Does not embed code directly in the grammar. The grammar defines the structure, and you traverse the resulting concrete syntax tree (CST) with a visitor or a custom transformation class.
  - AST/CST: Generates a CST (a full parse tree) from the input. It is up to the developer to then process this tree into a more abstract representation (AST).
  - Semantic predicates: Does not support semantic predicates directly in the grammar. Semantic checks are performed during the post-parsing tree traversal.
- Fastidious
  - Embedded code: Aims to support complex rule actions but primarily uses a separate compilation step.
  - AST/CST: Plans to support automatic AST node generation.
  - Semantic predicates: Not specified in the documentation, suggesting they are handled outside the grammar rules.

Rust
...