For the first lab in 6.035, Computer Language Engineering, in Spring 2017, we use Flex, a lexical analyzer library, and Bison, a parser generator, to create a parser for a simple scripting language.
The grammar for the language is shown below.
1 | Program ::= Statement* |
The goal is to construct an Abstract Syntax Tree (AST) for any MITScript program and pretty-print it.
1 | Your parser must produce an AST with nodes for the following program constructs: |
Flex
How do Flex and Bison work? Much of the lab was reading documentation about these two old utilities.
Flex takes a .lex
file, which simply lists the regular expressions, tokens, and keywords that will be matched in a program.
When Flex works with Bison, every time it matches a token/keyword, it tells Bison about it. It can return the type of token/keyword to Bison, and it can also return an entire matched string or int in a Bison object it has access to, called yylval
.
Here are examples of both:
1 | {string_const} { |
For us, the lab specifies that: Your lexer must be able to recognize the following kinds of tokens in addition to all the keywords and operators listed above:
- integer constants consisting of one or more digits
1 | int_const [0-9][0-9]* |
- string constants wrapped in double quotes and supporting the following escaped characters: \ " \n \t
1 | string_const (\"(\\.|[^"])*\") |
- None constant, equivalent to “NULL” in Java
- ‘true’ and ‘false’
1 | "None" {return T_NONE;} |
- Name identifiers that start with a letter or underscore, followed by sequence of letters, underscores and numbers. so x0 is a valid variable name, but 0x is not
1 | identifier ([a-zA-Z_][a-zA-Z_0-9]*) |
The file is compiled in our Makefile with flex --outfile=lexer.cpp --header-file=lexer.h lexer.lex
.
Bison
In the Bison .yy
file, we must define several things. First, we must define the union, types, and tokens that Flex and our grammar know about. Then, we need to define our grammar (the most important part!).
The grammar section of the .yy
file contains rules created simply by converting our MITScript to Bison syntax (more on this soon). Each rule is associated with an action, which contains C++ code to be executed when the rule is matched.
How does this fit into everything else? Well, we have a main.cpp
file that will call yyparse
on stdin, which will match all the tokens and Bison rules, executing their actions. These actions will construct node objects of an AST, which are defined in AST.h
. Then, the main.cpp
file will call the pretty print function on the root node, which will recursively print the tree.
As an example of a rule + action,
1 | returnStmt: |
Above, the T_RETURN and T_SEMICOLON are tokens that correspond to “return” and “;” in our lex file.
The components matched by the rule can be accessed by $n
, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$
.
The ReturnStatement object is an AST node class defined in another file, AST.h
, which we’ll get to.
But let me go back to the union. “The %union declaration specifies the entire collection of possible data types for semantic values.” Lexer can access these union fields in yylval
.
1 | %union { |
You can see that in my union, I have a bunch of pointers of node types in my AST.
I can now define %tokens
, which are terminals in my grammar (strings, keywords and ints), and %types
, which are nonterminals in my grammar. All the tokens are matched and returned in the .lex
file.
1 | %token<intconst> T_INT |
And nonterminals:
1 | %type <statementType> stmt assignment callStmt ifStmt whileLoop returnStmt global Program |
To enforce left-associativity, I add
1 | %left PLUS MINUS |
Now for the fun stuff! With all the types I’ve defined, I need to write grammar rules for all of them.
1 | Program: |
The highest level, the start Program, can be empty. We assign the Program node to out
, which is passed to main.cpp
through an argument of yyparse
. Note that the skeleton code added some extra stuff to make this necessary/possible; namely, it is needed to make the parser reentrant. The skeleton code constructs are explained in this great post.
Much of the translation from BNF grammar to Bison is straightforward. But, when creating variable-length lists, such as function arguments(fun(a b c){}
), we have to use a C++ vector
and recursively push objects into it.
1 | argument_expr_list: |
In the third rule above, the $3
matches the expr
, and $1
matches the argument_expr_list
, which you recall is of type vector<Expression*>
.
The same goes for record
rules, which are like Python dicts, and are of type map<string, Expression*>
:
1 | record: |
Note that the default action for a rule is $$ = $1;
.
Running bison with bison --output=parser.cpp --defines=parser.h -v parser.yy
will produce several files, including a parser.output
file that contains valuable debugging info about our parser, such as the state transition tables.
AST Nodes, Visitor pattern pretty-printing, and main (Or, putting it all together)
Here’s the main.cpp
file
1 | int main(int argc, char** argv){ |
Each class in my AST.h
inherits from Expression
or Statement
.
1 | class Call: public Expression { |
The accept
function that overrides a virtual method is part of the Visitor Pattern. Instead of defining a pretty-print method in each of my classes in AST.h
, I define a PrettyPrinter class that contains visit(Object& obj)
methods for all my nodes. We also need an interface with all our virtual methods in Visitor.h
.
This diagram explains the Visitor pattern well.
My PrettyPrinter.h
file has recursive calls to each node’s accept()
method. For example, for a Block, which contains a list of Statements,
1 | void visit(Block& blk) override { |
I use a global variable TABLEVEL
and printf("\n%s}", string(TABLEVEL-1,'\t').c_str());
to get the correct indentation for the body elements of the block.
When I print, I wrap all BinaryExpression
s and UnaryExpression
s in parentheses.
We can now understand the rest of our Makefile
,
1 | all: parser.cpp lexer.cpp main.cpp PrettyPrinter.h Visitor.h |
The compiled binary, a.out
, will take in a file through a redirect ./a.out < tests/good1.mit
and pretty-print!
Other references
Here is a complete compiler tutorial that is closer to a real language than most things. Most other Flex/Bison examples are stupid calculators x.x.
This tutorial is very very thorough.
The ANSI C Lex spec and Yacc (old version of Bison) file are helpful in constructing our own grammar and lex rules.