Back

Flex and Bison for a Simple Language, MITScript

For the first lab in 6.035, Computer Language Engineering, in Spring 2017, we use Flex, a lexical analyzer library, and Bison, a parser generator, to create a parser for a simple scripting language.

The grammar for the language is shown below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Program ::= Statement*
Statement ::= Assignment | CallStatement | Global | IfStatement | WhileLoop | Return
Global ::= 'global' name ';'
Assignment ::= LHS '=' Expression ';'
CallStatement ::= Call ';'
Block ::= '{' Statement* '}'
IfStatement ::= 'if' '(' Expression ')' Block ( 'else' Block )?
WhileLoop ::= 'while' '(' Expression ')' Block
Return ::= 'return' Expression ';'
Expression ::= Function | Boolean | Record
Function ::= 'fun' '(' Name* ')' Block
Boolean ::= Conjunction ( '|' Conjunction )*
Conjunction ::= BoolUnit ('&' BoolUnit)*
BoolUnit ::= '!'? Predicate
Predicate ::= Arithmetic ( ('<' | '>' | '<=' | '>='| '==') Arithmetic)?
Arithmetic ::= Product ( ('+' | '-') Product)*
Product ::= Unit ( ('*' | '/') Unit)*
Unit ::= '-'? (LHS | Constant | Call | '(' Boolean ')' )
LHS ::= Name ('.' Name | '[' Expression ']' )*
Call ::= LHS '(' (Expression (',' Expression)*)? ')'
Record ::= '{' (Name ':' Expression ';')* '}'
Constant ::= integer_constant | string_constant

The goal is to construct an Abstract Syntax Tree (AST) for any MITScript program and pretty-print it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Your parser must produce an AST with nodes for the following program constructs:
Block ::= [Statement]
Global ::= name
Assignment ::= LHS Expression
ExpressionStatement ::= Expression ';'
IfStatement ::= Condition ThenPart ElsePart
WhileLoop ::= Condition Body
Return ::= Expression
FunctionDeclaration ::= [Arguments] Body
BinaryExpression ::= LeftOperand Operator RightOperand
UnaryExpression ::= Operand Operator
FieldDereference ::= BaseExpression Field
IndexExpression ::= BaseExpression Index
Call ::= TargetExpression [Arguments]
Record ::= Map[String, Expression]
IntegerConstant
StringConstant
NoneConstant

Flex

How do Flex and Bison work? Much of the lab was reading documentation about these two old utilities.

Flex takes a .lex file, which simply lists the regular expressions, tokens, and keywords that will be matched in a program.

When Flex works with Bison, every time it matches a token/keyword, it tells Bison about it. It can return the type of token/keyword to Bison, and it can also return an entire matched string or int in a Bison object it has access to, called yylval.

Here are examples of both:

1
2
3
4
5
6
7
8
{string_const} {
//Rule for string constant
yylval->strType = new string(yytext);
return T_STRINGCONST;
}

"while" {return T_WHILE;}

For us, the lab specifies that: Your lexer must be able to recognize the following kinds of tokens in addition to all the keywords and operators listed above:

  • integer constants consisting of one or more digits
1
int_const [0-9][0-9]*
  • string constants wrapped in double quotes and supporting the following escaped characters: \ " \n \t
1
string_const (\"(\\.|[^"])*\")
  • None constant, equivalent to “NULL” in Java
  • ‘true’ and ‘false’
1
2
3
"None" {return T_NONE;}
"true" {return T_TRUE;}
"false" {return T_FALSE;}
  • Name identifiers that start with a letter or underscore, followed by sequence of letters, underscores and numbers. so x0 is a valid variable name, but 0x is not
1
2
3
4
5
6
7
8
identifier ([a-zA-Z_][a-zA-Z_0-9]*)
.
.
.
{identifier} {
yylval->strType = new string(yytext);
return IDENTIFIER;
}

The file is compiled in our Makefile with flex --outfile=lexer.cpp --header-file=lexer.h lexer.lex.

Bison

In the Bison .yy file, we must define several things. First, we must define the union, types, and tokens that Flex and our grammar know about. Then, we need to define our grammar (the most important part!).

The grammar section of the .yy file contains rules created simply by converting our MITScript to Bison syntax (more on this soon). Each rule is associated with an action, which contains C++ code to be executed when the rule is matched.

How does this fit into everything else? Well, we have a main.cpp file that will call yyparse on stdin, which will match all the tokens and Bison rules, executing their actions. These actions will construct node objects of an AST, which are defined in AST.h. Then, the main.cpp file will call the pretty print function on the root node, which will recursively print the tree.

As an example of a rule + action,

1
2
3
4
returnStmt:
T_RETURN expr T_SEMICOLON {
$$ = new ReturnStatement(*$2);
};

Above, the T_RETURN and T_SEMICOLON are tokens that correspond to “return” and “;” in our lex file.

The components matched by the rule can be accessed by $n, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$.

The ReturnStatement object is an AST node class defined in another file, AST.h, which we’ll get to.

But let me go back to the union. “The %union declaration specifies the entire collection of possible data types for semantic values.” Lexer can access these union fields in yylval.

1
2
3
4
5
6
7
8
9
10
11
12
13
%union {
int intconst;
string *strType; // goes with string_const in lex

Block* blockType;
Statement* statementType;
Expression* expressionType;

vector<Expression*> *exprListType;
vector<Statement*> *stmtListType;
vector<string*> *stringListType; //maybe don't need for function declaration list
map<string, Expression*> *recordMapType;
}

You can see that in my union, I have a bunch of pointers of node types in my AST.

I can now define %tokens, which are terminals in my grammar (strings, keywords and ints), and %types, which are nonterminals in my grammar. All the tokens are matched and returned in the .lex file.

1
2
3
4
5
6
%token<intconst> T_INT

%token<strType> T_SEMICOLON T_LBRACKET T_RBRACKET T_LPAREN T_RPAREN T_AND T_OR T_EXCLAM T_LESS_THAN T_GREATER_THAN T_LEQ T_GEQ T_EQEQ T_PLUS T_MINUS T_TIMES T_DIV T_DOT T_EQUALS T_COLON T_LBRACE T_RBRACE T_COMMA

%token <strType> T_FUNCTION T_GLOBAL T_IF T_ELSE T_WHILE T_RETURN T_NONE T_TRUE T_FALSE
%token <strType> IDENTIFIER T_STRINGCONST

And nonterminals:

1
2
3
4
5
6
7
8
9
10
11
%type <statementType> stmt assignment callStmt ifStmt whileLoop returnStmt global Program

%type <blockType> block

%type<stmtListType> stmts

%type <expressionType> expr call boolean function conjunction boolunit predicate arithmetic product unit lhs record constant subunit

%type<exprListType> argument_expr_list func_declaration_list

%type<recordMapType> record_init_list

To enforce left-associativity, I add

1
2
%left PLUS MINUS
%left TIMES DIV

Now for the fun stuff! With all the types I’ve defined, I need to write grammar rules for all of them.

1
2
3
4
5
6
7
8
Program:
%empty {
// printf("Empty program\n");
}
| stmts {
$$ = new Program(*$1);
out = $$;
};

The highest level, the start Program, can be empty. We assign the Program node to out, which is passed to main.cpp through an argument of yyparse. Note that the skeleton code added some extra stuff to make this necessary/possible; namely, it is needed to make the parser reentrant. The skeleton code constructs are explained in this great post.

Much of the translation from BNF grammar to Bison is straightforward. But, when creating variable-length lists, such as function arguments(fun(a b c){}), we have to use a C++ vector and recursively push objects into it.

1
2
3
4
5
6
7
8
9
10
11
argument_expr_list:
%empty {
$$ = new ExpressionList();
}
| expr {
$$ = new ExpressionList();
$$->push_back($1);
}
| argument_expr_list T_COMMA expr {
$1->push_back($3);
};

In the third rule above, the $3 matches the expr, and $1 matches the argument_expr_list, which you recall is of type vector<Expression*>.

The same goes for record rules, which are like Python dicts, and are of type map<string, Expression*>:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
record:
T_LBRACE record_init_list T_RBRACE {
// printf("Parsing\n ");
$$ = new Record(*$2);
};

record_init_list:
%empty {
// printf("Parsing empty record\n ");
$$ = new RecordMap();
}
| record_init_list IDENTIFIER T_COLON expr T_SEMICOLON {
// printf("Parsing record recursive\n ");
$1->insert(make_pair(*$2, $4));
};

Note that the default action for a rule is $$ = $1;.

Running bison with bison --output=parser.cpp --defines=parser.h -v parser.yy will produce several files, including a parser.output file that contains valuable debugging info about our parser, such as the state transition tables.

AST Nodes, Visitor pattern pretty-printing, and main (Or, putting it all together)

Here’s the main.cpp file

1
2
3
4
5
6
7
8
9
10
11
12
13
int main(int argc, char** argv){
void* scanner;
yylex_init(&scanner);
yyset_in(stdin, scanner);
Statement* output;
int rvalue = yyparse(scanner, output);
if(rvalue == 1){
cout<<"Parsing failed"<<endl;
return 1;
}
PrettyPrinter printer;
output->accept(printer);
}

Each class in my AST.h inherits from Expression or Statement.

1
2
3
4
5
6
7
8
9
10
class Call: public Expression {
public:
Expression& targetExpr;
ExpressionList& argumentsList;
Call( Expression& targetExpr, ExpressionList& argumentsList): targetExpr(targetExpr), argumentsList(argumentsList){};
void accept(Visitor& v) override {
v.visit(*this);
}
};

The accept function that overrides a virtual method is part of the Visitor Pattern. Instead of defining a pretty-print method in each of my classes in AST.h, I define a PrettyPrinter class that contains visit(Object& obj) methods for all my nodes. We also need an interface with all our virtual methods in Visitor.h.

This diagram explains the Visitor pattern well.

My PrettyPrinter.h file has recursive calls to each node’s accept() method. For example, for a Block, which contains a list of Statements,

1
2
3
4
5
6
7
8
9
10
11
12
void visit(Block& blk) override {
PrettyPrinter printer;

printf("{");
TABLEVEL++;
for ( auto statement = blk.statements.begin(); statement != blk.statements.end(); statement++ ) {
printf("\n");
(*statement)->accept(printer);
}
printf("\n%s}", string(TABLEVEL-1,'\t').c_str());
TABLEVEL--;
};

I use a global variable TABLEVEL and printf("\n%s}", string(TABLEVEL-1,'\t').c_str()); to get the correct indentation for the body elements of the block.

When I print, I wrap all BinaryExpressions and UnaryExpressions in parentheses.

We can now understand the rest of our Makefile,

1
2
3
4
5
6
7
8
9
all: parser.cpp lexer.cpp main.cpp PrettyPrinter.h Visitor.h
g++ -g -std=gnu++11 main.cpp parser.cpp lexer.cpp


parser.cpp: parser.yy
bison --output=parser.cpp --defines=parser.h -v parser.yy

lexer.cpp: lexer.lex
flex --outfile=lexer.cpp --header-file=lexer.h lexer.lex

The compiled binary, a.out, will take in a file through a redirect ./a.out < tests/good1.mit and pretty-print!

Other references

Here is a complete compiler tutorial that is closer to a real language than most things. Most other Flex/Bison examples are stupid calculators x.x.

This tutorial is very very thorough.

The ANSI C Lex spec and Yacc (old version of Bison) file are helpful in constructing our own grammar and lex rules.