Lexing and Parsing with F# – Part I
September 10th 2010 | David Cooksey
FsLex and FsYacc are F# implementations of OCaml's ocamllex and ocamlyacc. They are part of the F# PowerPack released for Visual Studio 2010. Used together, they take an input string and produce a parse tree of typed objects. Once you have the parse tree, you can do anything with it: generate code for another language, interpret it directly, and so on. If you'd like to jump right into the code, scroll to the bottom of the post. Note that the code includes a small test project using FsUnit.
FsLex is the lexer half of the lexer-parser pair. It converts an input string or stream into a series of tokens. A token is simply a string labeled so that the parser knows how to handle it. For example, '92' might become a token labeled INT. Simply put, the lexer's job is to take a meaningless string, label the parts of it relevant to the language, and hand the labeled parts, in order, to the parser.
FsYacc is the parser component. The parser’s job is to impose syntactical rules onto the stream of typed tokens generated by the lexer. In an analogy to spoken languages, the tokenizer recognizes words and the parser recognizes correct sentence structures.
Let's start with a very simple syntax that requires a valid program to be a single assignment of an integer or string to a variable. Working with FsLex and FsYacc primarily involves three files. Lexer.fsl defines the allowed tokens (the lexer component) and Parser.fsy defines the allowed syntax (the parser component). The third file (which we will call Statements.fs) defines the types that we will use to build the parse tree.
In it we define the types Stmt and Expr. Stmt has only a single case, called Assign, which is a tuple consisting of a string and an Expr. Its dependent type, Expr, can be a string (typed as Val) or an Int32 (typed as Int).
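The original Statements.fs listing is not reproduced here; a minimal sketch consistent with the description above might look like this (names follow the text, but exact layout is an assumption):

```fsharp
// Statements.fs - a reconstructed sketch of the AST types described above
module Statements

// An expression is either a string value or an integer value
type Expr =
    | Val of string
    | Int of int

// A statement is a single assignment: variable name and assigned expression
type Stmt =
    | Assign of string * Expr
```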
Note that this single file gives a very concise picture of what kind of language we are parsing. Assignments of strings and integers are allowed, nothing else.
Next comes the lexer.
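The original lexer listing is not shown in this extraction; the following is a reconstruction of what Lexer.fsl might contain, based on the discussion below (the exact line layout of the original file may differ):

```
{
// Lexer.fsl - a reconstructed sketch; names and layout are assumptions
module Lexer

open System
open Parser   // tokens (ID, INT, EQUAL, ...) are defined in Parser.fsy
open Microsoft.FSharp.Text.Lexing

// Extract the matched text from the lexer buffer
let lexeme lexbuf = LexBuffer<char>.LexemeString lexbuf
}

let digit      = ['0'-'9']
let whitespace = [' ' '\t']
let newline    = ('\n' | '\r' '\n')
let equal      = '='
let semicolon  = ';'
let quote      = '"'

rule tokenize = parse
| whitespace          { tokenize lexbuf }   // skip whitespace
| newline             { tokenize lexbuf }   // skip line breaks
| ['a'-'z' 'A'-'Z']+  { ID(lexeme lexbuf) }
| digit+              { INT(Int32.Parse(lexeme lexbuf)) }
| equal               { EQUAL }
| semicolon           { SEMICOLON }
| quote               { QUOTE }
| eof                 { EOF }
```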
Take note of a few things here. The ‘open Parser’ statement on line 5 indicates that the lexer is dependent on the parser. This is because the tokens (ID, INT, EQUAL, etc.) used on lines 18-23 are defined in the Parser file. The syntax used for matching tokens is similar to regular expressions. Line 18 ignores whitespace characters (defined above as space, tab, carriage return, and newline) by calling tokenize on the next character in the buffer. The ID and INT tokens can contain multiple characters. Token patterns can be defined above (like digit, equal, semicolon, …) or directly within the parse rules (such as the allowed pattern for ID). Any character in the input stream that does not match one of these patterns will generate an error.
Last, the parser.
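As with the lexer, the parser listing is missing here; a sketch of Parser.fsy consistent with the description that follows might look like this (line numbers cited below refer to the original listing and may not match this reconstruction exactly):

```
%{
// Parser.fsy - a reconstructed sketch; exact layout is an assumption
open Statements
%}

%token <string> ID
%token <int> INT
%token EQUAL SEMICOLON QUOTE EOF

%start start
%type <Stmt> start

%%

start: StmtExp { $1 }

StmtExp:
    | ID EQUAL Expr SEMICOLON  { Assign($1, $3) }

Expr:
    | INT             { Int($1) }
    | QUOTE ID QUOTE  { Val($2) }
```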
Here, it all comes together. Lines 7 and 8 declare that the program's entry point is start, which is of type Stmt as defined in the Statements file. Line 16 declares that the program must consist of a single StmtExp, which expands according to the remainder of the rules. Each rule pairs an allowed pattern of tokens (declared above with %token) with the value it produces. The $1, $2, $3 references inside a rule's action refer, by 1-based position, to the matched tokens on the left. For example, the $2 on line 19 refers to the contents of the ID token. Let's expand the syntax above for a specific example.
StmtExp -> ID EQUAL Expr SEMICOLON -> ID EQUAL INT SEMICOLON -> varname = 5;

Which becomes Assign("varname", Int(5))

StmtExp -> ID EQUAL Expr SEMICOLON -> ID EQUAL QUOTE ID QUOTE SEMICOLON -> varname = "string";

Which becomes Assign("varname", Val("string"))
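To see these expansions in action, the generated lexer and parser can be wired together with a small driver. The function name parse below is hypothetical; the sketch assumes the lexer's entry rule is named tokenize and the parser's start rule is named start, as discussed in the text:

```fsharp
// Hypothetical driver wiring the generated Lexer and Parser modules together.
open Microsoft.FSharp.Text.Lexing
open Statements

let parse (input: string) : Stmt =
    let lexbuf = LexBuffer<char>.FromString input
    Parser.start Lexer.tokenize lexbuf

// parse "varname = 5;" would yield Assign("varname", Int 5)
```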
We’ll see more complicated expansions, along with a bit more of the theory, in part 2.
One interesting expansion of the lexer/parser would be to adjust it to handle typed variable definitions.
FsLex and FsYacc info: http://en.wikibooks.org/wiki/F_Sharp_Programming/Lexing_and_Parsing