Home > F# > Lexing and Parsing with F# – Part I

Lexing and Parsing with F# – Part I

September 10th 2010 | David Cooksey

Lexing and Parsing with F# – Part 1

FsLex and FsYacc are F# implementations of Ocaml’s Lex and Yacc. They are part of the F# Powerpack released for Visual Studio 2010. Used together, they take an input string and create a parse tree of typed objects. Once you have the parse tree, you can do anything with it—generate code for another language, interpret it directly, etc. If you’d like to jump right into the code, scroll to the bottom of the post. Note that the code includes a small test project using FsUnit.


FsLex is the lexer part of the lexer-parser pair
. It converts an input string or stream into a series of tokens. A token is simply a string labeled so that the parser knows how to handle it. For example, ‘92’ might be within a token labeled DIGIT. Simply put, the lexer’s job is to take a meaningless string, label the parts of it relevant to the language, and hand the labeled parts, in order, to the parser.


FsYacc is the parser component.
The parser’s job is to impose syntactical rules onto the stream of typed tokens generated by the lexer. In an analogy to spoken languages, the tokenizer recognizes words and the parser recognizes correct sentence structures.
Let’s start with a very simple syntax that requires a valid program to be a single assignment of an integer or string to a variable. Working with Lex and Yacc primarily requires altering three files. Lexer.fsl defines the allowed tokens (the lexer component) and Parser.fsy defines the allowed syntax (the parsing component). The third file (which we will call Statements.cs) defines the types that we will use to generate the parse tree.

In it we define allowed types for Stmt and expr.  Stmt can be only a single type, called Assign, which is a tuple consisting of a string and an expr.  Its dependent type, expr, can be a string (typed as Val) or an Int32 typed as Int.

Note that this single file gives a very concise picture of what kind of language we are parsing. Assignments of strings and integers are allowed, nothing else.

Next comes the lexer.


Take note of a few things here.  The ‘open Parser’ statement on line 5 indicates that the lexer is dependent on the parser.  This is because the tokens (ID, INT, EQUAL, etc.) used on lines 18-23 are defined in the Parser file.  The syntax used for matching tokens is similar to regular expressions.  Line 18 ignores whitespace characters (defined above as space, tab, carriage return, and newline) by calling tokenize on the next character in the buffer.  The ID and INT tokens can contain multiple characters.  Token patterns can be defined above (like digit, equal, semicolon, …) or directly within the parse rules (such as the allowed pattern for ID).  Any character in the input stream that does not match one of these patterns will generate an error.

Last, the parser.


Here, it all comes together.  Lines 7 and 8 indicate that the program starts with start: which is of type Stmt as defined in the Statements file.  Line 16 declares the program must consist of a single StmtExp, which expands according to the remainder of the rules.   Rules are defined as an allowed pattern of tokens (defined above using %token ) and the types that they generate.  The $1, $2, $3 references inside the types indicate the 1-based index position within the tokens to the left.  For example, the $2 on line 19 refers to the contents of the ID token.  Let’s expand the syntax above for a specific example.

StmtExp -> ID EQUAL Expr SEMICOLON -> ID EQUAL INT SEMICOLON -> varname = 5;

Which becomes Assign(“varname”, Int(5))

StmtExp -> ID EQUAL Expr SEMICOLON -> ID EQUAL QUOTE ID QUOTE SEMICOLON -> varname = “string”;

Which becomes Assign(“varname”, Val(“string”))

We’ll see more complicated expansions, along with a bit more of the theory, in part 2.

One interesting expansion of the lexer/parser would be to adjust it to handle typed variable definitions.

Setting up FsLex and FsYacc within a project requires a few manual changes.  See this blog by Don Syme for details on setting up your project manually (or use the project file included here).

Resources:

Fslex and FsYacc Info: http://en.wikibooks.org/wiki/F_Sharp_Programming/Lexing_and_Parsing

F# Powepack: http://fsharppowerpack.codeplex.com/wikipage?title=FsYacc

FSUnit: http://fsunit.codeplex.com/

David Cooksey is a Senior .NET developer at Thycotic Software, an agile software services and product development company based in Washington DC. Secret Server is our flagship password management software product.

Categories: F# Tags: , ,
  1. October 2, 2012 at 7:07 pm

    Thank you! This helped me a lot learning how FsLex/FsYacc works together. This is the best guide on the topic I found so far. Kudos!

  1. September 12, 2010 at 7:07 pm
  2. May 23, 2012 at 6:05 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: