Archive for the ‘F#’ Category

How do I parse a csv file using yacc?


Parsing csv files… it’s tedious, it’s ugly and it’s been around forever. But if you’re working with legacy systems from before the rise of XML, chances are you will have to handle csv files on a regular basis.

Parsing csv files seems like a task perfectly suited to a standard library or framework. However, this ostensibly easy task is apparently not quite that easy to tackle in a generic way. Despite a substantial effort, I haven’t been able to find a library that I find widely applicable and pleasant to work with. Microsoft BizTalk has some nice flat-file XSD extensions, but enrolling BizTalk on each of your projects involving csv files is hardly a palatable approach. A more light-weight option is the Linq2CSV project on CodeProject. This project allows you to define a class describing the entities in each line of the csv file. Once this is done, the project provides easy means for reading the data of a csv file and populating a list of objects of the class just defined. If the input doesn’t conform to the expected format, an exception is thrown containing descriptions and line numbers of the errors encountered. This seems like a really nice approach, and I will probably be using it on a few small projects in the near future.
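
To give a feel for the approach, here is a rough sketch of Linq2CSV-style usage from F#. Note that the attribute and class names below are written from memory and may well differ from the actual Linq2CSV API; consult the CodeProject article for the real details.

// Hypothetical sketch of the Linq2CSV style of API; names are from memory
// and may not match the actual library. One class instance per csv line.
open LINQtoCSV

type Measurement() =
    let mutable userId = 0
    let mutable value = 0.0
    [<CsvColumn(FieldIndex = 1)>]
    member x.UserId with get() = userId and set v = userId <- v
    [<CsvColumn(FieldIndex = 2)>]
    member x.Value with get() = value and set v = value <- v

let readMeasurements fileName =
    let description = new CsvFileDescription(FirstLineHasColumnNames = false)
    let context = new CsvContext()
    // malformed lines surface as an exception carrying descriptions and line numbers
    context.Read<Measurement>(fileName, description) |> List.of_seq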

However, the proper way of parsing is of course to bring out the big guns: yacc (Yet Another Compiler Compiler). As its name suggests, yacc is intended for tasks much more complex than parsing a bunch of one-liners. Yacc is a code-generation tool: it generates a parser from a context free grammar specification. The generated parsers are of a type known as LR parsers and are well suited for implementing DSLs and compilers. Yacc even comes in a variety of flavors, including implementations for C, C#, ML and F#.

Below, I will show you how to parse a csv file in a sample format, generating a corresponding data structure. I will be using the F# yacc implementation (fsyacc) which comes with the F# installation (I’m using F# 1.9.6.2).

The parsing process

When we’re parsing text, the ultimate goal is to recognize the structure of a string of incoming characters and create a corresponding datastructure suitable for later processing. When designing compilers, the datastructure produced by the parser is referred to as an abstract syntax tree, but I will just refer to it as the datastructure.

Parsing the incoming text falls into these steps:

[Figure: the parsing process — the lexer turns the incoming characters into tokens, and the parser turns the tokens into the datastructure]

The parser relies on a lexer to turn the incoming string of characters into a sequence of tokens. Tokens are characterized by their type and some may carry annotations. Thus, a token of type INT, representing an integer in the input stream, will have the actual integer value associated to it, because in most cases we will be interested in the particular value, not just the fact that some integer was present in the input. On the other hand, a token of type EOL, representing an end-of-line, will not carry any extra information.
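
In F#, a token like this is naturally modelled as a case of a discriminated union, with the annotation as the case’s payload. A minimal sketch using the two examples just mentioned (the actual token type for our parser will be generated by fsyacc from the %token declarations we will see later):

type ExampleToken =
    | INT of int   // annotated with the actual integer value
    | EOL          // carries no extra information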

We will not cover the details of the lexer in this posting.

The data format

The data we will be parsing will look like this:


328 15 20.1
328 13 11.1
328 16 129.2
328 19 4.3

Each line contains two integers followed by a decimal value, each separated by whitespace. This is not a traditional csv format, since the values are separated by whitespace rather than commas. However, it is straightforward to adapt a lexer tokenizing the format above into one processing a more traditional format.

The dataset represents the result of measuring various health parameters of a person. The first integer of each line identifies a person. The next integer identifies a parameter and the decimal represents the value measured. Thus, if parameter 15 is BMI (body mass index), the first line above states that user 328 has a BMI of 20.1.

To parse this data format, we will need the following tokens:

Token   Description                   Annotation
INT     An integer                    The integer value
FLOAT   A decimal                     The decimal value
EOR     End of record (end of line)   -
EOF     End of input (end of file)    -
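
For reference, an fslex rule producing these tokens might look roughly like this. This is a sketch only: as noted, this posting does not cover the actual lexer, and helper names such as Lexing.lexeme vary between F# releases.

rule tokenize = parse
| [' ' '\t']+                { tokenize lexbuf }                                    (* skip whitespace *)
| ['0'-'9']+                 { INT (System.Int32.Parse(Lexing.lexeme lexbuf)) }     (* integer *)
| ['0'-'9']+ '.' ['0'-'9']+  { FLOAT (System.Double.Parse(Lexing.lexeme lexbuf)) }  (* decimal *)
| '\n'                       { EOR }                                                (* end of record *)
| eof                        { EOF }                                                (* end of input *)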

The datastructure

To represent the data, we will use a very simple F# datastructure, defined like this:

module Ast =
    type Line = { userid : int; parameterid : int; value : float; }
    type DataSet = DataSet of Line list
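
Just to illustrate the shape of the datastructure, here is what the first two sample lines would look like if we constructed the value by hand (this is exactly what the parser will be doing for us):

let sample =
    Ast.DataSet
        [ { Ast.userid = 328; Ast.parameterid = 15; Ast.value = 20.1 };
          { Ast.userid = 328; Ast.parameterid = 13; Ast.value = 11.1 } ]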

The context free grammar

For yacc to be able to generate a parser which makes sense of the data format described above, we need to provide yacc with instructions on how to convert a sequence of tokens into the components of the datastructure. We do this in the form of a context free grammar and a set of associated semantic actions. To understand these concepts, let’s have a look at the context free grammar we will actually provide to yacc:

DataSet  → LineList
LineList → Line
         | LineList EOR Line
Line     → INT INT FLOAT

Each line in the grammar is traditionally called a production, because it states that the term on the left side of the arrow may be expanded into whatever is on the right side of the arrow. The terms on the left are called non-terminals, because they may be expanded into their constituents, namely the elements on the right. In the grammar above, DataSet, LineList and Line are non-terminals. On the other hand, no productions exist expanding EOR, INT or FLOAT. Thus, these elements are said to be terminals. They are the tokens which the lexer may provide.

The fourth production above states that a Line consists of two consecutive INT tokens followed by a FLOAT token, in that order. The second and third productions combined state that a LineList is either a single Line, or a LineList followed by an EOR token and a Line. Thus, if the parser has identified two Line elements separated by an EOR token, it may consider this a LineList: the first Line is a LineList by the second production, and that LineList followed by an EOR token and a Line is itself a LineList by the third production.
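
To see the productions in action, here is how an input of two lines derives from DataSet, one production applied per step:

DataSet
  → LineList                           (production 1)
  → LineList EOR Line                  (production 3)
  → Line EOR Line                      (production 2)
  → INT INT FLOAT EOR INT INT FLOAT    (production 4, applied twice)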

You should remember that, while the terms in the context free grammar are intimately related to the elements in our datastructure, these concepts are not the same. Also note that we had to introduce the recursively defined LineList element in the grammar to accommodate the concept of a list of elements.

If you’ve never encountered context free grammars before, a more thorough introduction than what I have provided may be desirable. In this case, you may want to consult Wikipedia.

Semantic actions

The datastructure is constructed in a bottom-up fashion by executing a piece of code each time a production is applied. The piece of code is called the production’s semantic action. For the

Line → INT INT FLOAT

production, we create a corresponding Ast.Line instance (cf. the "The datastructure" section above). These are the semantic actions we will need:

Production                    Semantic action
DataSet  → LineList           DataSet($1)
LineList → Line               [$1]
         | LineList EOR Line  $3 :: $1
Line     → INT INT FLOAT      { userid = $1; parameterid = $2; value = $3; }

The first action creates an Ast.DataSet, passing the Ast.Line list to the constructor. The second creates an Ast.Line list containing a single element, and the third prepends an Ast.Line to the front of the list built so far for the LineList. The last action creates a new Ast.Line, assigning the first integer of the line to Ast.Line.userid, the second integer to Ast.Line.parameterid and the float to Ast.Line.value.

As you have probably guessed, the $x variables in the semantic actions refer to the values associated with the symbols on the right-hand side of each production; in the last action, $1 and $2 are the values of the two INT tokens and $3 is the value of the FLOAT token.

When specifying the semantic action for a production to fsyacc, you enclose the appropriate piece of code in braces after the production. Thus, our parser specification will look like this:

%{
open RI.Statistics.Ast
%}

%start start
%token <System.Int32> INT
%token <System.Double> FLOAT
%token EOR
%token EOF
%type <RI.Statistics.Ast.DataSet> start

%%

start: DataSet { $1 }

DataSet: LineList { DataSet($1) }

LineList: Line { [$1] }
| LineList EOR Line { $3 :: $1 }

Line: INT INT FLOAT { { userid = $1; parameterid = $2; value = $3; } }

Generating and exercising the parser

To generate the parser, you run yacc from the command line, passing the name of the file containing the parser specification above as an argument. For fsyacc, we get:



C:\Users\rui\Projects\Statistics\Parser>fsyacc Parser.fsp --module Parser
building tables
computing first function...time: 00:00:00.1604878
building kernels...time: 00:00:00.1359233
building kernel table...time: 00:00:00.0407968
computing lookahead relations.............time: 00:00:00.0723062
building lookahead table...time: 00:00:00.0439270
building action table...time: 00:00:00.0673112
building goto table...time: 00:00:00.0099398
returning tables.
10 states
5 nonterminals
7 terminals
6 productions
#rows in action table: 10

C:\Users\rui\Projects\Statistics\Parser>

This produces two files, Parser.fs and Parser.fsi, which contain the implementation of the parser. We will include them when we compile the parser.
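
Compiling is then a matter of handing the files to fsc in dependency order, something along these lines (the exact file names are assumptions based on my project layout, and depending on your F# version you may also need a reference to the assembly providing the Lexing and Parsing modules):

fsc -o:ConsoleApplication.exe Ast.fs Parser.fsi Parser.fs Lexer.fs Program.fs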

To test the parser, we create a console application which will parse the sample data presented earlier and print the resulting datastructure:

#light
open Lexing
open Parser

// Note: this sample input uses ',' as the decimal separator, which is what
// my lexer happens to accept; adjust the data (or the lexer) to taste.
let x = @"328 15 0,0
  328 13 11,1
  328 16 129,2
  328 19 4,3"

let parse() =
    let lexbuf = Lexing.from_string x
    Parser.start Lexer.Parser.tokenize lexbuf

let data = parse()
printfn "%A" data

Compiling this with the generated parser and executing the resulting console application results in this:



C:\Users\rui\Projects\Statistics\ConsoleApplication\bin\Debug>ConsoleApplication.exe
DataSet [{userid = 328; parameterid = 19; value = 4.3;};
         {userid = 328; parameterid = 16; value = 129.2;};
         {userid = 328; parameterid = 13; value = 11.1;};
         {userid = 328; parameterid = 15; value = 0.0;}]

C:\Users\rui\Projects\Statistics\ConsoleApplication\bin\Debug>

Presto, we have our datastructure! And with only a minimal amount of code! Note, though, that the lines appear in reverse order; this is because the LineList semantic action prepends each new Line to the front of the list.

So, is this really an appropriate approach for parsing csv files? Well, no, not quite. Even though the procedure described above is rather straightforward, there is no error reporting facility, making it inappropriate for anything but a prototype application. Thus, for parsing csv files, the aforementioned Linq2CSV project, or something similar, will probably give us much more initial headway than the yacc approach. But the yacc approach scales very well with the complexity of the input, and may become the better choice as that complexity increases.
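
If you do go down this road anyway, the least you should do is guard the parse and report failures. A minimal sketch (the exception details available depend on your fsyacc version):

// Minimal error guard around the parser. fsyacc-generated parsers raise an
// exception on malformed input; here we just catch it and report the message.
let tryParse text =
    let lexbuf = Lexing.from_string text
    try
        Some (Parser.start Lexer.Parser.tokenize lexbuf)
    with ex ->
        printfn "Parse failed: %s" ex.Message
        None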

UPDATE: It has come to my attention that Robert Pickering, in the second edition of his book on F#, intends to include a special section on parsing text files using other means than fsyacc. Thus, if you’re reading this posting with the intention of actually using the method described above for production, you may want to consult Robert Pickering’s book for alternatives.


A gotcha when using fslex with #light syntax


Tonight I’ve been implementing a small lexer/parser pair in order to be able to read data from a csv-like file and process the data using F# Interactive. I originally chose F# over C# because F# fits well with the data processing I’m doing. I expected to use a library written in C# for loading the data, but decided to use fslex and fsyacc instead, just for the fun of it. However, I came across a problem with fslex (or my understanding of fslex) and I thought I’d share the solution:

I want all the code for loading the data to go into a module named Parser. Therefore, my Lexer.fsl begins with the following code:

{
module Parser =
  open System
  open Lexing
  open Parser
}


This section of the lex specification is traditionally called the definition section and contains initial code I want copied into the final lexer. The code produced by fslex will look something like

module Parser =
  open System
  open Lexing
  open Parser

# 8 "Lexer.fs"
let trans : uint16[] array =
[|
(* State 0 *)

(* ...lots of code... *)

let rec __fslex_dummy () = __fslex_dummy()
(* Rule tokenize *)
and tokenize (lexbuf : Microsoft.FSharp.Text.Lexing.LexBuffer<_>) = __fslex_tokenize 0 lexbuf
and __fslex_tokenize __fslex_state lexbuf =
match __fslex_tables.Interpret(__fslex_state,lexbuf) with
| 0 -> (
# 14 "Lexer.fsl"
tokenize lexbuf
# 50 "Lexer.fs"
)
| 1 -> (
# 15 "Lexer.fsl"
EOR
# 55 "Lexer.fs"

(* ...more code... *)

However, when trying to build the generated lexer, the compiler would complain that “the value or constructor ‘EOR’ is not defined”. This had me stumped for a while, until I realized that the code generator wasn’t indenting the code properly. The

let rec __fslex_dummy () = ...

wasn’t indented at all, thus the

open System
open Lexing
open Parser

statements, which were indented to be part of the Parser module, weren’t in scope anymore. To fix this, I had to fall back to the more verbose syntax

module Parser = begin
  open System
  open Lexing
  open Parser

  (* code code code *)

end

This made the compiler concur.

Just like you can add any valid F# code to the definition section of the lexer specification, you can add a similar section, called the user subroutines section, to the end of the file. fslex will copy it to the end of the generated code. Thus, we can have fslex generate the desired code by changing the definition section of the lexer specification to

{
module Parser = begin
  open System
  open Lexing
  open Parser
}

and adding a closing section like

{
end
}

I guess the code generator ought to have handled the problem by indenting the generated code properly, but this is just a CTP, so hopefully it will be fixed in the future.


F# and the Euler project


At JAOO 2008 I attended the presentation “Learning F# and the Functional Point of View” by Robert Pickering. It’s been quite a while since I’ve done functional programming (I think it was when we implemented a small compiler in SML at DIKU a few years back), so this was a nice opportunity to revisit the functional paradigm. Since “real men use the stack” and since functional programming and its immutable datastructures will really come into their own when we’re talking concurrent applications and the plethora-of-processors machines of the future, it is about time we start considering functional programming for mainstream applications.

All of the above led me to pick up F# and a book (“Foundations of F#“, also by Robert) this past weekend. In order to get a real feel for the language I decided to tackle some of the problems at Project Euler while reading the book. Thus, without further ado, here’s the Euler Project’s Problem #3 and my solution using F#:

Problem 3:
The prime factors of 13195 are 5, 7, 13 and 29.
What is the largest prime factor of the number 600851475143 ?

Solution using F#:

(comments below)

let doesNotContainAsFactors (p : bigint) xs =
    List.for_all (fun a -> not (p % a = 0I)) xs

let rec nextPrime (x : bigint) primes =
    if doesNotContainAsFactors x primes then
        x, x :: primes
    else
        nextPrime (x + 1I) primes

(* Find the largest prime factor of x *)
let largestPrimeFactor x =
    let rec innerLargestPrimeFactor (x : bigint) primesLessThanCandidate candidate =
        if candidate = x then
            x
        else
            match x % candidate with
            | 0I -> innerLargestPrimeFactor (x / candidate) primesLessThanCandidate candidate
            | _ ->
                match nextPrime candidate primesLessThanCandidate with
                | (newPrime, newPrimesList) ->
                    innerLargestPrimeFactor x newPrimesList newPrime
    innerLargestPrimeFactor x [] 2I

First, we define a function

doesNotContainAsFactors : bigint -> bigint list -> bool.

This function takes an integer p and a list of integers xs (as you can see from the definition they are really bigints, but for convenience I’m going to continue to refer to them as integers). The function returns true if no element of xs is a divisor of p. Thus, if xs contains all primes smaller than p, doesNotContainAsFactors will return true exactly when p is a prime.
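
For example, supplying all the primes below the candidate (these evaluations can be checked in F# Interactive):

doesNotContainAsFactors 7I [2I; 3I; 5I]      // true:  no smaller prime divides 7, so 7 is prime
doesNotContainAsFactors 9I [2I; 3I; 5I; 7I]  // false: 3 divides 9
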
The next function to consider is

nextPrime : bigint -> bigint list -> bigint * bigint list.

This function takes an integer x and a list primes of integers. When primes contains all the primes less than x, the function uses doesNotContainAsFactors to find the first prime greater than or equal to x. It returns this prime along with a list of all primes less than or equal to the newly found prime.
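
For example, starting the search at 4 with the primes below 4 already collected:

nextPrime 4I [3I; 2I]   // evaluates to (5I, [5I; 3I; 2I]): 4 has 2 as a factor, 5 does not
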
Next comes the main function,

largestPrimeFactor : bigint -> bigint

which is to take an integer x and return its largest prime factor. It does this by testing whether 2 divides x. If it doesn’t, nextPrime is used to find the next prime and the test is repeated. When a prime does divide x, we factor it out and repeat the test on the quotient. Since the primes we test are continually getting larger, we will eventually have factored out all but the largest prime divisor. Once all but one occurrence of this largest prime divisor has been factored out, the candidate prime we are trying to divide into x will equal x. Thus, once we recognize that this is the case, we return x.
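
As a worked example, take 13195 from the problem statement. Compressing the nextPrime steps, the inner function proceeds roughly like this:

largestPrimeFactor 13195I
  = innerLargestPrimeFactor 13195I [] 2I      (2 and 3 do not divide 13195)
  → innerLargestPrimeFactor 2639I  ... 5I     (5 divides 13195, leaving 2639)
  → innerLargestPrimeFactor 377I   ... 7I     (7 divides 2639, leaving 377)
  → innerLargestPrimeFactor 29I    ... 13I    (13 divides 377, leaving 29)
  → 29I                                       (the candidate reaches 29 = x, which is returned)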

So, what is the largest prime factor of 600851475143? Well, we evaluate largestPrimeFactor 600851475143I and see that the answer is 6857.

I am not entirely happy with the performance of the functions above. Finding the largest prime factor of 600851475143 takes a couple of seconds on my laptop. I haven’t analyzed the complexity of the code above, but it sure doesn’t feel right. That might be a topic for a future post. Also, comments are welcome :-)
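
For comparison, here is a quick sketch of plain trial division (my own addition, not part of the original code). It keeps no prime list at all: because every factor is divided out as soon as it is found, any candidate that divides the remaining n must itself be prime, and the search can stop once the candidate squared exceeds n.

// Trial division sketch: divide out each factor as soon as it is found.
// Whatever remains once candidate * candidate exceeds n is itself prime.
let largestPrimeFactorByTrialDivision (n : bigint) =
    let rec loop n candidate largest =
        if candidate * candidate > n then
            if n > 1I then n else largest
        elif n % candidate = 0I then
            loop (n / candidate) candidate candidate
        else
            loop n (candidate + 1I) largest
    loop n 2I 1I

// largestPrimeFactorByTrialDivision 600851475143I evaluates to 6857I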