Archive for the ‘.NET’ Category

Book review: Microsoft Windows Azure Development Cookbook

1 Comment »

About a week ago PACKT published “Microsoft Windows Azure Development Cookbook” by Neil Mackenzie. I got it from Amazon on Thursday and have been reading it these last couple of days. I’ll share my thoughts below.

[Image: Microsoft Windows Azure Development Cookbook cover]

 

Cookbooks are rarely meant to be read cover-to-cover, so doing just that has been quite a dry read. Had this been a regular, educational book on Windows Azure I would say that it is missing a bit of spice; some war stories, some personal opinions or a pinch of humour would have done wonders for the reading experience. Thus, this isn’t the type of book you bring with you on vacation or read while commuting. It is meant to sit on a shelf next to your work station and to be consulted when you encounter a specific problem.

 

The book contains 80 or so recipes for solving various concrete tasks related to Windows Azure development. They are grouped into chapters on blobs, tables, queues, hosted services, diagnostics, the management API, SQL Azure and AppFabric. Within each chapter recipes are roughly sorted in order of increasing level of complexity and each recipe follows the same pattern:

  • An introduction to the problem / scenario
  • A list of prerequisites for the sample solution
  • Step-by-step instructions on building the sample solution
  • A summary of how the sample solution works

This feels like a sound approach and it works quite well.

 

In the preface of the book it says “If you are an experienced Windows Azure developer or architect who wants to understand advanced development techniques when building highly scalable services using the Windows Azure platform, then this book is for you.” However, if you are just starting out with Windows Azure, you shouldn’t let this put you off. Quite a number of the recipes will be of value to newcomers to the platform, and if you are starting your first Azure project, many of them are immediately applicable and very valuable.

I think this may be the book’s strongest point as well as its Achilles’ heel: If you have taken an introductory course to Windows Azure and been through the Windows Azure Platform Training Kit, you will, at least in theory, know much of the material in this book. That being said, for some recipes Mackenzie adds a “There’s more…” section which puts the material just covered into perspective or relates it to other parts of the Windows Azure Platform. This is my favourite section and is where even seasoned Windows Azure developers may find some valuable nuggets.

 

All in all, I would highly recommend this book if you have had a general introduction to Windows Azure and are getting ready to get your hands dirty.

On the other hand, if you are an experienced Azure developer it won’t take you to the next level.


Presenting at Miracle Open World 2011

No Comments »

[Image: Miracle Open World 2011 logo]

I will be giving a talk on .NET and Windows Azure at Miracle Open World 2011 in Billund on the 15th of April.


Configuring RDP access to Windows Azure VMs

No Comments »

As part of the launch of Windows Azure SDK 1.3, Microsoft has made it possible to connect to our Windows Azure VMs using Remote Desktop.

Using the Windows Azure Tools for Visual Studio 2010 it is straightforward to configure your application to allow remote desktop access to the VMs:

  • Right-click the service in VS2010 and choose “Publish”
  • Click the link “Configure Remote Desktop Connections…”
  • Check “Enable connections for all roles” and choose or create a certificate to encrypt the password
  • Enter username, password and the user’s expiry date and click “OK”.
  • Perform the rest of the deployment as usual (make sure to upload the certificate, if you chose to create a new one)

If all you want is remote desktop access to your Azure instances this will do. If you want to understand what is going on and how to manually configure your application for remote desktop access then read on.

Manually configuring the application for remote desktop access is a two-part process:

  • The application must be configured to allow RDP connections on port 3389
  • You need to create an encrypted password and enable the Fabric Controller to access the certificate used for the encryption.

Configuring the application

In order for the Windows Azure load balancer to allow inbound RDP connections, you have to enable this in your service definition. Open the ServiceDefinition.csdef file and locate the WebRole/WorkerRole sections describing the roles for which you want to allow remote desktop access. Import the module “RemoteAccess” in each section:

<Import moduleName="RemoteAccess" />

 

Because of this import the Fabric Controller will set up internal TCP endpoints on port 3389 for each role.

In addition to these internal endpoints, exactly one role needs to define an input endpoint to which the load balancer can route inbound RDP connections. When you access an instance via RDP, this role will forward the connection to the instance to which you intend to connect.

Thus, you have to add the following to exactly one of the sections in ServiceDefinition.csdef:

<Import moduleName="RemoteForwarder"/>

 

Assuming that we started out with a service containing one web role and one worker role, our service definition now looks like this:

<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="RDService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebRole1">
    <Sites>
      <Site name="Web">
        <Bindings>
          <Binding name="Endpoint1" endpointName="Endpoint1" />
        </Bindings>
      </Site>
    </Sites>
    <Endpoints>
      <InputEndpoint name="Endpoint1" protocol="http" port="80" />
    </Endpoints>
    <Imports>
      <Import moduleName="RemoteAccess"/>
      <Import moduleName="Diagnostics" />
    </Imports>
  </WebRole>
  <WorkerRole name="WorkerRole1">
    <Imports>
      <Import moduleName="RemoteAccess"/>
      <Import moduleName="RemoteForwarder"/>
      <Import moduleName="Diagnostics" />
    </Imports>
  </WorkerRole>
</ServiceDefinition>

If you have used VS2010 to edit ServiceDefinition.csdef you will see that a number of settings have been added to each role in the service configuration file, ServiceConfiguration.cscfg (if you haven’t used VS2010 you will have to add them yourself):

<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.Enabled" value="" />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountUsername" value="" />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountEncryptedPassword" value="" />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountExpiration" value="" />

Moreover, for the role acting as Remote Forwarder VS2010 has added:

<Setting name="Microsoft.WindowsAzure.Plugins.RemoteForwarder.Enabled" value="" />

 

and, finally, a new element has been added to the Certificates section:

<Certificates>
  <Certificate name="Microsoft.WindowsAzure.Plugins.RemoteAccess.PasswordEncryption"
               thumbprint="" thumbprintAlgorithm="sha1" />
</Certificates>

The first four settings are obviously there to allow the Fabric Controller to configure each VM when the application is deployed. At this point, we can fill in all of these fields with the exception of the AccountEncryptedPassword field. In my case, I enter “true”, “rune” and “2010-11-06 00:00:00Z” (see ‘Setting Up A Remote Desktop Connection For A Role’ for information on the date/time format). Moreover, for WorkerRole1 I put “true” in the “Microsoft.WindowsAzure.Plugins.RemoteForwarder.Enabled” field.

Specifying the password

We need to choose a password for our remote desktop user, and for security reasons it needs to be encrypted before we enter it into ServiceConfiguration.cscfg. Thus, we need to perform the following tasks:

  • Create an X509 certificate
  • Encrypt a password using this certificate
  • Put the encrypted password in ServiceConfiguration.cscfg
  • Provide the certificate to the Fabric Controller
  • Specify the certificate’s unique name (thumbprint) in ServiceConfiguration.cscfg

To create an X509 certificate, run a Visual Studio Command Prompt as administrator and execute the following command:

makecert -sky exchange -r -n "CN=AzureRD" -pe -a sha1 -len 2048 -ss My "AzureRD.cer"

(you can run makecert -! to see a description of the flags used and read general guidelines for creating a certificate here).

This will create a certificate and store it in the file AzureRD.cer as well as install it into the local certificate store. You can verify this using PowerShell:

PS C:\Users\Rune Ibsen> cd cert:\CurrentUser\My

PS cert:\CurrentUser\My> dir

    Directory: Microsoft.PowerShell.Security\Certificate::CurrentUser\My

Thumbprint Subject
———- ——-
FF89E4AEFF26E891CAD29F9C59F6E5F9050B2337 Windows Azure Tools
55825F4612F9EDCAB073DE98AA5419FA710A2973 AzureRD

Notice that this listing contains the certificates’ thumbprints. You will need your certificate’s thumbprint shortly.

Next, we need to actually use the certificate to encrypt the password. Fire up PowerShell and run the following commands:

[Reflection.Assembly]::LoadWithPartialName("System.Security")
$pass = [Text.Encoding]::UTF8.GetBytes("yourpassword")
$content = new-object Security.Cryptography.Pkcs.ContentInfo -argumentList (,$pass)
$env = new-object Security.Cryptography.Pkcs.EnvelopedCms $content
$env.Encrypt((new-object System.Security.Cryptography.Pkcs.CmsRecipient(gi cert:\CurrentUser\My\55825F4612F9EDCAB073DE98AA5419FA710A2973)))
[Convert]::ToBase64String($env.Encode())

The encoded string resulting from the last command is what you need to put in the AccountEncryptedPassword fields in ServiceConfiguration.cscfg. Moreover, put the certificate’s thumbprint as the value of the thumbprint attribute in ServiceConfiguration.cscfg.

In order for the Fabric Controller to configure the virtual machines with the chosen password, the Fabric Controller needs to have access to the certificate that was used to encrypt the password. In particular, you will have to provide a personal information exchange (PFX) certificate.

To create a PFX certificate go to a command prompt on your local machine and run

certmgr.msc

This will bring up the certificate manager. Locate the newly created certificate, right-click it and choose “All Tasks –> Export…”. Follow the wizard, indicating along the way that you want to export the private key. Once complete, this process will create a .pfx file.

Now go to the Windows Azure portal and create a new hosted service. Then select the Certificates folder and click Add Certificate in the upper left corner. Upload the .pfx file you just created.

Next, deploy the application to the hosted service just created and, once deployment is complete, connect to a VM using RDP. Presto! You see the familiar Windows 2008 desktop!

I think it is worth noting that you can enable and disable remote desktop access on a per-role basis at runtime, that is, without having to do a redeploy.

Also notice that if you enumerate the instance endpoints for your roles, you will see that they now have an instance endpoint named “Microsoft.WindowsAzure.Plugins.RemoteAccess.Rdp”. I did this by connecting to an instance via RDP and enumerating the endpoints using PowerShell:

[Screenshot: PowerShell listing of the role instance endpoints]

If you inspect the endpoints closer, you will see that they are listening on port 3389.
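The PowerShell output is only available as a screenshot, but for reference a similar check can be done from code running inside a role. Below is a minimal F# sketch of my own (not from the original post), assuming a reference to the Microsoft.WindowsAzure.ServiceRuntime assembly from SDK 1.3:

open Microsoft.WindowsAzure.ServiceRuntime

// Sketch only: list every instance endpoint visible to this role.
let listRdpEndpoints () =
    for role in RoleEnvironment.Roles.Values do
        for instance in role.Instances do
            for endpoint in instance.InstanceEndpoints do
                // With RemoteAccess imported you should see an endpoint named
                // "Microsoft.WindowsAzure.Plugins.RemoteAccess.Rdp" listening on port 3389.
                printfn "%s %s %O" instance.Id endpoint.Key endpoint.Value.IPEndpoint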

References


Architecting for on-premise as well as Azure hosting

2 Comments »

If you’re an ISV and you’re contemplating whether you should migrate your product to Windows Azure, you may be asking yourself if you can have your cake and eat it too, that is, have the exact same code base deployed with customers running on-premise as well as with customers running in the cloud. In this blog post I will offer some technical guidance on this issue.

These are my assumptions:

  1. You want to offer your software as Software as a Service, that is, as a hosted service running on Windows Azure
  2. Each customer will want to have a dedicated virtual machine for running the application (that is, I am not going to go into multi-tenancy here)

I am not making any assumptions about whether your application can currently be scaled horizontally. If it can’t, moving it to Azure will not change this. The Azure load balancers use a round-robin algorithm and have no support for sticky sessions out of the box, so making the application scale will most likely be a non-trivial exercise.

Isolating dependencies

The key to having an application run both on-premise and on Windows Azure is to isolate the application’s dependencies on the hosting environment, hide these dependencies behind suitable abstractions and then provide implementations of these abstractions tailored to each hosting environment.

So, the first step in taking your application to the cloud should be to identify the dependencies for which we will need to provide abstractions. Obviously, we only need to consider the parts of the application’s environment which differ between Windows Azure and an on-premise environment. Typical examples include:

  • Database (SQL Server vs. SQL Azure)
  • Shared file systems

Thus, you need to re-architect your application from something that looks like this:

[Diagram: original architecture with direct dependencies on the hosting environment]

to something that looks more like this:

[Diagram: on-premise architecture with the dependencies hidden behind abstractions]

If you are in luck, you already have an abstraction which somewhat isolates the rest of your application from the particular database implementation. This would be the case if you are using an ORM like Entity Framework, NHibernate or something similar. Isolating other dependencies may require more work.
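To make the idea concrete, here is a minimal F# sketch of what such an abstraction could look like. The names IFileStore, SharedFolderStore and BlobStore are hypothetical, and the blob-backed implementation is only stubbed out:

open System.IO

/// Abstraction over "somewhere to put files" that the rest of the application codes against.
type IFileStore =
    abstract Save : name:string * content:byte[] -> unit
    abstract Load : name:string -> byte[]

/// On-premise implementation backed by a shared folder (e.g. a UNC path).
type SharedFolderStore(root : string) =
    interface IFileStore with
        member this.Save(name, content) = File.WriteAllBytes(Path.Combine(root, name), content)
        member this.Load(name) = File.ReadAllBytes(Path.Combine(root, name))

/// Cloud implementation backed by Windows Azure blob storage (details omitted in this sketch).
type BlobStore(connectionString : string) =
    interface IFileStore with
        member this.Save(name, content) = failwith "upload the bytes to a blob container here"
        member this.Load(name) = failwith "download the bytes from a blob container here"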

Taking it to the cloud

At this point, you have an application with a nice encapsulation of external dependencies. Even though the application is only able to run on-premise at this point, the architecture has already been improved. The next step is obviously to provide implementations of your abstractions suitable for running in the cloud. This is the fun part: if you have identified the right abstractions, you can go nuts with cloud technology, using massively scalable storage, asynchronous communication mechanisms, CDNs etc. This process should give you an application capable of running on Windows Azure:

[Diagram: cloud architecture with Azure-specific implementations of the abstractions]

Configuring dependencies

The code base now contains everything we need for an on-premise installation as well as for a cloud deployment. However, to truly decouple the application from its dependencies, the set of dependency encapsulations to use in a given installation should be easily configurable. To this end, use your favorite dependency injection tool. Unless you decide to roll your own tool, the tool you choose will most likely support multiple methods of configuration and you can choose whatever method you prefer.

If you want to get really fancy, you may even choose to have the application configure its dependencies on its own at startup. The application can use the Windows Azure API to determine whether it is running on Windows Azure. The information is available through the Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment class, which has a static property called IsAvailable.


I usually hide the environment information provider behind a static gateway pattern.
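A minimal F# sketch of such a gateway could look like the following (my own naming, not taken from any library); RoleEnvironment.IsAvailable is the only Azure API involved:

open Microsoft.WindowsAzure.ServiceRuntime

/// Static gateway: application code asks this module, never RoleEnvironment directly.
module HostingEnvironment =
    let isRunningOnAzure () =
        // IsAvailable is true both in the cloud and in the local Development Fabric.
        RoleEnvironment.IsAvailable

// Usage when wiring up dependencies at startup, reusing the hypothetical IFileStore sketch above:
// let fileStore : IFileStore =
//     if HostingEnvironment.isRunningOnAzure ()
//     then upcast BlobStore("<storage connection string>")
//     else upcast SharedFolderStore(@"\\fileserver\share")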

This was a quick rundown of one path towards the cloud. Obviously, there is much more to be said, especially with respect to the ‘taking it to the cloud’ step. I’ll save that for another day.


Packaging a Web Site for Windows Azure

1 Comment »

I am spending a few days participating in the “Bringing Composite C1 to Windows Azure” workshop this week.

I didn’t know Composite C1 beforehand, but so far it has been a pleasure to get to know it.

The only major hiccup we encountered on the first day was that the Composite C1 application is not a Web Application Project in the Visual Studio sense, but is instead a Web Site.

Windows Azure web roles are by default Web Application Projects, so we set out to convert Composite C1 to a Web Application Project while discussing whether it is possible to deploy a Web Site to Windows Azure.

Since Windows Azure is just Windows 2008 with IIS 7.0 I figured it should be possible to run a Web Site on Windows Azure, but whether we could get the management services to deploy the Web Site in the first place was another matter.

Coincidentally, Steve Marx recently wrote a blog post on manually packaging up an application for deployment to Windows Azure, so in this blog post I will attempt to deploy a Web Site to Windows Azure using a manual packaging approach. I will be using a generic web site instead of Composite C1 since using Composite C1 would probably cause some unrelated problems to surface.

So, I start out by creating a new Web Site:

[Screenshot: creating a new Web Site in Visual Studio]

Next, I need to create a service definition which will tell Windows Azure what my web site looks like. For now, I will define a service with a single web role. So, I create a new file called ServiceDefinition.csdef:

[Screenshot: adding the ServiceDefinition.csdef file]

and I fill in some basic parameters:

[Screenshot: ServiceDefinition.csdef with basic parameters filled in]

Now, I could go ahead and package up the application and deploy it. However, if I do this, the web role will be extremely sick and throw exceptions saying “Unrecognized attribute ‘targetFramework’”, referring to the targetFramework=”4.0” attribute in web.config.

Figuring out why this happens and what to do about it will require some further investigation. For now I just go ahead and delete the attribute. This also means that I need to delete the “using System.Linq;” statements in Site.Master, Default.aspx.cs and About.aspx.cs.

To package the application I need to use the cspack.exe that comes with the Windows Azure SDK. I’ve added the SDK’s bin directory to my PATH, so I can go ahead and package the application:

[Screenshot: packaging the application with cspack]

This is a pretty long command, so I’ll repeat it here for convenience:

cspack ServiceDefinition.csdef /role:MyWebRole;WebSite1 /copyOnly /out:MyWebSite.csx /generateConfigurationFile:ServiceConfiguration.cscfg

Windows(R) Azure(TM) Packaging Tool version 1.2.0.0 for Microsoft(R) .NET Framework 3.5
Copyright (c) Microsoft Corporation. All rights reserved.

c:\AzureWebSiteTest>

 

The cspack application certainly doesn’t seem very verbose. Anyway, I can now go ahead and deploy the web site to the Azure Development Fabric using another tool from the SDK, csrun:

[Screenshot: deploying to the Development Fabric with csrun]

And presto! I now have a web site running on the Development Fabric:

[Screenshot: the web site running on the Development Fabric]

To actually deploy the web site to the cloud you need to create a proper deployment package. To do this, leave out the /copyOnly flag from the packaging command:

[Screenshot: packaging the application for cloud deployment]

Again, I’ll repeat the command:

cspack ServiceDefinition.csdef /role:MyWebRole;WebSite1 /generateConfigurationFile:ServiceConfiguration.cscfg

This will generate a file called ServiceDefinition.cspkg that you can upload through the Windows Azure Portal along with the ServiceConfiguration.cscfg.

Once the Fabric Controller has done its thing we have a web site in the cloud:

[Screenshot: the web site running in the cloud]

I had to cut some corners in the process, but at least this shows that the web site model _can_ run on Windows Azure.


Resetting the Development Fabric deployment counter

1 Comment »

When you are working with Windows Azure on your local machine, each time you deploy your application to the Development Fabric that particular deployment will receive a unique name along the lines of

deployment(<deploymentnumber>)

Here <deploymentnumber> is a number that is incremented each time you deploy an application to Development Fabric. This may start to look ridiculous after a while, so you may want to reset the counter.

The Development Fabric uses a temp folder,

C:\Users\<your name>\AppData\Local\dftmp,

for storing deployed applications.

To reset the deployment counter go to

C:\Users\<your name>\AppData\Local\dftmp\s0

and delete all the previous deployments. Next, use Notepad to open the file

C:\Users\<your name>\AppData\Local\dftmp\_nextid.cnt

and reset the sole value in that file to 0.

Voila!
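If you find yourself doing this often, the two steps can be scripted as well. Here is a small F# sketch of my own that automates the manual procedure above; run it at your own risk and only while the Development Fabric is shut down:

open System
open System.IO

// C:\Users\<your name>\AppData\Local\dftmp
let dftmp =
    Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "dftmp")

// Delete all previous deployments under s0.
for deployment in Directory.GetDirectories(Path.Combine(dftmp, "s0")) do
    Directory.Delete(deployment, true)

// Reset the deployment counter.
File.WriteAllText(Path.Combine(dftmp, "_nextid.cnt"), "0")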


Unit testing against Windows Azure Development Storage

1 Comment »

When developing an application targeted for Windows Azure, your application will be executing in an environment to which you have somewhat limited access, making debugging hard. Thus, you will probably want to test your application as much as possible before deploying it. Enter unit testing.

To be able to execute unit tests against the Azure storage (well, maybe such tests are more like integration tests), you need to have the Azure Development Storage running when the tests are executed. To achieve this, you will need to do two things:

  1. Make sure the development storage is available
  2. Make sure the development storage is running before testing begins

The first task is taken care of by installing the Windows Azure SDK. If you are developing Windows Azure applications, you will probably already have this installed on your computer. However, if you plan to run your tests on a dedicated build server, you may need to install the SDK on that machine.

For the second task, you need to start the Development Storage prior to executing tests. To start the Development Storage you can use the CSRun utility from the Azure SDK. If you have installed the SDK to the default location, you will find csrun.exe in C:\Program Files\Windows Azure SDK\v1.0\bin:

[Screenshots: locating csrun.exe in the SDK bin directory and starting the Development Storage with csrun /devstore:start]

If you want to integrate this task with your build scripts, you can of course create a NAnt target for it:

<property name="azure.sdk.csrun.exe" value="C:\Program Files\Windows Azure SDK\v1.0\bin\csrun.exe"/>

<target name="azure.devstorage.start" description="Starts the Azure Development Storage">
    <exec program="${azure.sdk.csrun.exe}"
        commandline="/devstore:start">
    </exec>
    <sleep seconds="5"/><!-- Sleep 5 seconds, waiting for the development storage to start -->
</target>

As you can see, I have added a delay in order to allow the storage to fully start before subsequent tasks are executed.

Now just call this target before executing your unit tests. You may also want to shut down the Development Storage upon completion of your tests; use csrun /devstore:shutdown to do this.
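If you would rather control this from the test code than from the build script, something along the following lines should also work. This is a sketch of my own, assuming NUnit 2.x and the default SDK installation path:

open System.Diagnostics
open NUnit.Framework

[<SetUpFixture>]
type DevelopmentStorageFixture() =
    let csrun = @"C:\Program Files\Windows Azure SDK\v1.0\bin\csrun.exe"

    let runCsRun (arguments : string) =
        use proc = Process.Start(csrun, arguments)
        proc.WaitForExit()

    // Runs once before any test in the enclosing namespace.
    [<SetUp>]
    member this.StartDevelopmentStorage() = runCsRun "/devstore:start"

    // Runs once after all tests have completed.
    [<TearDown>]
    member this.ShutDownDevelopmentStorage() = runCsRun "/devstore:shutdown"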

Note that if your project is the first project to use Development Storage on the build server, you will want to initialize the database underlying Development Storage before running tests. Initializing the Development Storage will create a database called something like DevelopmentStorageDb20090919 in the local SQL Server (Express) instance.

[Screenshots: running DSInit and the resulting DevelopmentStorageDb database in the local SQL Server Express instance]

You just need to do this once. However, if you feel like it, you can force DSInit to recreate the database as part of your test procedure by issuing the command dsinit.exe /forceCreate. DSInit requires administrator privileges, though.


How do I parse a csv file using yacc?

No Comments »

Parsing csv files… it’s tedious, it’s ugly and it’s been around forever. But if you’re working with legacy systems from before the rise of XML, chances are you will have to handle csv files on a regular basis.

Parsing csv files seems like a task perfectly suited for a standard library or framework. However, this ostensibly easy task is apparently not quite that easy to tackle in a generic way. Despite a substantial effort, I haven’t been able to find a library that I find widely applicable and pleasant to work with. Microsoft BizTalk has some nice flat-file XSD extensions, but enrolling BizTalk on each of your projects involving csv files is hardly a palatable approach. A more light-weight approach might be the Linq2CSV project on CodeProject. This project allows you to define a class describing the entities in each line of the csv file. Once this is done, the project provides easy means for reading the data of a csv file and populating a list of objects of the class just defined. If the input doesn’t conform to the expected format, an exception will be thrown containing descriptions and line numbers of the errors encountered. This seems like a really nice approach and I will probably be using it on a few small projects in the near future.

However, the proper way of parsing is of course to bring out the big guns: yacc (Yet Another Compiler Compiler). As its name suggests, yacc is intended for tasks much more complex than parsing a bunch of one-liners. Yacc is a code generation tool for generating a parser from a context free grammar specification. The generated parsers are of a type known as LR parsers and are well suited for implementing DSLs and compilers. Yacc even comes in a variety of flavors, including implementations for C, C#, ML and F#.

Below, I will show you how to parse a csv file in a sample format, generating a corresponding data structure. I will be using the F# yacc implementation (fsyacc) which comes with the F# installation (I’m using F# 1.9.6.2).

The parsing process

When we’re parsing text, the ultimate goal is to recognize the structure of a string of incoming characters and create a corresponding datastructure suitable for later processing. When designing compilers, the datastructure produced by the parser is referred to as an abstract syntax tree, but I will just refer to it as the datastructure.

Parsing the incoming text will fall in these steps:

[Diagram: the parsing process, in which the lexer turns incoming characters into tokens and the parser turns tokens into the datastructure]

The parser relies on a lexer to turn the incoming string of characters into a sequence of tokens. Tokens are characterized by their type and some may carry annotations. Thus, a token of type INT, representing an integer in the input stream, will have the actual integer value associated to it, because in most cases we will be interested in the particular value, not just the fact that some integer was present in the input. On the other hand, a token of type EOL, representing an end-of-line, will not carry any extra information.

We will not cover the details of the lexer in this posting.

The data format

The data we will be parsing will look like this:


328 15 20.1
328 13 11.1
328 16 129.2
328 19 4.3

Each line contains two integers followed by a decimal value, separated by whitespace. This is not a traditional csv format, since the values are separated by whitespace rather than commas. However, it is straightforward to adapt a lexer tokenizing the format above to a lexer processing a more traditional format.

The dataset represents the result of measuring various health parameters of a person. The first integer of each line identifies a person. The next integer identifies a parameter and the decimal represents the value measured. Thus, if parameter 15 is BMI (body mass index), the first line above states that user 328 has a BMI of 20.1.

To parse this data format, we will need the following tokens:

Token   Description                   Annotation
INT     An integer                    The integer value
FLOAT   A decimal                     The decimal value
EOR     End of record (end of line)   -
EOF     End of input (end of file)    -
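The lexer itself is outside the scope of this post, but to give an idea of where these tokens come from, an fslex specification producing them could look roughly like the sketch below. This is my own sketch, not the Lexer.fsl actually used; details such as the lexeme helper and the decimal separator depend on your fslex/F# version and your culture settings:

{
open Parser   (* the token types INT, FLOAT, EOR and EOF are generated by fsyacc *)
open Lexing
}

let digit = ['0'-'9']

rule tokenize = parse
| [' ' '\t']          { tokenize lexbuf }
| ('\r' '\n' | '\n')  { EOR }
| digit+ '.' digit+   { FLOAT (System.Double.Parse(lexeme lexbuf)) }
| digit+              { INT (System.Int32.Parse(lexeme lexbuf)) }
| eof                 { EOF }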

The datastructure

To represent the data, we will use a very simple F# datastructure:

[Diagram: the DataSet datastructure, a list of Line records]

We define the datastructure in F# like this:

module Ast =
    type Line = { userid : int; parameterid : int; value : float; }
    type DataSet = DataSet of Line list

The context free grammar

For yacc to be able to generate a parser which makes sense of the data format described above, we need to provide yacc with instructions on how to convert a sequence of tokens into the components of the datastructure. We do this in the form of a context free grammar and a set of associated semantic actions. To understand these concepts, let’s have a look at the context free grammar we will actually provide to yacc:

DataSet  → LineList
LineList → Line
         | LineList EOR Line
Line     → INT INT FLOAT
Each line in the grammar is traditionally called a production, because it states that the term on the left side of the arrow may be expanded into whatever is on the right side of the arrow. The terms on the left are called non-terminals, because they may be expanded into their constituents, namely the elements on the right. In the grammar above, DataSet, LineList and Line are non-terminals. On the other hand, no productions exist expanding EOR, INT or FLOAT. Thus, these elements are said to be terminals. They are the tokens which the lexer may provide.

The fourth production above states that the concept of a Line consists of two consecutive INT tokens and a FLOAT token, in that order. The second and third productions combined state that a LineList is either a Line or consists of a LineList followed by an EOR token and a Line. Thus, if two Line elements separated by an EOR token have been identified by the parser, it may consider this to be a LineList, since the first Line is a LineList by the second production while this sequence of a LineList followed by an EOR token and a Line is itself a LineList by the third production.

You should remember that while the terms in the context free grammar are intimately related to the elements in our datastructure, these concepts are not the same. Also note that we had to introduce the recursively defined LineList element in the grammar to accommodate the concept of a list of elements.

If you’ve never encountered context free grammars before, a more thorough introduction than what I have provided may be desirable. In this case, you may want to consult Wikipedia.

Semantic actions

The datastructure is constructed in a bottom-up fashion by executing a piece of code each time a production is applied. The piece of code is called the production’s semantic action. For the

Line → INT INT FLOAT

production, we create a corresponding Ast.Line instance (cf. the "The datastructure" section above). These are the semantic actions we will need:

Production                    Semantic action                                  Description
DataSet  → LineList           DataSet($1)                                      Create an Ast.DataSet, passing the Ast.Line list to the constructor
LineList → Line               [$1]                                             Create an Ast.Line list containing a single element
         | LineList EOR Line  $3 :: $1                                         Concatenate the Ast.Line to the list of the LineList
Line     → INT INT FLOAT      { userid = $1; parameterid = $2; value = $3; }   Create a new Ast.Line, assigning the first integer of the line to
                                                                               Ast.Line.userid, the second integer to Ast.Line.parameterid and
                                                                               the float to Ast.Line.value

As you have probably guessed, the $x variables in the fourth semantic action refer to the values of the INT and FLOAT tokens.

When specifying the semantic action for a production to fsyacc, you enclose the appropriate piece of code in braces after the production. Thus, our parser specification will look like this:

%{
open RI.Statistics.Ast
%}

%start start
%token <System.Int32> INT
%token <System.Double> FLOAT
%token EOR
%token EOF
%type <RI.Statistics.Ast.DataSet> start

%%

start: DataSet { $1 }

DataSet: LineList { DataSet($1) }

LineList: Line { [$1] }
| LineList EOR Line { $3 :: $1 }

Line: INT INT FLOAT { { userid = $1; parameterid = $2; value = $3; } }

Generating and exercising the parser

To generate the parser, you run yacc from the command line, passing the name of the file containing the parser specification above as an argument. For fsyacc, we get:



C:\Users\rui\Projects\Statistics\Parser>fsyacc Parser.fsp --module Parser
building tables
computing first function...time: 00:00:00.1604878
building kernels...time: 00:00:00.1359233
building kernel table...time: 00:00:00.0407968
computing lookahead relations.............time: 00:00:00.0723062
building lookahead table...time: 00:00:00.0439270
building action table...time: 00:00:00.0673112
building goto table...time: 00:00:00.0099398
returning tables.
10 states
5 nonterminals
7 terminals
6 productions
#rows in action table: 10

C:\Users\rui\Projects\Statistics\Parser>

This produces two files, Parser.fs and Parser.fsi, which contain the implementation of the parser. We will include them when we compile the parser.

To test the parser, we create a console application which will parse the sample data presented earlier and print the resulting datastructure:

#light
open Lexing
open Parser

let x = @"328 15 0,0
  328 13 11,1
  328 16 129,2
  328 19 4,3"

let parse() =
  let myData =
    let lexbuf = Lexing.from_string x in
      Parser.start Lexer.Parser.tokenize lexbuf in
  myData

let data = parse()
printfn "%A" data

Compiling this with the generated parser and executing the resulting console application results in this:



C:\Users\rui\Projects\Statistics\ConsoleApplication\bin\Debug>ConsoleApplication.exe
DataSet [{userid = 328; parameterid = 19; value = 4.3;};
         {userid = 328; parameterid = 16; value = 129.2;};
         {userid = 328; parameterid = 13; value = 11.1;};
         {userid = 328; parameterid = 15; value = 0.0;}]

C:\Users\rui\Projects\Statistics\ConsoleApplication\bin\Debug>

Presto, we have our datastructure! And with only a minimum amount of code!

 
So, is this really an appropriate approach for parsing csv files? Well, no, not quite. Even though the procedure described above is rather straightforward, there’s no error reporting facility, making it inappropriate for anything but a prototype application. Thus, for parsing csv files, the aforementioned Linq2CSV project, or something similar, will probably give us much more initial headway than the yacc approach. But the yacc approach scales very well with the complexity of the input, so it may become the more attractive option as the complexity of the input increases.

UPDATE: It has come to my attention that Robert Pickering, in the second edition of his book on F#, intends to include a special section on parsing text files using other means than fsyacc. Thus, if you’re reading this posting with the intention of actually using the method described above for production, you may want to consult Robert Pickering’s book for alternatives.


A gotcha when using fslex with #light syntax

No Comments »

Tonight I’ve been implementing a small lexer/parser pair in order to be able to read data from a csv-like file and process the data using F# Interactive. I originally chose F# over C# because F# fits well with the data processing I’m doing. I expected to use a library written in C# for loading the data, but decided to use fslex and fsyacc instead, just for the fun of it. However, I came across a problem with fslex (or my understanding of fslex) and I thought I’d share the solution:

I want all the code for loading the data to go into a module named Parser. Therefore, my Lexer.fsl begins with the following code:

{
module Parser =
  open System
  open Lexing
  open Parser
}

This section of the lex specification is traditionally called the definition section and contains initial code I want copied into the final lexer. The code produced by fslex will look something like

module Parser =
  open System
  open Lexing
  open Parser

# 8 "Lexer.fs"
let trans : uint16[] array =
[|
(* State 0 *)

(* ...lots of code... *)

let rec __fslex_dummy () = __fslex_dummy()
(* Rule tokenize *)
and tokenize (lexbuf : Microsoft.FSharp.Text.Lexing.LexBuffer<_>) = __fslex_tokenize 0 lexbuf
and __fslex_tokenize __fslex_state lexbuf =
match __fslex_tables.Interpret(__fslex_state,lexbuf) with
| 0 -> (
# 14 "Lexer.fsl"
tokenize lexbuf
# 50 "Lexer.fs"
)
| 1 -> (
# 15 "Lexer.fsl"
EOR
# 55 "Lexer.fs"

(* ...more code... *)

However, when trying to build the generated lexer, the compiler would complain that “the value or constructor ‘EOR’ is not defined”. This had me stumped for a while until I realized that the code generator wasn’t indenting the code properly. The

let rec __fslex_dummy () = ...

wasn’t indented at all, thus the

open System
open Lexing
open Parser

statements, which were indented to be part of the Parser module, weren’t in scope anymore. To fix this, I had to fall back to the more verbose syntax

module Parser = begin
open System
open Lexing
open Parser

(* code code code *)

end

This made the compiler concur.
Just like you can add any valid F# code to the definition section of the lexer specification, you can add a similar section, called the user subroutines section, to the end of the file. fslex will copy it to the end of the generated code. Thus, we can have fslex generate the desired code by changing the definition section of the lexer specification to

{
module Parser = begin
open System
open Lexing
open Parser
}

and adding a closing section like

{
end
}

I guess the code generator ought to have handled the problem by indenting the generated code properly, but this is just a CTP, so hopefully it will be fixed in the future.


F# and the Euler project

2 Comments »

At JAOO 2008 I attended the presentation “Learning F# and the Functional Point of View” by Robert Pickering. It’s been quite a while since I’ve done functional programming (I think it was when we implemented a small compiler in SML at DIKU a few years back), so this was a nice opportunity to revisit the functional paradigm. Since “real men use the stack” and since functional programming and its immutable datastructures will really come into their own when we’re talking concurrent applications and the plethora-of-processors machines of the future, it is about time we start considering using functional programming in mainstream applications.

All of the above led me to pick up F# and a book (“Foundations of F#”, also by Robert) this past weekend. In order to get a real feel for the language I decided to tackle some of the problems at Project Euler while reading the book. Thus, without further ado, here’s Project Euler’s Problem #3 and my solution using F#:

Problem 3:
The prime factors of 13195 are 5, 7, 13 and 29.
What is the largest prime factor of the number 600851475143 ?

Solution using F#:

(comments below)

let doesNotContainAsFactors (p:bigint) xs =
    (List.for_all (fun a -> not (p % a = 0I)) xs)

let rec nextPrime (x:bigint) primes =
    if doesNotContainAsFactors x primes then
        x, x::primes
    else
        nextPrime (x+1I) primes

(* Find largest prime factor in x *)
let largestPrimeFactor x =
    let rec innerLargestPrimeFactor (x:bigint) primesLessThanCandidate candidate =
        if candidate = x then
            x
        else
            match x % candidate with
            | 0I -> innerLargestPrimeFactor (x / candidate) primesLessThanCandidate candidate
            | _ ->
                match (nextPrime candidate primesLessThanCandidate) with
                | (newPrime, newPrimesList) ->
                    innerLargestPrimeFactor x newPrimesList newPrime
    innerLargestPrimeFactor x [] 2I

First, we define a function

doesNotContainAsFactors : bigint -> BigInt list -> bool.

This function takes an integer p and a list of integers xs (as you can see from the definition they are really bigints, but for convenience I’m going to continue to refer to them as integers). The function returns true if no element of xs is a divisor of p. Thus, if xs contains all primes smaller than p then doesNotContainAsFactors will return true exactly when p is a prime.
The next function to consider is

nextPrime : bigint -> BigInt list -> bigint * bigint list.

This function takes an integer x and a list primes of integers. When primes contains all the primes less than x, the function uses doesNotContainAsFactors to find the first prime greater than or equal to x. It returns this prime along with a list of all primes less than or equal to the newly found prime.
Next comes the main function,

largestPrimeFactor : bigint -> bigint

which is to take an integer x and return its largest prime factor. It does this by testing whether 2 divides x. If it doesn’t, nextPrime is used to find the next prime and the test is repeated. When a prime divides x, we factor out the prime by repeating the test on the quotient. Since the primes we test are continually getting larger, we will eventually have factored out all but the largest prime divisor. Once any multiple occurrences of this largest prime divisor have been factored out, the candidate prime we are trying to divide into x will equal x. Thus, once we recognize that this is the case, we return x.

So, what is the largest prime factor of 600851475143? Well, we evaluate largestPrimeFactor 600851475143I and see that the answer is 6857.
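For reference, this is roughly how the evaluation looks in F# Interactive (the exact formatting of the result varies between F# versions):

> largestPrimeFactor 600851475143I;;
val it : bigint = 6857I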

I am not entirely happy with the performance of the functions above. Finding the largest prime of 600851475143 does take a couple of seconds on my laptop. I haven’t analyzed the complexity of the code above, but it sure doesn’t feel right. That might be a topic for a future post. Also, comments are welcome :-)