Designing The Lexical Analyzer

Introduction

As part of the nGineer suite, there was a need to use both a lexical analyzer and a grammatical parser, neither of which were implemented in the .NET Framework, so they had to be written. This article explains the main design of the lexical analyzer as a document to aid those intending to read the code or just learn about the lexical analyzer.

Goals

When I first went to design the lexical analyzer, the main goal I had in mind was to make it as simple as possible. During the process, I had the KiSS principle in my mind. I wanted the most easily understandable code and design.
I decided to separate the lexical analyzer from the grammatical parser in order to more easily leverage its functionality as a tool that is able to disassemble and ‘understand’ the text that makes up the C# code file.

How Lexical Analyzers Work

Every lexical analyzer is made up of a state machine. The main goal is to find the best (i.e. most characters / longest) matching lexical element that fits to be the next in line.

Take for example the sentence:

All your base are belong to us

We have the following available lexical elements: “All” “All your” “your base” “all your base” “are” “to us” “are belong to”, “us”, “us!” and space (” “).
Let’s decide what the first element is. It can be “All” or “All your” (“all your base” may be long and looks like it matches, but we’re working on this character by character and “A” is not the same as “a”). We’ll decide on “All your” because it is the longest match. The second element can be “are” or “are belong to”. Working with the longest match principle again, we’ll select “are belong to”. The third match will be “us” and not “us!” because it is not a part of the analyzed text, even tough it’s longer.

Top-Level Design

Most functional lexical analyzers are just made up of a set of functions, trying to discover what the next character is. Most object oriented lexical analyzers are the same as the functional ones, just wrapped in a class. I decided that nGineer’s lexical analyzer will be made up of self creating classes that will make up an object grid which in turn describes the text. Each lexical element got its own class and I used inheritance to express the “is a” factor (in the coming example: pp-equality-expression is a pp-and-expression -> PPEqualityExpression derives from PPAndExpression).

Let’s take the pp-and-expression for instance:

(Common means of describing lexical grammar are LL(1), LALR, etc. ECMA use a different means that is described in Chapter 5 of the C# standard. We’ll use that notation in this article.)
pp-and-expression::
pp-equality-expression
pp-and-expression whitespace(opt) && whitespace(opt) pp-equality-expression

the code generated from this will be:

using System;
namespace nGineer.LexicalGrammar
{
/// <summary>
/// A.1.10
/// pp-and-expression::
///		pp-equality-expression
///		pp-and-expression whitespace(opt) && whitespace(opt) pp-equality-expression
/// </summary>
public abstract class PPAndExpression : PPOrExpression, ILexicalElement
{
#region ComplexAndExpression
		public class ComplexAndExpression : PPAndExpression, ILexicalElement
{
internal const string AndOperator = "&&";
private PPAndExpression m_AndExpression;
private WhiteSpace m_PreOperatorWhiteSpace = null;
private WhiteSpace m_PostOperatorWhiteSpace = null;
private PPEqualityExpression m_EqualityExpression;
protected ComplexAndExpression()
{
}
internal static new ComplexAndExpression Create(VirtualString codeSegment, PPDeclarationRequestEventHandler requestEventHandler)
{
ComplexAndExpression newObject = null;
if (codeSegment.Contains(AndOperator))
{
codeSegment.SetCheckPoint();
VirtualString tempString = codeSegment.SubstringLast(AndOperator);
int oldLength = tempString.Length;
PPAndExpression andExpression = PPAndExpression.Create(tempString, requestEventHandler);
if (andExpression != null)
{
codeSegment.Advance(oldLength - tempString.Length);
WhiteSpace preOperatorWhiteSpace = WhiteSpace.Create(codeSegment);
if (codeSegment.Length >= AndOperator.Length &&
codeSegment.IndexOf(AndOperator, 0, AndOperator.Length) == 0)
{
codeSegment.Advance(AndOperator.Length);
WhiteSpace postOperatorWhiteSpace = WhiteSpace.Create(codeSegment);
PPEqualityExpression equalityExpression = PPEqualityExpression.Create(codeSegment, requestEventHandler);
if (andExpression != null)
{
newObject = new ComplexAndExpression();
newObject.m_AndExpression = andExpression;
newObject.m_PreOperatorWhiteSpace = preOperatorWhiteSpace;
newObject.m_PostOperatorWhiteSpace = postOperatorWhiteSpace;
newObject.m_EqualityExpression = equalityExpression;
codeSegment.AcceptChanges();
}
else
{
codeSegment.RejectChanges();
}
}
else
{
codeSegment.RejectChanges();
}
}
else
{
codeSegment.RejectChanges();
}
}
return newObject;
}
#region <snip />
			ILexicalElement[] ILexicalElement.ChildElements
{
get
{
return new ILexicalElement[] {
m_AndExpression,
m_PreOperatorWhiteSpace,
new Constant("AndOperator", AndOperator),
m_PostOperatorWhiteSpace,
m_EqualityExpression
};
}
}
#endregion

public override bool Evaluate()
{
return (m_AndExpression.Evaluate() && m_EqualityExpression.Evaluate());
}
public PPAndExpression AndExpression
{
get
{
return m_AndExpression;
}
}
public WhiteSpace PreOperatorWhiteSpace
{
get
{
return m_PreOperatorWhiteSpace;
}
}
public WhiteSpace PostOperatorWhiteSpace
{
get
{
return m_PostOperatorWhiteSpace;
}
}
public PPEqualityExpression EqualityExpression
{
get
{
return m_EqualityExpression;
}
}
}
#endregion

protected PPAndExpression()
{
}
internal static new PPAndExpression Create(VirtualString codeSegment, PPDeclarationRequestEventHandler requestEventHandler)
{
codeSegment.BeginPossibilityCheck();
codeSegment.BeginPossibility();
codeSegment.EndPossibility(PPEqualityExpression.Create(codeSegment, requestEventHandler));
codeSegment.BeginPossibility();
codeSegment.EndPossibility(ComplexAndExpression.Create(codeSegment, requestEventHandler));
return codeSegment.EndPossibilityCheck() as PPAndExpression;
}
#region <snip />
		ILexicalElement[] ILexicalElement.ChildElements
{
get
{
throw new InvalidOperationException("Can not get child elements for abstract classes");
}
}
#endregion
	}
}

The PPAndExpression object can be either a PPEqualityExpression or a combination of a PPAndExpression object, an optional WhiteSpace object, the text “&&”, another optional WhiteSpace object and a PPEqualityExpression object. Failing to find either will mean that the text we’re looking at is not a pp-and-expression.
Note the Create method on both classes which looks to see if there’s matching text coming up then returns a new object if it does.

The Virtual String

Text in .NET is represented using the string class. I wanted to be able to go back and forth on the string without changing it, so I initially created new instances of the string for each call to a Create method. This made the analyzer horribly sluggish.
This is when I came up with the VirtualString, which contains one string instance and an index that can be moved (SetCheckPoint, AcceptChanges, RejectChanges, etc.). It also has a built in mechanism for deciding which element is better (longer) when selecting between several possibilities (BeginPossibility, EndPossibilityCheck, etc.), as was the case with the “All” and “All your” possibilities.

The Pre-processing Directives Problem

Using the design I formed, I came across the problem that this won’t be analyzed correctly:

#define kek
#if kek
#define keke
#endif
#if keke
#endif

This is correct and would render keke as an input section and not a skipped section. This would not be recognized unless there was some table of symbols. But wait a second, what if we were to create a symbol, but it was not a real symbol, but just an option to be discarded?
So what I did was put a list of them in each input section and allow requests to be made using delegates that were propagated to the classes below (see the PPAndExpression example’s Create method).

Walkers

There was a need for a mechanism for walking over the elements’ object grid once it was finished, to use the lexical analyzer to its fullest extent. This is where walkers come into play.
The walkers’ design is simple. Simply implement the base class WalkerBase and add code that does whatever you want whenever a certain element is stumbled upon. There are three basic implementations that come with nGineer – CodeWalker (persists the code back to a file), XmlWalker (which serializes the object grid to Xml) and HighlightedCodeWalker (on which webify is based). There’s more about walkers in the next article, “Implementing A Lexical Walker”.

Using The Lexical Analyzer

Using the lexical analyzer is simple:

nGineer.LexicalGrammar.Input input = new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(cSharpCode, defs);

Where cSharpCode is a string containing the code and defs is an array of pre-processor conditional symbols that were defined for the file. It is useful when analyzing a file which has #if DEBUG in it, etc.

More examples:

nGineer.LexicalGrammar.Input input = new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(cSharpCode, "DEBUG");
nGineer.LexicalGrammar.Input input = new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(cSharpCode, "DEBUG", "SOMETHING_ELSE");
nGineer.LexicalGrammar.Input input = new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(myFile.ReadToEnd(), "DEBUG", "SOMETHING_ELSE");

Please remember that pre-processing conditional symbols are case sensitive.

Finally

If you have any questions or you think I have left something out, please contact me.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s