CS 382: Special Topics - Compilers
Fall 2004

Unit Testing & ANTLR: The Scanner

Unit Testing a Scanner

The first step in writing your scanner is determining what the lexical tokens are in your language. In my spreadsheet language, I have to worry about several types of information:

  • integers (e.g., 1234)
  • operators (e.g., =, +, -, etc.)
  • names of cells (e.g., A23, AR9, etc.)

This is a fairly simple language, but it turns out that there are still quite a few lexical tokens for me to recognize.

I start by creating my JUnit test-case class named SpreadsheetLexerTest, but instead of extending the junit.framework.TestCase class, I extend org.norecess.antlr.ANTLRTestCase. This requires me to define three methods: makeLexer(String), makeParser(TokenStream), and makeTreeParser(). Since the latter two are irrelevant when testing a scanner, they can simply return null.

I do need a good definition for makeLexer(String). Here's what I come up with:

protected TokenStream makeLexer(String input) {
    return new SpreadsheetLexer(new StringReader(input));
}

This constructor is currently undefined because I haven't started my ANTLR lexer yet. When I get it going, ANTLR will create a constructor for me that receives a java.io.Reader; I provide a Reader made from a string.
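To see what "a Reader made from a string" gives the lexer, here is a minimal sketch that uses only the standard library. The class StringReaderDemo and its drain method are hypothetical names of my own; they are not part of ANTLR or the ANTLR-testing library. The sketch pulls characters out of a Reader one at a time, which is roughly how a generated lexer consumes its input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of ANTLR or the testing library.
final class StringReaderDemo {
    private StringReaderDemo() {}

    // Wraps a string in a Reader, as makeLexer(String) does for the lexer.
    static Reader readerFor(String input) {
        return new StringReader(input);
    }

    // Reads the Reader one character at a time until end of input (-1).
    static String drain(Reader reader) {
        StringBuilder seen = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                seen.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen.toString();
    }
}
```

Draining readerFor("A1+5") hands back the same four characters, in order, that the lexer would see.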

So now I create my lexer. I open a file named spreadsheet.g to write my lexer. First, I start with a header:

header {
package edu.calvin.compilers.spreadsheet;
}

This sets the Java package where the ANTLR output will be found.

Then I can start my lexer right after this:

class SpreadsheetLexer extends Lexer;
options {
    charVocabulary='\0'..'\377' | '\u1000'..'\u1fff';
    testLiterals=false;
}

The first line here creates a class named SpreadsheetLexer as a lexer. The first option sets the valid characters in my language, and the second says that literals (i.e., keywords) will not be tested (at least not yet).

ANTLR will be unhappy with this since there aren't any productions yet. I create a test in SpreadsheetLexerTest to test for my first production:

public void testInteger() throws TokenStreamException {
    assertToken(INTEGER, "123", "123");
}

The ANTLR-testing library defines this assert method. The first argument is the type of token (which will be undefined for now), the second argument is the text that forms this token, and the third argument is the input to scan.

Keep in mind that although we might think of an integer as a Java int, that's a separate conversion we'll have to do. Scanning is done in terms of text. So the second argument here must be a String.
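As a minimal sketch of that separate conversion step, the token text can be turned into an int with the standard Integer.parseInt. The class TokenText and its method toInt are hypothetical names used only for illustration; nothing like them appears in the lexer or in the ANTLR-testing library.

```java
// Hypothetical helper; the lexer itself only ever deals in text.
final class TokenText {
    private TokenText() {}

    // Converts the text of an INTEGER token to a Java int.
    // Integer.parseInt throws NumberFormatException when the text is not
    // a valid int, for example when the digit sequence is too large.
    static int toInt(String tokenText) {
        return Integer.parseInt(tokenText);
    }
}
```

The point of keeping this out of the scanner is that the scanner test compares strings to strings, and any numeric conversion can be tested on its own later.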

INTEGER will be the name of the production I'm about to add; ANTLR creates this as a constant in a Java interface that it produces. In anticipation of this interface, I have my SpreadsheetLexerTest implement the SpreadsheetLexerTokenTypes interface.

Finally, to get all of the compilation errors to disappear, I write my production:

INTEGER
  : 'q'
  ;

This is a dumb definition, but it will allow everything to compile. When I run my tests, I'll get a red bar, proving that my test is being executed. After I get this red bar, I put in a better definition:

INTEGER
  : ('0'..'9')+
  ;

An integer is any sequence of digits; the + ensures that I get at least one digit, and allows for an unbounded number of them. The .. is just a shortcut notation for alternation over a range of characters.
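For comparison, here is roughly what that production means when written by hand in Java. This is only a sketch of the idea, not the code ANTLR generates; the class DigitRun and the method matchInteger are hypothetical names.

```java
// Hand-written illustration of ('0'..'9')+ ; not ANTLR's generated code.
final class DigitRun {
    private DigitRun() {}

    // Returns the longest prefix of input made only of the characters
    // '0' through '9', or null when the input does not start with a digit
    // (the + requires at least one).
    static String matchInteger(String input) {
        int end = 0;
        while (end < input.length()
                && input.charAt(end) >= '0'
                && input.charAt(end) <= '9') {
            end++;
        }
        return end == 0 ? null : input.substring(0, end);
    }
}
```

The range test in the loop plays the role of '0'..'9', and the loop itself plays the role of the plus sign.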

Now when I run my unit tests, I get a green bar.

I can do a little refactoring here. Perhaps I'd like to encapsulate the range of characters that can appear in an integer. I can create this rule:

protected
DIGIT
  : '0'..'9'
  ;

The protected keyword here just means that this production can be used only internally in the lexer. The lexer will never return a DIGIT token. I replace the range in the INTEGER production with DIGIT, and run my tests for another green bar.

I now add a couple more integer assertions to testInteger() just to make sure it's robust enough.

Next, I can move on to recognizing cell addresses. I add this test method to SpreadsheetLexerTest:

public void testAddress() throws TokenStreamException {
    assertToken(ADDRESS, "A2", "A2");
}

Once again, ADDRESS is undefined until I write a production for it. I go with a dummy definition in spreadsheet.g:

ADDRESS
  : 'q'
  ;

I compile and run my test for a red bar. So I fix the definition of an address. An address in a spreadsheet equation consists of letters followed by digits. My test suggests that I need only worry about one letter and one digit, so:

ADDRESS
  : ('A'..'Z') DIGIT
  ;

I run my unit tests for a green bar. My unit tests aren't complete, but first I want to do some more lexer refactoring. Using a range of characters for the letters seems odd when I use DIGIT for the digit range, so I can create another protected production:

protected
LETTER
  : 'A'..'Z'
  ;

And I change the definition of ADDRESS appropriately. I run my unit tests for another green bar.

Now I fix the limitations of my addresses. In particular, this will give me another red bar:

assertToken(ADDRESS, "ZA2", "ZA2");

I add this to testAddress(), and run my unit tests for the red bar. To get my green bar, I could modify the definition of ADDRESS to accept one or two letters, but we get the idea now: at least one, but possibly more. It's again a plus sign:

ADDRESS
  : (LETTER)+ DIGIT
  ;

My unit tests pass again.

Finally, I add another test with more than one digit in an address; I run my unit tests for a red bar. Then I fix ADDRESS one last time, and run my tests for a green bar.
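The final shape I'm aiming for can be sketched as a Java regular expression, assuming the last fix makes an address one or more uppercase letters followed by one or more digits. The class AddressShape and the method isAddress are hypothetical names, and the regular expression is my stand-in for the grammar rule, not something ANTLR produces.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the finished ADDRESS rule.
final class AddressShape {
    private AddressShape() {}

    // Assumed final shape: one or more letters A-Z, then one or more digits.
    private static final Pattern ADDRESS = Pattern.compile("[A-Z]+[0-9]+");

    // True only when the whole text has the shape of a cell address.
    static boolean isAddress(String text) {
        return ADDRESS.matcher(text).matches();
    }
}
```

The two plus signs correspond to the two repetitions: one over LETTER and one over DIGIT.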

My next step in writing my lexer would be to test the operator tokens: +, -, *, /, (, ), and =. These would be single-character tokens, so I'll end up with seven new tests and seven new productions.

Eventually, I may discover that I want to deal with whitespace. That is, instead of forcing my user to type A1+5, my user could write A1 + 5. As with most languages, I don't want this whitespace going to the parser; the lexer should ignore it.

So I add yet another test:

public void testWhitespace() throws TokenStreamException {
    assertToken(INTEGER, "123", "     123");
}

Notice all of the spaces at the beginning of the input. My expected value says those should disappear. Yet, when I run this test, I get a red bar, and the complaint is (basically) that an integer cannot start with a space.

It's easy enough to create a production for whitespace:

WHITESPACE
  : ' '
  ;

Now the test will still fail, but the complaint will be that the next token is a WHITESPACE token, not an INTEGER. The solution in ANTLR is to add an action to this rule:

WHITESPACE
  : ' '
          { $setType(Token.SKIP); }
  ;

The $setType is an ANTLR directive for setting the type of a token. Token is an ANTLR class; SKIP is just one of the predefined token types in the Token class. And the rule is that skipped tokens are not put out on the token stream. The lexer will continue to process the character stream until an unskipped token is found. So the tests should all green bar once again.
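The skipping behavior can be illustrated with a small hand-written loop. This is a sketch of the idea only, assuming an input made of spaces and digit runs; it is not how ANTLR implements Token.SKIP, and the class SkipSketch and the method tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of skipped tokens; not ANTLR's implementation.
final class SkipSketch {
    private SkipSketch() {}

    // Returns the text of each integer in the input. Spaces are consumed
    // but never emitted, the way skipped tokens never reach the token stream.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ') {
                pos++; // "skipped": keep scanning for the next real token
            } else if (c >= '0' && c <= '9') {
                int start = pos;
                while (pos < input.length()
                        && input.charAt(pos) >= '0'
                        && input.charAt(pos) <= '9') {
                    pos++;
                }
                out.add(input.substring(start, pos));
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }
}
```

Given the input from testWhitespace(), the five leading spaces are consumed silently and the only token that comes out is the integer text.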

However, this only tests spaces. There are tabs and form feeds to worry about:

assertToken(INTEGER, "123", "\t\f123");

This test red bars; to fix it, I add the new whitespace characters to the whitespace list:

WHITESPACE
  : ( ' ' | '\t' | '\f' )
          { $setType(Token.SKIP); }
  ;

Lastly, I have to deal with newlines. There are two tricks here: (1) I need to let ANTLR know when the lexer has come across a newline (at least if we want our line counter to be accurate); (2) a newline may be represented three different ways (depending on the operating system that created the file). First, I create this test:

assertToken(INTEGER, "123", "\n123");

It red bars, of course. To fix it, I turn the whitespace production into this monstrosity:

WHITESPACE
  : ( ' ' | '\t' | '\f'
    | ( options { generateAmbigWarnings=false; }
      : "\r\n"  // DOS/Windows
      | '\r'    // Macintosh
      | '\n'    // Unix
      )
      { newline(); }
    )
          { $setType(Token.SKIP); }
  ;

At a high level (the outermost set of parentheses), this says that a whitespace character is a space, a tab, a form feed, or a newline, and that this token should be skipped. The newline subexpression has a few things to explain. First, there's another action associated just with these newlines: newline();. This is a method that the generated lexer inherits from ANTLR's CharScanner class; it keeps track of the line number of the input. At the beginning of this subexpression is an option to turn off the warnings about ambiguous productions. The problem is that both the "\r\n" and '\r' productions start with the same symbol. ANTLR's lexer generator does not like this kind of ambiguity, so it would normally complain about it. But we're certain this is okay, because the ordering of the productions gives "\r\n" the higher priority. So we can safely turn off the ambiguity warnings just for this subexpression.
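The priority of "\r\n" over a lone '\r' matters for the line count, and a hand-written sketch makes that concrete. The class LineCounter and the method countNewlines are hypothetical names; this is an illustration of the ordering, not the code behind newline().

```java
// Hand-written illustration of newline priority; not ANTLR's code.
final class LineCounter {
    private LineCounter() {}

    // Counts line terminators in the input, treating "\r\n" as ONE newline
    // by checking for it before accepting a lone '\r'.
    static int countNewlines(String input) {
        int count = 0;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\r') {
                // DOS/Windows "\r\n" wins over a bare Macintosh '\r'.
                if (pos + 1 < input.length() && input.charAt(pos + 1) == '\n') {
                    pos++;
                }
                count++;
            } else if (c == '\n') {
                count++; // Unix
            }
            pos++;
        }
        return count;
    }
}
```

If the lone '\r' case were tried first, a DOS/Windows line ending would be counted as two newlines and the line numbers in error messages would drift.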

The tests will green bar once again.