CS 382: Special Topics - Compilers Fall 2004
Unit Testing & ANTLR: The Scanner
Unit Testing a Scanner
The first step in writing your scanner is determining what the lexical tokens are in your language. In my spreadsheet language, I have to worry about several types of information:
This is a fairly simple language, but it turns out that there are still quite a few lexical tokens for me to recognize. I start by creating a JUnit test-case class for my lexer.
I do need a good definition for makeLexer():
protected TokenStream makeLexer(String input) {
    return new SpreadsheetLexer(new StringReader(input));
}
This constructor is currently undefined because I haven't started my ANTLR lexer yet. When I get it going, ANTLR will create a constructor for me that receives a Reader.
So now I create my lexer. I open a new grammar file and begin it with a header section:
header {
package edu.calvin.compilers.spreadsheet;
}
This sets the Java package where the ANTLR output will be found. Then I can start my lexer right after this:
class SpreadsheetLexer extends Lexer;
options {
    charVocabulary='\0'..'\377' | '\u1000'..'\u1fff';
    testLiterals=false;
}
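The first line here creates a class named SpreadsheetLexer. The options block sets the range of characters the lexer will accept and turns off literal testing.
ANTLR will be unhappy with this since there aren't any productions yet. I create a test in my test-case class: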
public void testInteger() throws TokenStreamException {
    assertToken(INTEGER, "123", "123");
}
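The ANTLR-testing library defines this assertToken() method: the first argument is the expected token type, the second is the expected text of the token, and the third is the input to scan. Keep in mind that although we might think of an integer as a Java int, the lexer works only with text, so the expected value here is the string "123".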
Finally, to get all of the compilation errors to disappear, I write my production:
INTEGER : 'q' ;
This is a dumb definition, but it will allow everything to compile. When I run my tests, I'll get a red bar, proving that my test is being executed. After I get this red bar, I put in a better definition:
INTEGER
: ('0'..'9')+
;
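An integer is any sequence of digits; the + means one or more. Now when I run my unit tests, I get a green bar.
I can do a little refactoring here. Perhaps I'd like to encapsulate the range of characters that can appear in an integer. I can create this rule:
protected DIGIT : '0'..'9' ;
The protected keyword marks DIGIT as a helper rule: other lexer rules can use it, but it is never returned as a token on its own. I now add a couple more integer assertions to testInteger().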
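To make the "one or more digits" idea concrete, here is a rough hand-written Java equivalent of the INTEGER and DIGIT rules. This is only a sketch: the class and method names are made up for illustration, and the code ANTLR actually generates is more elaborate.

```java
/** Illustrative sketch only; these names are hypothetical, not ANTLR output. */
final class IntegerRuleSketch {

    /** Mirrors the helper rule DIGIT : '0'..'9' . */
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /**
     * Mirrors INTEGER : (DIGIT)+ . Returns the digits found at position
     * start, or null when there is not at least one digit there.
     */
    static String matchInteger(String input, int start) {
        int end = start;
        while (end < input.length() && isDigit(input.charAt(end))) {
            end++;
        }
        // The '+' requires at least one digit, so an empty match is a failure.
        return end > start ? input.substring(start, end) : null;
    }
}
```

Calling matchInteger("123", 0) consumes all three digits, while an input that starts with a non-digit produces no match at all, which is the red-bar case for the 'q' definition.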
Next, I can move on to recognizing cell addresses. I add this test method to my test-case class:
public void testAddress() throws TokenStreamException {
    assertToken(ADDRESS, "A2", "A2");
}
Once again, I start with a dumb definition so that everything compiles:
ADDRESS : 'q' ;
I compile and run my test for a red bar. So I fix the definition of an address. An address in a spreadsheet equation consists of letters followed by digits. My test suggests that I need only worry about one letter and one digit, so:
ADDRESS
: ('A'..'Z') DIGIT
;
I run my unit tests for a green bar. My unit tests aren't complete, but first I want to do some more lexer refactoring. Using a range of characters for the letters seems odd when I use DIGIT for the digits, so I add a matching rule:
protected LETTER : 'A'..'Z' ;
And I change the definition of ADDRESS to use LETTER.
Now I fix the limitations of my addresses. In particular, this will give me another red bar:
assertToken(ADDRESS, "ZA2", "ZA2");
I add this to testAddress() and then change the production:
ADDRESS : (LETTER)+ DIGIT ;
My unit tests pass again. Finally, I add another test with more than one digit in an address; I run my unit tests for a red bar. Then I fix ADDRESS so that it accepts one or more digits.
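The shape an address ends up with (one or more letters followed by one or more digits) can be sketched in plain Java like this. The class and method names are hypothetical, and I am assuming the final production allows any number of digits after the letters.

```java
/** Illustrative sketch only; these names are hypothetical, not ANTLR output. */
final class AddressRuleSketch {

    /** Mirrors LETTER : 'A'..'Z' (uppercase only, as in the grammar). */
    static boolean isLetter(char c) {
        return c >= 'A' && c <= 'Z';
    }

    /** Mirrors DIGIT : '0'..'9' . */
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** True when the whole text is one or more letters, then one or more digits. */
    static boolean isAddress(String text) {
        int i = 0;
        while (i < text.length() && isLetter(text.charAt(i))) {
            i++;
        }
        int letterCount = i;
        while (i < text.length() && isDigit(text.charAt(i))) {
            i++;
        }
        int digitCount = i - letterCount;
        return letterCount > 0 && digitCount > 0 && i == text.length();
    }
}
```

Each red bar in the walkthrough corresponds to one of the loops here: "ZA2" needs the letter loop, and an address with several digits needs the digit loop.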
My next step in writing my lexer would be to test the operator tokens.
Eventually, I may discover that I want to deal with whitespace. That is, instead of forcing my user to type an equation without any spaces, I'd like the lexer to ignore them. So I add yet another test:
public void testWhitespace() throws TokenStreamException {
    assertToken(INTEGER, "123", "   123");
}
Notice all of the spaces at the beginning of the input. My expected value says those should disappear. Yet, when I run this test, I get a red bar, and the complaint is (basically) that an integer cannot start with a space. It's easy enough to create a production for whitespace:
WHITESPACE : ' ' ;
Now the test will still fail, but the complaint will be that the next token is a WHITESPACE token, not an INTEGER. What I really want is for the lexer to throw the whitespace away:
WHITESPACE
: ' '
{ $setType(Token.SKIP); }
;
The $setType(Token.SKIP) action tells the lexer to discard this token and move on to the next one.
However, this only tests spaces. There are tabs and form feeds to worry about:
assertToken(INTEGER, "123", "\t\f123");
This test red bars; to fix it, I add the new whitespace characters to the whitespace list:
WHITESPACE
: ( ' ' | '\t' | '\f' )
{ $setType(Token.SKIP); }
;
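What skipping means for whoever asks for the next token can be sketched with a tiny hand-written lexer in plain Java. Everything here is an assumption made for illustration: the class name, the token type numbers, and the text() accessor are not part of ANTLR or of my generated lexer.

```java
/** Illustrative sketch only; names and type numbers are hypothetical. */
final class SkippingLexerSketch {
    static final int EOF = 0;      // assumed id for end of input
    static final int INTEGER = 1;  // assumed id for an integer token

    private final String input;
    private int pos = 0;
    private String text = "";

    SkippingLexerSketch(String input) {
        this.input = input;
    }

    /** Text of the token most recently returned by nextToken(). */
    String text() {
        return text;
    }

    /** Returns the type of the next token that is not skipped. */
    int nextToken() {
        while (true) {
            if (pos >= input.length()) {
                text = "";
                return EOF;
            }
            char c = input.charAt(pos);
            if (c == ' ' || c == '\t' || c == '\f') {
                // A whitespace token was matched; skipping means it is
                // thrown away and the search for a token starts over.
                pos++;
                continue;
            }
            if (c >= '0' && c <= '9') {
                int start = pos;
                while (pos < input.length()
                        && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
                    pos++;
                }
                text = input.substring(start, pos);
                return INTEGER;
            }
            throw new IllegalStateException("unexpected character: " + c);
        }
    }
}
```

The caller never sees a whitespace token: for the input "\t\f123" the first call to nextToken() already reports an INTEGER whose text is "123".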
Lastly, I have to deal with newlines. There are two tricks here: (1) I need to let ANTLR know when the lexer has come across a newline (at least if we want our line counter to be accurate); (2) a newline may be represented three different ways (depending on the operating system that created the file). First, I create this test:
assertToken(INTEGER, "123", "\n123");
It red bars, of course. To fix it, I turn the whitespace production into this monstrosity:
WHITESPACE
    : ( ' ' | '\t' | '\f'
      | ( options { generateAmbigWarnings=false; }
        : "\r\n" // DOS/Windows
        | '\r' // Macintosh
        | '\n' // Unix
        ) { newline(); }
      )
      { $setType(Token.SKIP); }
    ;
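At a high level (the outermost set of parentheses), this says that a whitespace character is a space, a tab, a form feed, or a newline, and this token should be skipped. The newline expression has a few things to explain. First, there's another action associated just with these newlines: newline() tells ANTLR to advance its line counter. Second, "\r\n" and '\r' both begin with a carriage return, which ANTLR would normally report as an ambiguity; the generateAmbigWarnings=false option silences that warning.
The tests will green bar once again.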
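The newline handling is the part most worth seeing spelled out, so here is a rough plain-Java sketch of it. The names (WhitespaceSketch, Position, skipWhitespace) are hypothetical and exist only for this illustration; the point is that "\r\n" is tried before a lone '\r' so that a DOS/Windows newline bumps the line counter once, not twice.

```java
/** Illustrative sketch only; these names are hypothetical, not ANTLR output. */
final class WhitespaceSketch {

    /** Where the next real token starts, and which line it is on. */
    static final class Position {
        final int offset;
        final int line;

        Position(int offset, int line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Skips spaces, tabs, form feeds, and newlines, counting lines as it goes. */
    static Position skipWhitespace(String input, int start, int line) {
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\f') {
                i++;
            } else if (c == '\r') {
                // Try "\r\n" first so a DOS/Windows newline counts as one line.
                boolean dos = i + 1 < input.length() && input.charAt(i + 1) == '\n';
                i += dos ? 2 : 1;
                line++;
            } else if (c == '\n') {
                i++;
                line++;
            } else {
                break; // first character of a real token
            }
        }
        return new Position(i, line);
    }
}
```

For the test input "\n123", starting on line 1, the sketch reports that the integer begins at offset 1 on line 2.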