Calvin seal CS 382: Special Topics - Compilers
Fall 2004

Unit Testing & ANTLR: The Parser

Truth be told, the testing proposed here for a parser is probably more of a cross between unit and integration testing. We'll be used the scanner to generate tokens for the parser, and strict unit testing would suggest that we hide these details away.

So if anyone complains, you'll know why.

Parsing the tokens from the scanner results in an abstract syntax tree (AST) which more accurately represents the structure of the program. A scanner picks up the individual words, and parsing organizes those words and relates them together in a structure.

Let's start simple with the spreadsheet parser. First, I define a SpreadsheetParserTest class which extends the ANTLRTestCase class. Again, I need to implement the same three methods as with the scanner: makeLexer() will be defined as before; makeParser(TokenStream) receives (actually) a scanner and needs to return a parser; makeTreeParser() can be defined as before.

My makeParser() looks like this:

public TestableParser makeParser(TokenStream lexer) {
  return new SpreadsheetParser(lexer);
}

The TestableParser interface comes from my own ANTLR testing library; it helps ensure that my other classes have the support to do their testing.

Now, in the grammar file, I can start writing my grammar. A parser must be declared after the header and before the lexer. I start my parser declaration like so:

class SpreadsheetParser extends Parser;
options {
  buildAST=true;
        classHeaderSuffix=org.norecess.antlr.TestableParser;
}

The first line looks like the first line of the lexer, just with the substitutions you'd expect. The first option has your parser build an AST. (You might not want an AST if the actions in your grammar did all of the work you needed.) The second option here adds an implements to the source code of SpreadsheetParser; the syntax is odd, but it works.

Now my compiler complains about some methods that SpreadsheetParser should implement. That code is generated by ANTLR, so I can't change the code there. Instead, right after the options, I can write a block of class contents. For example:

{
        private boolean myFailure = false;
        private boolean myQuietErrors = false;
        public void reportError(RecognitionException ex) {
                if (shouldPrintErrors())
                        super.reportError(ex);
                myFailure = true;
        }
        public boolean successfulParse() { return !myFailure; }
        public void atEOF() {
                try {
                        match(EOF);
                } catch (Exception e) {
                        myFailure = true;
                }
        }
        public void setQuietErrors(boolean quietErrors) {
                myQuietErrors = quietErrors;
        }
        private boolean shouldPrintErrors() {
                return !myQuietErrors;
        }
}

This code is all added to SpreadsheetParser by ANTLR.

The first public method here is overriding a method from ANTLR's Parser class. Errors in parsing are reported by this method, and it's the other methods and variables that control when and where this happens.

The other public methods are all required by the TestableInterface. Some parsing errors fail quietly, and my test might green bar when it shouldn't. Also, I'll write some tests where I expect parsing errors. These methods help with that.

Now the code compiles. Finally I can start testing.

I'm going to start simple: recognize an integer literal as an equation. First my test:

public void testIntegerEquation() {
  assertNode(INTEGER, "123", produce("equation", "123"));
}

This doesn't compile because of the INTEGER constant. This constant comes from the SpreadsheetParserTokenTypes interface, and so I have my test-case class implement this interface.

Note the call to produce(). This method is provided by the ANTLRTestCase class. The first argument is the name of the method I want to call to parse the second argument, the input. produce() is written using Java's reflection library to find the equation() method that ANTLR (eventually) writes for me.

assertNode() takes three arguments. The last argument is the root of the AST that produced by the parser. Truth be told, it may not be an honest root; it might have siblings (as you'll see below). The first argument is the expected type of the root node (or first sibling); the second is the expected type of that same node.

I also have to add at least one rule in my grammar, so I add this:

equation
    : EOF
    ;

EOF is a predefined token for the end-of-file. equation is now a non-terminal which produces only EOF. (Incidentally, ANTLR requires non-terminals to be in all lowercase and tokens to be in all uppercase.) It's no coincidence that I've called this non-terminal equation and that I try to invoke a method named equation in my test method.

When I run my tests (since the code now compiles), my new test fails as I expect. To fix it, I add the INTEGER token before the EOF token in the production: an equation should start with an integer and end with the end-of-file.

Strangely, I've fixed the one failure---my integer parses just fine; but two old tests fail in the lexer! By using the INTEGER in the parser, I've thrown off the lexer's indexing of its tokens. I need to use the parser's indexing of all of the tokens because that's the "official" indexing. I fix this by using SpreadsheetParserTokenTypes in both test-case classes. I make the switch in SpreadsheetLexerTest, re-run the tests for my green bar.

I add another assertion to test other integer expressions.

The assertions so far only test the root of the AST that's returned. If the input is an integer, the only node in the AST should be the integer token. In turns out, that's not the case yet. ANTLRTestCase has methods to make this clear:

public void testIntegerSingleton() {
    assertNoNextSibling(produce("equation", "123"));
    assertNoFirstChild(produce("equation", "123"));
}

This tests to make sure that the node returned by the produce() method does not have any children or siblings. But when I run this and get a red bar, I discover that there is a sibling. It turns out that it's the EOF token. The end-of-file, though, is really just a structural consideration; the shape of the AST will take over its role. So I want ANTLR to match the EOF but not include it in the AST. The ANTLR operator for this is the exclamation point:

equation
    : INTEGER EOF!
    ;

Now the EOF token is matched and thrown away. My integer node is the whole tree.

This leads to another observation: EOF was a sibling of the integer node. So strictly speaking, I didn't really have a tree. ANTLR's default action when putting together an AST is to take the production and create a list. I have to specify where tree roots are found in the production. For this integer production, that's not a big deal.

So, next, I want to write some addition expressions. I start with my tests:

public void testAdditionEquation() {
   assertNode(OPERATOR, "+", produce("equation", "1+2"));
}

OPERATOR causes a problem because it's a type of node completely unknown to ANTLR. I could use the PLUS, MINUS, etc. token types that come from the lexer, but in the long run, it's better to have just the one generic OPERATOR type in the parser.

To define this new node type, I add a tokens section to my grammar definition. After the options and before the Java code, I add this:

tokens {
        OPERATOR;
}

Now my code is happy, and I get the expected red bar from my tests.

The fix for the red bar is to add a production to the equation non-terminal. I start really basic:

equation
        : INTEGER EOF!
        | INTEGER PLUS INTEGER EOF!
        ;

If I run my code with this declaration, I still get a red bar. I've ignored a warning from ANTLR: these two productions are ambiguous. Currently ANTLR is using just a one-token lookahead. But that's not enough to figure out which production to follow when the next token in an integer. It could be a single integer or an integer followed by a plus sign.

One solution would be to left factor the language. I loose some structure in the grammar, so I prefer to increase my lookahead. I do this in the options section by setting k=2;. This removes the ambiguity from the grammar: ANTLR is happy, and the compiler is happy. The unit test is still a red bar.

The problem is that I'm not yet returning an OPERATOR node. Before I worry about the type of node, I should first work on the structure. I'm actually return a list of nodes still. Instead, I want a root node with some children. I'm going to make the PLUS node my root like so:

: INTEGER PLUS^ INTEGER EOF!

My tests still red bar, but at least it's complaining about the PLUS token. That is the node that's being returned. The other nodes (the two INTEGERs since the EOF is ignored) are children of the PLUS node. So finally, I change the type of this node:

| INTEGER p:PLUS^ INTEGER EOF!
                { #p.setType(OPERATOR); }

The p: let's me refer to the PLUS node with a name. The action here (the code between the curley braces) is to refer to the p node and set its type to be an OPERATOR. I've left its text alone, so the "+" remains in that node.

My unit tests pass!

One thing bothers me now: I've had to deal with that EOF twice now. Perhaps I should refactor my grammar? I move the interesting productions down a level like so:

equation
        : e EOF!
        ;
e       : INTEGER
        | INTEGER p:PLUS^ INTEGER
                { #p.setType(OPERATOR); }
        ;

My unit tests still pass: the equation() returns the AST returned by e() (and throws away the EOF); e() does what equation() used to do.

So I add yet another assertion for addition equations:

assertNode(OPERATOR, "+", produce("equation", "1+2+3"));

It fails. There are many ways to solve this. One option would be to turn one or the other INTEGER tokens into recursive calls to e (e.g., e p:PLUS^ INTEGER). These lead to a variety of parsing problems. ANTLR provides its own solution which is a little more useful (at least I think so).

ANTLR allows regular expressions in these productions. To handle 1+2+3, I want to repeat the PLUS INTEGER pattern multiple times. The positive closure is really handy:

e     : INTEGER
        | INTEGER ( p:PLUS^ INTEGER { #p.setType(OPERATOR); } )+
        ;

The positive closure allows the input to have multiple additions in the same expression. Each plus sign ultimately becomes the root of a new AST. Since the parsing is done left to right, the first plus sign has the first two integers as its children; the second plus sign has the first AST and the third integer as its children; the third plus sign has the second AST and the fourth integer ast its children; and so on... That's exactly what we want since addition is left associative!

I'm not sure I believe this, so I write tests:

public void testAdditionEquation() {
    assertNode(OPERATOR, "+", produce("equation", "1+2"));
    assertNode(OPERATOR, "+", produce("equation", "1+2+3"));
    assertNode(OPERATOR, "+", produce("equation", "1+2+3").getFirstChild());
    assertNode(INTEGER, "1", produce("equation", "1+2+3").getFirstChild()
            .getFirstChild());
    assertNode(INTEGER, "2", produce("equation", "1+2+3").getFirstChild()
            .getFirstChild().getNextSibling());
    assertNode(INTEGER, "3", produce("equation", "1+2+3").getFirstChild()
            .getNextSibling());
}

[[As you might be able to tell, this is where my supporting code lacks the most: deep comparisons of ASTs!]]

I'm golden!

What about substraction?

public void testSubtractionEquation() {
    assertNode(OPERATOR, "-", produce("equation", "1-2"));
    assertNode(OPERATOR, "-", produce("equation", "1+2-3"));
    assertNode(OPERATOR, "+", produce("equation", "1+2-3").getFirstChild());
    assertNode(INTEGER, "1", produce("equation", "1+2-3").getFirstChild()
            .getFirstChild());
    assertNode(INTEGER, "2", produce("equation", "1+2-3").getFirstChild()
            .getFirstChild().getNextSibling());
    assertNode(INTEGER, "3", produce("equation", "1+2-3").getFirstChild()
            .getNextSibling());
}

Actually, this also tests addition. Regardless, I get a red bar. The solution is bulky:

e     : INTEGER
        | INTEGER
                (
                        ( p:PLUS^ INTEGER { #p.setType(OPERATOR); }
                        | m:MINUS^ INTEGER { #m.setType(OPERATOR); }
                        )
                )+
        ;

Alternation over the two operators doesn't work because you can't name them properly; making them roots must be done immediately. But this does green bar.