Addition and Other Operators
FrontEnd, Iteration 5

Not the Whole User Story

The user story you're shooting for is to be able to parse and interpret fully recursive arithmetic expressions. For this particular iteration, you'll worry about non-recursive expressions (in addition to the integers and strings you already handle).

So these are valid expressions:

(+ 1 2)
(* 8 9)
(- 6 3)
(/ 8 4)
(+ 'foo' 4)
(* 'foo' 4)
(< 'abc' 'xyz')
(= 'foo' 'foo')

These are not yet:

(+ 1 (* 2 3))
(* (/ 8 4) (+ 2 3))

Acceptance Tests

Write at least four acceptance tests for non-recursive arithmetic expressions. They fail.

The problem should be with the left parenthesis and the operators. Nothing in the front end knows how to handle them.

Lexer or Parser

Recognize an integer or a string is a matter of matching up against patterns. An expression as a whole will have to be recognized by the parser. But what about the parentheses and the individual operator characters? There aren't exactly patterns here: you match a + or you don't.

ANTLR allows you to match these in the parser directly. Some grammar generators require you to use a token defined by the lexer, and ANTLR gives you this as an option. So we'll split the difference and do the recognition in both the lexer and parser.

The Lexer

So let's start with recognizing the operators in the lexer.

Create a test method shouldRecognizeOperators() in HobbesLexerTest.

Here's an assertion for an operator token:

assertToken(HobbesParser.OPERATOR, "<", myTester.scanInput("<"));

Add this assertion to the test method. Compiler error.

HobbesParser.OPERATOR isn't defined.

Create a lexer rule in your grammar for OPERATOR that matches it with the letter a. Red bar.

You know the test is being triggered since you have the red bar.

I believe there are five other operators you need to match. And it seems to me that if you test each one, there's no need to do any "refute token" assertions.

Change the definition of OPERATOR to recognize the < character. Green bar.

Repeat for all of the other operators. Use = for the "equals" operator. Green bar for unit tests; CIAT tests still fail.

OPERATOR itself should match any of the operators. The text of the token will keep track of which operator you've actually matched.

The Parser

As mentioned above, many grammar generators will require you to recognize the left and right parentheses in your lexer, but let's try this in parser.

Create a new test method shouldParseExpressions() in HobbesParserTest.

An assertion:

assertTree(
  SchobbesParser.OPERATOR,
  "(<(1)(2))",
  myTester.scanInput("(< 1 2)").parseAs("program")
  );

Add this assertion. Red bar.

The assertion says that the root of the tree will be an OPERATOR. The second argument to the assertion is the preorder traversal of the AST. That is, the operator < is at the root, and (1) and (2) are the subtrees.

You should be able to parse any operator and any integer or string arguments as an OPERATOR expression.

Add two more assertions using different operators and arguments. And then comment all these new assertions out. Green bar.

The commenting out and green bar is so that you can do some refactoring. The parser, as it exists right now, has a program rule to get things started. It handles the EOF. It's not a good idea to confuse a rule that handles EOF. (My opinion based on getting burned too many times.)

The only other parser rule is the atom rule. Integers and strings are atoms because they're at the lowest, basest level of any expression we might ever write in Schobbes. An operator expression is not an atom.

So we need to introduce another rule for general expressions. Right now, the only kind of expression you have is an atom. So that means it should be easy to add an expression rule now.

Create a rule for expression that recognizes a single atom. Green bar. Replace the atom in the program rule with expression. Green bar. Uncomment the assertions. Red bar.

So, now that the grammar is ready for a new expression rule, let's analyze a possible expression. Let's consider (< 1 'foo') (while this doesn't interpret well (or at all), it will parse just fine).

The whitespace is tossed, so we expect the lexer to return these tokens: (, < (an OPERATOR), 1 (an INTEGER), 'foo' (a STRING), and ).

Is it important that it's a < in this expression? No! Any OPERATOR would be fine.

Is it important that it's 1 as the first argument? No! Is it important that it's an INTEGER? No! It's important (for now) that it's an atom.

Is it important that it's 'foo' as the second argument? No! Is it important that it's a STRING? No! It's important (for now) that it's an atom.

So an expression can be an OPERATOR followed by an atom followed by an atom.

All the rules you learned about lexer rules in ANTLR also apply to parsers. So if you want to match multiple things, just create a list of them (separated by whitespace).

If I were creating a grammar for very simple English sentences, I'd want to parse sentences that were a subject followed by a verb followed by an object:

sentence : NOUN VERB NOUN ;

This is using only lexer rules in its definition, but that's not a requirement. If I had a nounPhrase parser rule, I could use that instead.

So what about the punctuation? As mentioned at the beginning of this section of this iteration, some grammars will require you to match the parentheses in the lexer first, and you'd have to use tokens like LPAREN and RPAREN in the parser. Many ANTLR grammars follow this same tradition. But it looks a little bulky to me.

Consider, again, putting together English sentences:

sentence : NOUN VERB NOUN PERIOD ;
PERIOD : '.' ;

versus

sentence : NOUN VERB NOUN '.' ;

Since the punctuation is going to be thrown away, there's little point in creating a whole separate rule for it.

Add an alternative to the expression rule for a parenthesized operator expression. Red bar.

The failure should be over your preorder traversal of the AST. The problem is that the rule is matching the tokens from the lexer, but the parser isn't building the right AST.

All of your previous rules have had to return one token as the AST, including program since it ignores the EOF token! But with an operator expression, you need to bundle the operator and its arguments into the AST. Based on the preorder traversal, the operator should be at the root, and the two arguments as subtrees.

ANTLR provides a couple different ways to specify your AST. You'll use a rewrite rule. A rewrite rule looks like this: -> ^(root token subtrees). Each alternative for a rule can have its own rewrite. (Advanced hint: the atom alternative does not need a rewrite.)

ANTLR provides some incredible magic when writing the rewrite rule.

sentence
  : NOUN VERB NOUN '.'
    -> (VERB NOUN+)
  ;

First, the period is ignored in the AST because I never mention it. One problem solved.

Second, VERB becomes the root of the AST. You're free to rearrange the matched elements however you want in the AST.

Third and most powerful, the NOUN+ (kind of a reverse-Kleene star) results in adding both matched NOUNs as subtrees. It doesn't matter that the two NOUNs were separated by a VERB. All of them are collected in the AST as NOUN+.

If your TIR builder tests fail, make some inconsequential change to the TIR builder grammar, save it (so that the ANTLR IDE regenerates its Java code), and run your tests again.

This is because the parser has changed some of its token types, and the ANTLR IDE doesn't realize that the tree parser needs to be regenerated.

Write a rewrite rule for the operator-expression alternative of expression. Green bar.

The TIR Builder

Run the CIAT tests. They fail.

The problem will be in the TIR builder.

Create a test method shouldBuildOperatorETIRs() in SchobbesTIRBuilderTest. Add three assertions. Red bar.

An example assertion:

assertEquals(
  new OperatorETIR(
    new IntegerETIR(1), Operator.LESS_THAN, new IntegerETIR(2)
    ),
        myTester.scanInput("(< 1 2)")
          .parseAs("program")
          .treeParseAs("program")
        );

Keep in mind that strings have their single quotes stripped off by the TIR builder.

The solution here is to add another alternative to the expression rule of the TIR builder. But this begs a question: why isn't there a separate atom rule in the TIR builder?

In general, a tree parser can (and will, in my experience) have fewer rules than a parser. This is because the parser needs to be able to handle many different kinds of patterns, and some of these patterns are hard (or impossible) to detect all at the same time.

Check out the Hobbes parser sometime. The expression, term, and factor levels are necessary to get proper associativity and precedence in binary expressions. Contrast this with the Hobbes TIR builder which is much simpler.

A tree parser is based on the token types in the roots of ASTs—a very easy pattern! The hard work is already done. This results in fewer rules for the tree parser.

Let's tree parse an English sentence:

sentence returns [Sentence s]
  : ^(verb=VERB subject=NOUN object=NOUN)
    { s = new Sentence(subject.getText(), verb.getText(), object.getText()); }

Since I matched lexer tokens directly in the parser, I use those tokens in the pattern. verb, subject, and object are names I give to the components of the AST so that I can use them in the Java code.

Later on, if I switched to having full noun phrases, the rule might become this:

sentence returns [Sentence s]
  : ^(verb=VERB subject=nounPhrase object=nounPhrase)
    { s = new Sentence(subject, verb.getText(), object); }

where nounPhrase would have to be another rule in the tree parser. In other grammars, it might even make sense to call a rule recursively!

Recursively calling the expression rule for the two subtrees will take care of them. Converting the operator will take just a little work. Here's an assertion from the CITKit library:

assertEquals(
  OperatorETIR.Operator.MULTIPLY, 
  OperatorETIR.Operator.convertPunctuation("*")
  );

You give the text of an OPERATOR token to this static method, and you'll get the right Operator out. (Also take a look at the Hobbes TIR builder. It has a special case to handle (the equals operator), and it has a more interesting AST to deal with, but it does use this method.)

Add an alternative to expression in the TIR builder to handle operator expressions. Green bar. CIAT tests all pass!

CS 214: Programming Languages Spring 2009

Addition and Other Operators FrontEnd, Iteration 5