Addition and Other Operators
FrontEnd, Iteration 5
Not the Whole User Story
The user story you're shooting for is to be able to parse and interpret fully recursive arithmetic expressions. For this particular iteration, you'll worry about non-recursive expressions (in addition to the integers and strings you already handle).
So these are valid expressions:
(+ 1 2) (* 8 9) (- 6 3) (/ 8 4) (+ 'foo' 4) (* 'foo' 4) (< 'abc' 'xyz') (= 'foo' 'foo')
These are not yet:
(+ 1 (* 2 3)) (* (/ 8 4) (+ 2 3))
Acceptance Tests
Write at least four acceptance tests for non-recursive arithmetic expressions. They fail.
The problem should be with the left parenthesis and the operators. Nothing in the front end knows how to handle them.
Lexer or Parser
Recognize an integer or a string is a matter of matching up
against patterns. An expression as a whole will have to be
recognized by the parser. But what about the parentheses and the
individual operator characters? There aren't exactly patterns here:
you match a +
or you don't.
ANTLR allows you to match these in the parser directly. Some grammar generators require you to use a token defined by the lexer, and ANTLR gives you this as an option. So we'll split the difference and do the recognition in both the lexer and parser.
The Lexer
So let's start with recognizing the operators in the lexer.
Create a test method shouldRecognizeOperators()
in
HobbesLexerTest
.
Here's an assertion for an operator token:
assertToken(HobbesParser.OPERATOR, "<", myTester.scanInput("<"));
Add this assertion to the test method. Compiler error.
HobbesParser.OPERATOR
isn't defined.
Create a lexer rule in your grammar for OPERATOR
that matches it with the letter a
. Red bar.
You know the test is being triggered since you have the red bar.
I believe there are five other operators you need to match. And it seems to me that if you test each one, there's no need to do any "refute token" assertions.
Change the definition of OPERATOR
to recognize the
<
character. Green bar.
Repeat for all of the other operators. Use =
for
the "equals" operator. Green bar for unit tests; CIAT tests still
fail.
OPERATOR
itself should match any of the operators.
The text of the token will keep track of which operator you've
actually matched.
The Parser
As mentioned above, many grammar generators will require you to recognize the left and right parentheses in your lexer, but let's try this in parser.
Create a new test method shouldParseExpressions()
in HobbesParserTest
.
An assertion:
assertTree( SchobbesParser.OPERATOR, "(<(1)(2))", myTester.scanInput("(< 1 2)").parseAs("program") );
Add this assertion. Red bar.
The assertion says that the root of the tree will be an
OPERATOR
. The second argument to the assertion is the
preorder traversal of the AST. That is, the operator
<
is at the root, and (1)
and
(2)
are the subtrees.
You should be able to parse any operator and any integer or
string arguments as an OPERATOR
expression.
Add two more assertions using different operators and arguments. And then comment all these new assertions out. Green bar.
The commenting out and green bar is so that you can do some
refactoring. The parser, as it exists right now, has a
program
rule to get things started. It handles the
EOF
. It's not a good idea to confuse a rule that
handles EOF
. (My opinion based on getting burned too
many times.)
The only other parser rule is the atom
rule.
Integers and strings are atoms because they're at the lowest,
basest level of any expression we might ever write in Schobbes. An
operator expression is not an atom.
So we need to introduce another rule for general
expressions. Right now, the only kind of expression you
have is an atom. So that means it should be easy to add an
expression
rule now.
Create a rule for expression
that recognizes a
single atom
. Green bar. Replace the atom
in the program
rule with expression
.
Green bar. Uncomment the assertions. Red bar.
So, now that the grammar is ready for a new
expression
rule, let's analyze a possible expression.
Let's consider (< 1 'foo')
(while this doesn't
interpret well (or at all), it will parse just
fine).
The whitespace is tossed, so we expect the lexer to return these
tokens: (
, <
(an
OPERATOR
), 1
(an INTEGER
),
'foo'
(a STRING
), and )
.
Is it important that it's a <
in this
expression? No! Any OPERATOR
would be fine.
Is it important that it's 1
as the first argument?
No! Is it important that it's an INTEGER
? No! It's
important (for now) that it's an atom
.
Is it important that it's 'foo'
as the second
argument? No! Is it important that it's a STRING
? No!
It's important (for now) that it's an atom
.
So an expression can be an OPERATOR
followed by an
atom
followed by an atom
.
All the rules you learned about lexer rules in ANTLR also apply to parsers. So if you want to match multiple things, just create a list of them (separated by whitespace).
If I were creating a grammar for very simple English sentences, I'd want to parse sentences that were a subject followed by a verb followed by an object:
sentence : NOUN VERB NOUN ;
This is using only lexer rules in its definition, but that's not
a requirement. If I had a nounPhrase
parser rule, I
could use that instead.
So what about the punctuation? As mentioned at the beginning of
this section of this iteration, some grammars will require you to
match the parentheses in the lexer first, and you'd have to use
tokens like LPAREN
and RPAREN
in the
parser. Many ANTLR grammars follow this same tradition. But it
looks a little bulky to me.
Consider, again, putting together English sentences:
sentence : NOUN VERB NOUN PERIOD ; PERIOD : '.' ;
versus
sentence : NOUN VERB NOUN '.' ;
Since the punctuation is going to be thrown away, there's little point in creating a whole separate rule for it.
Add an alternative to the expression
rule
for a parenthesized operator expression. Red bar.
The failure should be over your preorder traversal of the AST. The problem is that the rule is matching the tokens from the lexer, but the parser isn't building the right AST.
All of your previous rules have had to return one token
as the AST, including program
since it ignores the
EOF
token! But with an operator expression, you need
to bundle the operator and its arguments into the AST. Based on the
preorder traversal, the operator should be at the root, and the two
arguments as subtrees.
ANTLR provides a couple different ways to specify your AST.
You'll use a rewrite rule. A rewrite rule looks
like this: -> ^(root token
subtrees)
. Each alternative for a rule can have its
own rewrite. (Advanced hint: the atom
alternative does
not need a rewrite.)
ANTLR provides some incredible magic when writing the rewrite rule.
sentence : NOUN VERB NOUN '.' -> (VERB NOUN+) ;
First, the period is ignored in the AST because I never mention it. One problem solved.
Second, VERB
becomes the root of the AST. You're
free to rearrange the matched elements however you want in the
AST.
Third and most powerful, the NOUN+
(kind of a
reverse-Kleene star) results in adding both matched
NOUN
s as subtrees. It doesn't matter that the two
NOUN
s were separated by a VERB
. All of
them are collected in the AST as NOUN+
.
If your TIR builder tests fail, make some inconsequential change to the TIR builder grammar, save it (so that the ANTLR IDE regenerates its Java code), and run your tests again.
This is because the parser has changed some of its token types, and the ANTLR IDE doesn't realize that the tree parser needs to be regenerated.
Write a rewrite rule for the operator-expression alternative of
expression
. Green bar.
The TIR Builder
Run the CIAT tests. They fail.
The problem will be in the TIR builder.
Create a test method shouldBuildOperatorETIRs()
in
SchobbesTIRBuilderTest
. Add three assertions. Red
bar.
An example assertion:
assertEquals( new OperatorETIR( new IntegerETIR(1), Operator.LESS_THAN, new IntegerETIR(2) ), myTester.scanInput("(< 1 2)") .parseAs("program") .treeParseAs("program") );
Keep in mind that strings have their single quotes stripped off by the TIR builder.
The solution here is to add another alternative to the
expression
rule of the TIR builder. But this begs a
question: why isn't there a separate atom
rule in the
TIR builder?
In general, a tree parser can (and will, in my experience) have fewer rules than a parser. This is because the parser needs to be able to handle many different kinds of patterns, and some of these patterns are hard (or impossible) to detect all at the same time.
Check out the Hobbes parser sometime. The
expression
, term
, and factor
levels are necessary to get proper associativity and
precedence in binary expressions. Contrast this with the Hobbes TIR
builder which is much simpler.
A tree parser is based on the token types in the roots of ASTs—a very easy pattern! The hard work is already done. This results in fewer rules for the tree parser.
Let's tree parse an English sentence:
sentence returns [Sentence s] : ^(verb=VERB subject=NOUN object=NOUN) { s = new Sentence(subject.getText(), verb.getText(), object.getText()); }
Since I matched lexer tokens directly in the parser, I use those
tokens in the pattern. verb
, subject
, and
object
are names I give to the components of the AST
so that I can use them in the Java code.
Later on, if I switched to having full noun phrases, the rule might become this:
sentence returns [Sentence s] : ^(verb=VERB subject=nounPhrase object=nounPhrase) { s = new Sentence(subject, verb.getText(), object); }
where nounPhrase
would have to be another rule in
the tree parser. In other grammars, it might even make sense to
call a rule recursively!
Recursively calling the expression
rule for the two
subtrees will take care of them. Converting the operator will take
just a little work. Here's an assertion from the CITKit
library:
assertEquals( OperatorETIR.Operator.MULTIPLY, OperatorETIR.Operator.convertPunctuation("*") );
You give the text of an OPERATOR
token to this
static method, and you'll get the right Operator
out.
(Also take a look at the Hobbes TIR builder. It has a special case
to handle (the equals operator), and it has a more interesting AST
to deal with, but it does use this method.)
Add an alternative to expression
in the TIR builder
to handle operator expressions. Green bar. CIAT tests all pass!