Strings
FrontEnd, Iteration 4
User Story
You need some specifications for what a string is: what kind of punctuation do you put around a literal string? do you put punctuation around it (e.g., LaTeX)? what special characters are allowed?
User Story #4
Strings in Schobbes A string literal is a sequence of characters in single quotes. Any character can be in a string except for a single quote.
It turns out that the "any character" is actually a blessing!
Acceptance Tests
Write at least four acceptance tests for interpreting strings in Schobbes.
Here's an example:
a simple string ==== source 'hey' ==== execution hey
Some things to consider in designing these tests:
- The empty string is always interesting to test.
- What about a string with something that looks like an integer in it?
- What about punctuation in a string?
- Try a full sentence.
A test with a string that looks like an integer might pass because ANTLR tries to do some error recovery, and when it strips off the single quotes, you have a valid integer!
To the Lexer!
You should get complaints from the lexer that '
and
other letters and punctuation aren't "viable alternatives".
Add a shouldRecognizeStrings()
to
SchobbesLexerTest
. Start with an assertion for the
empty string. Compilation error.
You have to predict the type of the string token;
STRING
seems reasonable:
assertToken(SchobbesParser.STRING, "''", myTester.scanInput("''"));
What might be a little bit surprising is that the single quotes
are in the text component of the token. Why not toss them away
since the single quotes are an indication in the source
code that this is string? (Once you have a STRING
token, the single quotes are just redundant.) The problem is that
ANTLR doesn't let you toss away any of the input in the lexer. It
turns out that the easiest place to get rid of these single quotes
is in the TIR builder.
To get rid of the compilation error, you need a
STRING
rule in your lexer.
Define a STRING
in your lexer as just the literal
'x'
. Red bar.
The complaint should be about the single quotes now.
Redefine STRING
to be two single quotes.
You'll have to escape the single quotes with backslashes in the ANTLR grammar.
Add two more assertions: one for a single character, the other for that same character repeated five times.
assertToken(SchobbesParser.STRING, "'a'", myTester.scanInput("'a'")); assertToken(SchobbesParser.STRING, "'aaaaa'", myTester.scanInput("'aaaaa'"));
This should motivate the use of a Kleene star.
Fix the STRING
rule. Green bar.
Add an assertion with many different lowercase letters only.
(Try to work in both 'a' and 'z'.) Red bar. Fix STRING
with a range. Green bar.
You could keep building the regular expression for
STRING
this way, although this is going to lead you to
a rather large and interesting alternation of ranges. Let's go for
one fell swoop.
Add at least three more assertions. Red bar.
Things to consider: digits, strings that look like an integer, capital letters, punctuation, spaces.
Here's the lexer operator for the fell swoop: ~'a'
will match any character except for an
a
. The ~
is known as a negation
operator; it negates the set of characters it operates on.
~('a'..'z')
would match any character not in
the range a
through z
.
Fix STRING
. Green bar. CIAT tests still fail.
To the Parser!
The error messages from the failing CIAT tests are not clearly due to the parser. The parser does the best it can and passes whatever it can to the TIR builder, and since the parser has made a mess of things, the TIR builder has a lot to complain about. But the problem right now is in the parser.
Create a test method shouldParseString()
in
SchobbesParserTest
with three assertions in it. Red
bar.
At least one of the assertions should involve the empty string. You don't have to be comprehensive with the types of characters in the string since that's the responsibility of the lexer. Just make sure the parser can recognize strings. Here's an example:
assertTree( SchobbesParser.STRING, "('hey')", myTester.scanInput("'hey'").parseAs("program") );
Watch out for those pesky single quotes. You still have to deal with them here!
The solution here is to allow a program
in the
parser to be an INTEGER
or a STRING
followed by an end-of-file. Based on experience with ANTLR, it is
very useful to let your top-level rule (i.e.,
program
, the one that deals with the end-of-file) to
be as simple as possible. So the first step here is to refactor
the grammar.
Comment out the string assertions in
SchobbesParserTest
. Green bar.
Create a new parser rule named atom
which parses
just an INTEGER
. Green bar.
ANTLR enforces the property that the name of all parser rules consist of lowercase letters and the name of lexer rules consist of uppercase letters.
program
right now (and after you define
atom
) parses an INTEGER
followed by an
EOF
.
Replace the INTEGER
in the program
rule with atom
. Green bar.
Uncomment the string assertions. Red bar.
So now a program
looks first for an
atom
. Just as in chemistry, an atom is a basic
building block of a larger program, like an integer or string. And
that's exactly what an atom
needs to be: an
INTEGER
or a STRING
.
And, yes, all of the ANTLR operators that you use in a lexer can also be used in a parser.
Fix the definition of atom
. Green bar. CIAT tests
still fail.
And to the TIR Builder!
The TIR builder is still complaining. If you look at it, it's no
wonder: you're only dealing with an INTEGER
AST
there.
Create a new method shouldBuildStringTIRs()
in
HobbesTIRBuilderTest
. Write three assertions for
StringETIR
s. Red bar.
Here's a example:
assertEquals( new StringETIR("'hey'"), myTester.scanInput("'hey'").parseAs("program").treeParseAs("program") );
It turns out that some of the strange errors from the TIR builder are due to that builder ignoring the end-of-file marker. Yes, the parser has already ignored it, but that's from the lexer's token stream. The parser itself, when it gives an AST to a tree grammar, adds yet another end-of-file (which should be called an "end-of-tree marker").
To get the builder on the right track, follow the same pattern
in the parser: a program
rule to process the AST and
handle the end-of-file; a second rule to do the actual
processing.
Rename the program
rule in the TIR builder as
expression
. Create a new program
rule to
handle the end-of-file. Red bar.
Here's that program
rule:
program returns [ExpressionTIR tir] : e=expression EOF { tir = e; } ;
The e=expression
calls the expression
rule which returns a TIR; that TIR is assigned to a variable
e
. The Java code attached to this rule assigns that
value to the return value, tir
. Since the end-of-file
(EOF
) is not used to build the return value, it's used
only to mark the end of the AST and does not contribute to the
TIR.
When this fails now, it will be a mismatch between your
StringETIR
and null
. This is because
expression
skips the INTEGER
case when it
encounters a STRING
. (Why it requires you to put
EOF
into your tree grammar before it does this very
sensible thing is a bit confusing to me.)
So what's the fix? Well, it'd be silly for ANTLR to introduce a bunch of new operators for the same actions in a tree grammar, so, yes, the alternation, concatenation, and Kleene stars work in your tree grammar.
It's important to know, though, that each alternative in an alternation can have it's own Java-code action.
forExample : i=INTEGER { System.out.println("I see an integer!"); } | s=STRING { System.out.println("I see a string!"); } ;
Fix expression
in the TIR builder. Green bar.
CIAT tests still fail.
Single Quote FAIL
All the other CIAT failures were technically errors because the Schobbes interpreter generated an error. At this point, the interpreter finishes without an error, but what it produces is wrong. That's a failure.
Read the failures.
It's time to get rid of those single quotes. Before you give the
text to the StringETIR
constructor, you need
to strip the single quotes off.
Fix the assertions in shouldBuildStringTIRs()
. Red
bar. Fix expression
in the TIR builder. Green bars all
around.
Hint: if only there were another tree grammar you could look at
where someone has already had to strip those quotes off for a
StringETIR
...