Strings
FrontEnd, Iteration 4

User Story

You need some specifications for what a string is: what kind of punctuation do you put around a literal string? do you put punctuation around it (e.g., LaTeX)? what special characters are allowed?

User Story #4

Strings in Schobbes A string literal is a sequence of characters in single quotes. Any character can be in a string except for a single quote.

It turns out that the "any character" is actually a blessing!

Acceptance Tests

Write at least four acceptance tests for interpreting strings in Schobbes.

Here's an example:

a simple string
==== source
'hey'
==== execution
hey

Some things to consider in designing these tests:

The empty string is always interesting to test.
What about a string with something that looks like an integer in it?
What about punctuation in a string?
Try a full sentence.

A test with a string that looks like an integer might pass because ANTLR tries to do some error recovery, and when it strips off the single quotes, you have a valid integer!

To the Lexer!

You should get complaints from the lexer that ' and other letters and punctuation aren't "viable alternatives".

Add a shouldRecognizeStrings() to SchobbesLexerTest. Start with an assertion for the empty string. Compilation error.

You have to predict the type of the string token; STRING seems reasonable:

assertToken(SchobbesParser.STRING, "''", myTester.scanInput("''"));

What might be a little bit surprising is that the single quotes are in the text component of the token. Why not toss them away since the single quotes are an indication in the source code that this is string? (Once you have a STRING token, the single quotes are just redundant.) The problem is that ANTLR doesn't let you toss away any of the input in the lexer. It turns out that the easiest place to get rid of these single quotes is in the TIR builder.

To get rid of the compilation error, you need a STRING rule in your lexer.

Define a STRING in your lexer as just the literal 'x'. Red bar.

The complaint should be about the single quotes now.

Redefine STRING to be two single quotes.

You'll have to escape the single quotes with backslashes in the ANTLR grammar.

Add two more assertions: one for a single character, the other for that same character repeated five times.

assertToken(SchobbesParser.STRING, "'a'", myTester.scanInput("'a'"));
assertToken(SchobbesParser.STRING, "'aaaaa'", myTester.scanInput("'aaaaa'"));

This should motivate the use of a Kleene star.

Fix the STRING rule. Green bar.

Add an assertion with many different lowercase letters only. (Try to work in both 'a' and 'z'.) Red bar. Fix STRING with a range. Green bar.

You could keep building the regular expression for STRING this way, although this is going to lead you to a rather large and interesting alternation of ranges. Let's go for one fell swoop.

Add at least three more assertions. Red bar.

Things to consider: digits, strings that look like an integer, capital letters, punctuation, spaces.

Here's the lexer operator for the fell swoop: ~'a' will match any character except for an a. The ~ is known as a negation operator; it negates the set of characters it operates on. ~('a'..'z') would match any character not in the range a through z.

Fix STRING. Green bar. CIAT tests still fail.

To the Parser!

The error messages from the failing CIAT tests are not clearly due to the parser. The parser does the best it can and passes whatever it can to the TIR builder, and since the parser has made a mess of things, the TIR builder has a lot to complain about. But the problem right now is in the parser.

Create a test method shouldParseString() in SchobbesParserTest with three assertions in it. Red bar.

At least one of the assertions should involve the empty string. You don't have to be comprehensive with the types of characters in the string since that's the responsibility of the lexer. Just make sure the parser can recognize strings. Here's an example:

assertTree(
  SchobbesParser.STRING, 
  "('hey')", 
  myTester.scanInput("'hey'").parseAs("program")
  );

Watch out for those pesky single quotes. You still have to deal with them here!

The solution here is to allow a program in the parser to be an INTEGER or a STRING followed by an end-of-file. Based on experience with ANTLR, it is very useful to let your top-level rule (i.e., program, the one that deals with the end-of-file) to be as simple as possible. So the first step here is to refactor the grammar.

Comment out the string assertions in SchobbesParserTest. Green bar.

Create a new parser rule named atom which parses just an INTEGER. Green bar.

ANTLR enforces the property that the name of all parser rules consist of lowercase letters and the name of lexer rules consist of uppercase letters.

program right now (and after you define atom) parses an INTEGER followed by an EOF.

Replace the INTEGER in the program rule with atom. Green bar.

Uncomment the string assertions. Red bar.

So now a program looks first for an atom. Just as in chemistry, an atom is a basic building block of a larger program, like an integer or string. And that's exactly what an atom needs to be: an INTEGER or a STRING.

And, yes, all of the ANTLR operators that you use in a lexer can also be used in a parser.

Fix the definition of atom. Green bar. CIAT tests still fail.

And to the TIR Builder!

The TIR builder is still complaining. If you look at it, it's no wonder: you're only dealing with an INTEGER AST there.

Create a new method shouldBuildStringTIRs() in HobbesTIRBuilderTest. Write three assertions for StringETIRs. Red bar.

Here's a example:

assertEquals(
  new StringETIR("'hey'"),
  myTester.scanInput("'hey'").parseAs("program").treeParseAs("program")
  );

It turns out that some of the strange errors from the TIR builder are due to that builder ignoring the end-of-file marker. Yes, the parser has already ignored it, but that's from the lexer's token stream. The parser itself, when it gives an AST to a tree grammar, adds yet another end-of-file (which should be called an "end-of-tree marker").

To get the builder on the right track, follow the same pattern in the parser: a program rule to process the AST and handle the end-of-file; a second rule to do the actual processing.

Rename the program rule in the TIR builder as expression. Create a new program rule to handle the end-of-file. Red bar.

Here's that program rule:

program returns [ExpressionTIR tir]
  : e=expression EOF
    { tir = e; }
  ;

The e=expression calls the expression rule which returns a TIR; that TIR is assigned to a variable e. The Java code attached to this rule assigns that value to the return value, tir. Since the end-of-file (EOF) is not used to build the return value, it's used only to mark the end of the AST and does not contribute to the TIR.

When this fails now, it will be a mismatch between your StringETIR and null. This is because expression skips the INTEGER case when it encounters a STRING. (Why it requires you to put EOF into your tree grammar before it does this very sensible thing is a bit confusing to me.)

So what's the fix? Well, it'd be silly for ANTLR to introduce a bunch of new operators for the same actions in a tree grammar, so, yes, the alternation, concatenation, and Kleene stars work in your tree grammar.

It's important to know, though, that each alternative in an alternation can have it's own Java-code action.

forExample
  : i=INTEGER
    { System.out.println("I see an integer!"); }
  | s=STRING
    { System.out.println("I see a string!"); }
  ;

Fix expression in the TIR builder. Green bar. CIAT tests still fail.

Single Quote FAIL

All the other CIAT failures were technically errors because the Schobbes interpreter generated an error. At this point, the interpreter finishes without an error, but what it produces is wrong. That's a failure.

Read the failures.

It's time to get rid of those single quotes. Before you give the text to the StringETIR constructor, you need to strip the single quotes off.

Fix the assertions in shouldBuildStringTIRs(). Red bar. Fix expression in the TIR builder. Green bars all around.

Hint: if only there were another tree grammar you could look at where someone has already had to strip those quotes off for a StringETIR...

CS 214: Programming Languages Spring 2009

Strings FrontEnd, Iteration 4