CS 214: Programming Languages
Spring 2009

Home|Syllabus|Schedule
<<|>>|ANTLR API|ANTLR Testing API|CITkit API

Integers
FrontEnd, Iteration 3

Crash Course in ANTLR Lexers

ANTLR uses (mostly) regular expressions to recognize tokens in the lexer. Here are the operators you can use in ANTLR's metalanguage for lexers.

Help yourself out by opening Hobbes.g in the Hobbes front end JAR file.

Overall Syntax

ANTLR always uses single quotes for strings of any length.

A grammar consists of rules. Rules are written in ANTLR in this form:

TOKEN : pattern ;

The colon and semicolon are required. You can put whatever whitespace you want around the punctuation. The TOKEN becomes a token type that the lexer recognizes, and it is used as the name of a constant in the lexer and parser classes. The pattern is a regular expression.

Typically, lexer rules are written at the end of a grammar file.

Alternation

VOWEL: 'a' | 'e' | 'i' | 'o' | 'u';

Any one of these characters is recognized as a VOWEL.

assertToken(VOWEL, "a", myTester.scanInput("a"));
assertToken(VOWEL, "o", myTester.scanInput("o"));
assertToken(VOWEL, "e", myTester.scanInput("e"));

But only one of them at a time:

refuteToken(VOWEL, myTester.scanInput("ae"));
refuteToken(VOWEL, myTester.scanInput("aeio"));

Ranges

ANTLR allows you to specify ranges of characters:

DIGIT
  : '0'..'9'
  ;

The range is based on the Unicode order of the characters. If you're not sure the characters you're interested in are consecutive, consult a Unicode chart or use an explicit alternation.

To prove a point. Never write a loop in your tests.

for (char ch = '0'; ch <= '9'; ch++) {
  String str = String.valueOf(ch);
  assertToken(DIGIT, str, myTester.scanInput(str));
}

This is a special form of alternation, so again it only matches one character from the input.

Concatenation

What if you want to match several characters in a row?

FOOBAR
  : 'foobar'
  ;

This recognizes one stream of characters only.

assertToken(FOOBAR, "foobar", myTester.scanInput("foobar"));

Nothing else matches, even variations on "foobar".

refuteToken(FOOBAR, myTester.scanInput("Foobar"));
refuteToken(FOOBAR, myTester.scanInput("foo"));
refuteToken(FOOBAR, myTester.scanInput("foo bar"));
refuteToken(FOOBAR, myTester.scanInput("foobar4"));
refuteToken(FOOBAR, myTester.scanInput("123"));

You can concatenate other rules:

SILLY
  : FOOBAR DIGIT
  ;
assertToken(SILLY, "foobar4", myTester.scanInput("foobar4"));

Kleene Star

Repetition is accomplished in a regular expression with a Kleene star.

WHY
  : FOOBAR*

This matches any number of FOOBARs.

assertToken(WHY, "", myTester.scanInput(""));
assertToken(WHY, "foobar", myTester.scanInput("foobar"));
assertToken(
  WHY, "foobarfoobarfoobar", 
  myTester.scanInput("foobarfoobarfoobar")
  );

Sometimes you need to match one or more:

WHY
  : FOOBAR+
refuteToken(WHY, myTester.scanInput(""));
assertToken(WHY, "foobar", myTester.scanInput("foobar"));
assertToken(
  WHY, "foobarfoobarfoobar", 
  myTester.scanInput("foobarfoobarfoobar")
  );

* and + have a high precedence, so parenthesize the expression being repeated.

Lexer Processing

ANTLR generates Java code from your grammar which will automatically pick the best tokens to use based on the input characters. As characters from the input are consumed, tokens are spit out for the parser.

You don't need to worry about your lexer at this level. This processing is part of the "generate Java code for a lexer" deal with using ANTLR (or any front-end generator).

What you do need to worry about is what each lexer rule does and does not try to recognize.

Interpreting Schobbes Integers

Write three more CIAT tests for Schobbes involving integers. CIAT tests red bar.

In order to write effective tests, you need to know what we consider to be an integer: an integer consists of any non-zero digit followed by any digit 0 through 9.

The problem will start in the lexer. Read the error message from the CIAT tests to see this.

Scanning Integers

Write three more assertions for shouldRecognizeIntegers() in SchobbesLexerTest using different integers. Red bar.

Do not test negative numbers. Do not test numbers larger than 231. Do not test floating point numbers. Do work in every possible digit (0 through 9). Do test numbers of different magnitudes.

Create a new test method shouldRejectNonIntegers() with five rejections. Red bar, but not for this method!

Rejections are written using refuteToken(TYPE, myTester.scanInput(input));, as seen above. The input string might be valid code, but refuteToken() will reject it if it doesn't scan as one token of the specified type.

Some things to reject: input that's clearly not an integer (like "hey"), a negative number (for now), a string ("hey"), floating point numbers, two integers ("123 456"). Do not worry about numbers that are too large since this is easier to handle at a later stage.

Fix your Schobbes grammar. Green bar.

Parsing Integers

Run the CIAT tests. They fail.

The failure shouldn't come from the parser, though. You changed its definition of an INTEGER, but it's fine with the new definition. However, you should add some unit tests to make sure it stays working.

Add three more assertions to shouldParseIntegers(). Green bar.

Integer TIRs

The CIAT tests are complaining about the TIR builder. This is because no matter what integer you parse, the TIR builder always returns a 5!

Add three more assertions to shouldBuildIntegerTIRs(). Red bar.

Presently, you should have a tree parser with this rule:

program returns [ExpressionTIR tir]
  : INTEGER
    { tir = new IntegerETIR(5); }
  ;

The INTEGER in this rule matches any INTEGER that the tree builder finds. You just need some way to refer to that integer in the Java code so that you can replace that 5 with something more general. The syntax for this is fairly simple: i=INTEGER instead of just the INTEGER. The i is now a variable that you can use in the Java code for this rule.

Instead of the 5, you want to get the text from the INTEGER in i. Well, i is a org.antlr.runtime.tree.Tree, and if that had a way to get the text (as a String)...

And if there were an IntegerETIR constructor that took a String...

Fix the tree builder. Green bars all around.

Hexadecimal Numbers

User Story #3

Recognize hexadecimal numbers Schobbes allows integers to be entered in hexadecimal with a 0x prefix and digits 0 through 9 and a through f (lowercase only).

Write at least three CIAT tests. The output will be in decimal. Red bar.

The entirety of the work for this is in the lexer. The IntegerETIR constructor already handles hexadecimal numbers. The parser and TIR builder otherwise only care about recognizing INTEGERs, not what kind of integer they are.

So add to the INTEGER rule.

Keep in mind that a through f are valid digits only if the integer begins with 0x. Write refuting assertions to ensure this!

Write a test method shouldRecognizeHexadecimalNumbers() with at least four assertions. Red bar. Fix the lexer. Green bars all around.