Integers
FrontEnd, Iteration 3
Crash Course in ANTLR Lexers
ANTLR uses (mostly) regular expressions to recognize tokens in the lexer. Here are the operators you can use in ANTLR's metalanguage for lexers.
Help yourself out by opening Hobbes.g
in the Hobbes
front end JAR file.
Overall Syntax
ANTLR always uses single quotes for strings of any length.
A grammar consists of rules. Rules are written in ANTLR in this form:
TOKEN : pattern ;
The colon and semicolon are required. You can put whatever
whitespace you want around the punctuation. The TOKEN
becomes a token type that the lexer recognizes, and it is used as
the name of a constant in the lexer and parser classes. The
pattern
is a regular expression.
Typically, lexer rules are written at the end of a grammar file.
Alternation
VOWEL: 'a' | 'e' | 'i' | 'o' | 'u';
Any one of these characters is recognized as a
VOWEL
.
assertToken(VOWEL, "a", myTester.scanInput("a")); assertToken(VOWEL, "o", myTester.scanInput("o")); assertToken(VOWEL, "e", myTester.scanInput("e"));
But only one of them at a time:
refuteToken(VOWEL, myTester.scanInput("ae")); refuteToken(VOWEL, myTester.scanInput("aeio"));
Ranges
ANTLR allows you to specify ranges of characters:
DIGIT : '0'..'9' ;
The range is based on the Unicode order of the characters. If you're not sure the characters you're interested in are consecutive, consult a Unicode chart or use an explicit alternation.
To prove a point. Never write a loop in your tests.
for (char ch = '0'; ch <= '9'; ch++) { String str = String.valueOf(ch); assertToken(DIGIT, str, myTester.scanInput(str)); }
This is a special form of alternation, so again it only matches one character from the input.
Concatenation
What if you want to match several characters in a row?
FOOBAR : 'foobar' ;
This recognizes one stream of characters only.
assertToken(FOOBAR, "foobar", myTester.scanInput("foobar"));
Nothing else matches, even variations on
"foobar"
.
refuteToken(FOOBAR, myTester.scanInput("Foobar")); refuteToken(FOOBAR, myTester.scanInput("foo")); refuteToken(FOOBAR, myTester.scanInput("foo bar")); refuteToken(FOOBAR, myTester.scanInput("foobar4")); refuteToken(FOOBAR, myTester.scanInput("123"));
You can concatenate other rules:
SILLY : FOOBAR DIGIT ;
assertToken(SILLY, "foobar4", myTester.scanInput("foobar4"));
Kleene Star
Repetition is accomplished in a regular expression with a Kleene star.
WHY : FOOBAR*
This matches any number of FOOBAR
s.
assertToken(WHY, "", myTester.scanInput("")); assertToken(WHY, "foobar", myTester.scanInput("foobar")); assertToken( WHY, "foobarfoobarfoobar", myTester.scanInput("foobarfoobarfoobar") );
Sometimes you need to match one or more:
WHY : FOOBAR+
refuteToken(WHY, myTester.scanInput("")); assertToken(WHY, "foobar", myTester.scanInput("foobar")); assertToken( WHY, "foobarfoobarfoobar", myTester.scanInput("foobarfoobarfoobar") );
*
and +
have a high precedence, so
parenthesize the expression being repeated.
Lexer Processing
ANTLR generates Java code from your grammar which will automatically pick the best tokens to use based on the input characters. As characters from the input are consumed, tokens are spit out for the parser.
You don't need to worry about your lexer at this level. This processing is part of the "generate Java code for a lexer" deal with using ANTLR (or any front-end generator).
What you do need to worry about is what each lexer rule does and does not try to recognize.
Interpreting Schobbes Integers
Write three more CIAT tests for Schobbes involving integers. CIAT tests red bar.
In order to write effective tests, you need to know what we consider to be an integer: an integer consists of any non-zero digit followed by any digit 0 through 9.
The problem will start in the lexer. Read the error message from the CIAT tests to see this.
Scanning Integers
Write three more assertions for
shouldRecognizeIntegers()
in
SchobbesLexerTest
using different integers. Red
bar.
Do not test negative numbers. Do not test numbers larger than 231. Do not test floating point numbers. Do work in every possible digit (0 through 9). Do test numbers of different magnitudes.
Create a new test method shouldRejectNonIntegers()
with five rejections. Red bar, but not for this method!
Rejections are written using refuteToken(TYPE,
myTester.scanInput(input));
, as seen above. The
input
string might be valid code, but
refuteToken()
will reject it if it doesn't scan as
one token of the specified type.
Some things to reject: input that's clearly not an integer (like
"hey"
), a negative number (for now), a string
("hey"
), floating point numbers, two integers
("123 456"
). Do not worry about numbers that
are too large since this is easier to handle at a later stage.
Fix your Schobbes grammar. Green bar.
Parsing Integers
Run the CIAT tests. They fail.
The failure shouldn't come from the parser, though. You changed
its definition of an INTEGER
, but it's fine with the
new definition. However, you should add some unit tests to make
sure it stays working.
Add three more assertions to shouldParseIntegers()
.
Green bar.
Integer TIRs
The CIAT tests are complaining about the TIR builder. This is because no matter what integer you parse, the TIR builder always returns a 5!
Add three more assertions to
shouldBuildIntegerTIRs()
. Red bar.
Presently, you should have a tree parser with this rule:
program returns [ExpressionTIR tir] : INTEGER { tir = new IntegerETIR(5); } ;
The INTEGER
in this rule matches any
INTEGER
that the tree builder finds. You just need
some way to refer to that integer in the Java code so that you can
replace that 5
with something more general. The syntax
for this is fairly simple: i=INTEGER
instead of just
the INTEGER
. The i
is now a variable that
you can use in the Java code for this rule.
Instead of the 5
, you want to get the text from the
INTEGER
in i
. Well, i
is a
org.antlr.runtime.tree.Tree
, and if that had a way
to get the text (as a String
)...
And if there were an IntegerETIR
constructor that
took a String
...
Fix the tree builder. Green bars all around.
Hexadecimal Numbers
User Story #3
Recognize hexadecimal numbers Schobbes allows integers to be
entered in hexadecimal with a 0x
prefix and digits 0
through 9 and a through f (lowercase only).
Write at least three CIAT tests. The output will be in decimal. Red bar.
The entirety of the work for this is in the lexer. The
IntegerETIR
constructor already handles
hexadecimal numbers. The parser and TIR builder otherwise only care
about recognizing INTEGER
s, not what kind of integer
they are.
So add to the INTEGER
rule.
Keep in mind that a
through f
are
valid digits only if the integer begins with
0x
. Write refuting assertions to ensure
this!
Write a test method
shouldRecognizeHexadecimalNumbers()
with at least four
assertions. Red bar. Fix the lexer. Green bars all around.