Calvin seal CS 382: Special Topics - Compilers
Spring 2005

Hobbes

What is Hobbes?

Andrew Appel, in his compiler textbooks, proposes a relatively simple language named "Tiger" that can be easily extended with more advanced features. However, we are not using one of his books, there are more than a few obnoxious parsing issues with Tiger, and there are just too many things being called "Tiger" these days.

Consquently, I've modified Appel's Tiger, and I've called the resulting language "Hobbes".

Support Files

There are a couple support files for the Hobbes AST that serves as our intermediate representation. These files were written with Java 2 SDK 5.0, but they should be backwards compatible if you're using an older SDK; let me know if they cause a problem with older SDKs.

Notation

The tokens in this document are abbreviated (e.g., id for "identifier"); if any are unclear, let me know. Different names may be used for different semantic roles, not syntactical difference; for example, syntactically id and typeId are exactly the same, but semantically the former specifies any identifier while the latter is specifically for a type identifier (which can be determined only during the semantic analysis).

Non-terminals in the grammar are denoted in an emphasized font (e.g., id, type). Words (including punctuation!) in a monospaced font are keywords or operators. The productions also use the regular-expression star (i.e., *) and Klene star (i.e., +); the meanings are the same here. Parentheses are used to group tokens.

The empty string is represented by epsilon.

Lexical Issues

An identifier is a sequence of letters, digits, and underscores, starting with a letter. Case matters (e.g., foobar is different from FooBar).

Comments can appear between any two tokens. Comments start with /* and end with */; they cannot be nested.

Declarations

Variables, types, and functions must all be declared.

decs --> dec*
dec --> typeDec
--> varDec
--> funcDec

Data Types

Hobbes has four types of data: integers, strings, arrays, and records. Hobbes also has type aliases.

typeDec --> type id = type
type --> id
--> { typeFields }
--> array of id
typeFields --> epsilon
--> fieldDec (, fieldDec)*
fieldDec --> id : typeId

The integer type is named int; the string type is named string.

A record type is specified by a list of the fields of the record. The two identifiers in a field declaration are the field name and the field's type (which must be an identifier); the field's name comes first. Different records can use the same field names.

The length of an array is not part of its type. The length is determined at runtime.

Each declaration of an array or record type is unique and incompatible with all other type declarations even if the component types are simliar.

Data types can be mutually recursive:

type intlist = { head: int, tail: intlist }

type tree = { key: int, children: treelist }
type treelist = { head: tree, tail: treelist }

New type expressions create new data types (e.g., intlist and treelist above would be different data types even if their heads were the same type). However, a type variables are transitive:

type a = array of int;
type b = a;
type c = b;

Types a, b, and c are all the same data type.

Variables

vardec --> var id := exp
--> var id : typeId := exp

Every variable must be declared before it is used; every variable must be initialized (the exp in these productions). However, not every variable must be typechecked.

Since nil is a special value (the null pointer), its specific type at a given use cannot be inferred, so a variable declarations that initialize a variable to nil must also specify the data type of the variable (i.e., the second production).

Functions

funcDec --> function id ( typeFields ) = exp
--> function id ( typeFields ) : typeId = exp

The first production is for a producedure (which does not return a value); the second production is for a function (which, as you might guess, does return a value).

All arguments are passed by value.

Function may be recursive:

function treeLeaves(t: tree) : int =
  if t = nil then 1
        else treelistLeaves(t.children)
        endif
function treelistLeaves(L: treelist) : int =
        if L = nil then 0
        else treeLeaves(L.head) + treelistLeaves(L.tail)
        endif

Scope Rules

A let expression declares local variables whose scope is only the body of the let. Types declared in a let expression are valid only in that let. Similarly, functions declared in a let expression are valid only in that let.

The scope of parameters are the body of the function.

Nested scopes are permitted. Inner scopes can access the variables in outer scopes. Inner scopes can shadow variables in outer scopes.

Hobbes uses two namespaces, one for variables and functions, the other for types. Thus it is possible to use the same identifier in the same scope as both a type and a variable name.

L-Values

Variables, procedure parameters, fields of records, and elements of arrays are all l-values (i.e., locations whose value can be read or assigned).

lvalue --> id
--> lvalue . id
--> lvalue [ exp ]

The meanings of these productions are fairly straightforward; the exp of the array-element access must be an integer.

Expressions

An l-value, when used as an expression, evaluates to the contents of its location.

Some expressions do not evaluate to a value: procedure calls, assignment, if-then (lacking an else), while, break, and sometimes if-then-else.

The value nil is Hobbes's null pointer. It is legal to sent any variable of any record type to nil. It is illegal (of course) to access a field of a variable set to nil.

A sequence of expressions can be specified with parentheses:

( exp ; exp ; ... exp )

The expressions are evaluated in the order listed, and the value of this sequence is the value of the very last expression (if any).

The token sequence () represents "no value". Similarly, a let expression without any content results in "no value".

An integer constant is a sequence of decimal digits. Unary minus can be used to negate an integer constant.

A string literal is a double-quote delimited sequence of zero or more printable characters, whitespace, and escape characters. Hobbes recognizes these escape characters (with their standard meanings): \n, \t, \", \\.

A function is invoked with arguments to fill the parameters. Arguments are matched to parameters in the order listed, left-to-right.

Arithmetic is done with the +, -, *, and / operators. They take integer arguments and return an integer.

Comparisons are done with =, <>, <, >, <=, >=. They compare their operands appropriately, and return 1 if true, 0 if false. All of these operators apply to integers and strings; records and arrays can be compared only with the equals and not-equals operators, and they compare only pointer equality.

Boolean operators & and | short-circuit their computation. Any non-zero value is taken as "true"; 0 is taken as "false".

Precedence (from highest to lowest):

  • Parentheses
  • Unary minus
  • *, /
  • +, -
  • =, <>, <, >, <=, >=
  • &, |
  • :=

The arithemtic and boolean operators are left associative. The comparison operators do not associate.

A record is created with the new keyword:

new typeId {}
new typeId { id=exp ( , id=exp )* }

Arrays too:

new typeId [ exp ] of exp

The first expression is of type integer and is the length of the array; the second expression is the initial value for each element of the array. Arrays are indexed 0 up to (but not including) the length of the array. In case of both arrays and records, typeId must be a valid array or record type, respectively.

When records or arrays (not their contents) are involved in an assignment, the pointer to the record or array is assigned to the destination. This applies also to parameter passing.

Records and arrays have inifinite extent---their lifetime extends beyond the code that created them.

An assignment statement has this form: lvalue := exp. It evalues to "no value"

The if-then-else statement comes in two forms:

if exp then exp else exp endif
if exp then exp endif

Semantically the statement works as you would expect. (Remember that 0 is "false"; non-zero is "true".) The then and else expressions are evaluated only if necessary. The second form of the statement produces "no value".

A while loop looks like this:

while exp do exp done

As long as the first expression evaluated to a non-zero value, the body is executed.

A for loop looks like this:

for id := exp to exp do exp done

The id variable is implicitly declared (as an integer variable) for this statement; its scope extends only to this loop. The upper and lower bounds are evaluated exactly once, prior to entering the body of the loop. The body is never executed if the upper bound is less than the lower bound.

The break statement, break, terminates the nearest enclosing while or for expression. A break statement is invalid if it does not syntactically appear within a loop.

A let expression declares variables for a sequence of expressions:

let decs in expSeq end

There are zero or more expressions in the expression sequence, separated by semi-colons. There will be at least one declaration (otherwise, why use the let expression?). The result of the let expression (if any) is the value of the last expression in the sequence.

Standard Library

function print(s: string) prints s to the standard output.

function flush() flushes standard output.

function getchar() : string reads a single character from standard input; returns an empty string on end of file.

function ord(s: string) : int returns the ASCII value of the first character of s.

function char(i: int) : string returns a single-character string containing the specified ASCII character.

function size(s: string) : int returns the length of s.

function substring(s: string, first: int, n: int) : string returns the n-character substring of character of s starting at index first (zero-based).

function concat(s1: string, s2: string) : string return the concatenation of s1 and s2 (without changing s1 or s2).

function not(i : integer) : integer returns i = 0.

function exit(i: int) terminates the program with code i.