Programming Assignment 3 (Due on Feb 18)

In this assignment you will build a parser - a program that reads a text file and constructs a syntax tree. You will write a top-down parser from scratch and use a program that writes a bottom-up parser for you based on a context-free grammar you provide.

Files

As with the previous assignment, there is a directory hierarchy to the files; please preserve this hierarchy when you download and work with the files.

Problem 1

The directory p1/ contains the framework for a top-down (recursive descent) parser for a Nocaf program. Your task is to extend this program to parse all of the Decaf grammar. The two grammars are given below in a notation called EBNF (Extended Backus-Naur Form). The brackets [ ] are used for optional constructs - that is [ X ] means ( X | <empty> ) in standard notation. Parentheses are used for grouping, and the * is the usual Kleene star.

Nocaf grammar

ClassDecl -> class Id "{" MethodDecl* "}" <EOF>

MethodDecl -> static Type Id "(" FormalParams ")" Block

Type -> int

FormalParams -> <empty>

Block -> "{" Statement* "}"

Statement -> <empty>

Decaf grammar

High-level program structures

ClassDecl -> class Id "{" MethodDecl* "}" <EOF>

MethodDecl -> static Type Id "(" FormalParams ")" Block

Type -> int  
      | boolean

(No void methods because we have no mechanism for side-effects.)

FormalParams -> [FormalParam ("," FormalParam)*]

FormalParam -> Type Id

Statements

Block -> "{" Statement* "}"

Statement -> Block
           | LocalVarDecl  // LL(2) to distinguish from AssignStmt
           | AssignStmt    // if non-reserved type names are allowed.
           | IfStmt
           | WhileStmt
           | ReturnStmt
           
LocalVarDecl -> Type Id ";"

AssignStmt -> Id "=" Expression ";"

IfStmt -> if "(" Expression ")" Statement [else Statement]

(Your parser must disambiguate the dangling else!)

WhileStmt -> while "(" Expression ")" Statement

ReturnStmt -> return Expression ";"

Expressions

Expression -> ConditionalAndExpr ("||" ConditionalAndExpr)*

ConditionalAndExpr -> EqualityExpr ("&&" EqualityExpr)*

EqualityExpr -> AdditiveExpr (("=="|"!=") AdditiveExpr)*

AdditiveExpr -> MultiplicativeExpr (("+"|"-") MultiplicativeExpr)*

MultiplicativeExpr -> PrimaryExpr (("*"|"/"|"%") PrimaryExpr)*

(Note no UnaryExpr -- we don't have unary minus.)

PrimaryExpr -> Num  // Integer literal
             | Bool    // Boolean literal
             | Id
             | CallExpr   // LL(2) to distinguish from plain Identifier.
             | "(" Expression ")"

CallExpr -> Id "(" ActualParams ")"

ActualParams -> [Expression ("," Expression)*]

Getting Started

Download the files and issue the following commands at the Unix prompt:

% cd p1
% javac Lexer.java
% javac Nodes.java
% javac Main.java
% java Main ../Empty.nocaf

You should get the following output:

ClassDecl
| Id "empty"
| MethodDecls
| | MethodDecl
| | | Id "int"
| | | Id "nothing"
| | | FormalParams
| | | | null
| | | Block
| | | | Statements
| | | | | null
| | null

The program has digested the input Nocaf file, built a syntax tree, and printed out its structure. Most of the architecture you will need to finish the parser is already present in the Nocaf parser in Parser.java. You will have to understand the organization of the parser and reuse some of its ideas. What follows is a general overview of your starting-point code.

`Lexer.java`

This file contains a fully-working lexer for Decaf - that is, it's a correct solution to the previous homework. You won't have to change the code in this class. It does have some added features:

There are added classes Location and Range to keep track of the position of tokens and higher-level structures in the text you are parsing. This is very important to make error messages usable. You will probably see your share of them.
The Token class now defines a bunch of integer constants which it uses to identify its type, rather than the strings we used in the previous assignment. It also defines an array of spellings. These are used in the finishWord() and finishSymbol() methods to make them more efficient.

`Nodes.java`

This file contains a full set of classes to describe nodes in a Decaf syntax tree. These are similar to the ones you used in Problem 1 of the first assignment. You also won't have to change the code in this class. Some added features are:

There is an inheritance hierarchy of these nodes. The basic class Node provides a method for returning the name of an instance's class, which is used in the toString() methods, instead of hard-wiring the names as was done in the previous assignment. There are subclasses of the basic class for expressions and statements:


Node
  ClassDecl
  NodeList
    MethodDecls
    FormalParams
    ActualParams
    Statements
  Listable
    MethodDecl
    FormalParam
    Statement
      Block
      LocalVarDecl
      AssignStmt
      IfStmt
      WhileStmt
      ReturnStmt
    Expression
      Id
      Num
      Bool
      ArithmeticExpr
      ConditionalExpr
      EqualityExpr
      CallExpr

To make lists easier to manage, there is a special list node type, which has some of the functionality of the tree code in Problem 3 of the first assignment.
There is a separate utility class called Indenter. It takes the list-format representation returned by a node's toString() method and converts it to a more readable indented form.

`Main.java`

This is where it all starts. The main() method creates a Scanner and Lexer, as in the previous assignment. Then it creates a Parser, and gets from it a node that represents the text it parses. You might want to uncomment some of the commented-out code in this file during debugging.

`Parser.java`

This is where all the action is. At the bottom of the file you see the method parseClassDecl() that gets called first.

The code for the parser uses dependency-ordering, in which (as far as possible) code at a given position in the file depends only on code that precedes it. This means the high-level routines occur at the bottom, in the reverse order from the way the structures are presented in the grammar. If this drives you crazy, go ahead and reorder the grammar or the code.

From the Decaf syntax

ClassDecl -> class Id "{" MethodDecl* "}" <EOF>

you see that a class declaration consists of the class keyword, a name, a left brace, a sequence of method declarations, a right brace, and an end-of-file. The code in the method parseClassDecl() reflects that structure quite closely:

  ClassDecl parseClassDecl()
    {
    match(Token.CLASS);
    Id name = parseIdentifier();
    match(Token.LBRACE);
    MethodDecls mds = parseMethodDecls();
    match(Token.RBRACE);
    match(Token.EOT);
    return new ClassDecl(name, mds);
    }

In fact, it's quite easy to translate from a grammar to a parser - in the next problem we will automate this.

The code provided also has an example of how to parse sequences and put them into a list node. You have to add the code to parse statements and expressions in Decaf. The expressions may be the most difficult.

There are also lines, commented out in main(), that create and use a Recognizer instead of a Parser. This is just there to help you proceed in manageable steps. The recognizer code doesn't build a syntax tree so the flow of its control is easier to follow. It's not a requirement, but you might find it useful to first write the recognizer for Decaf, and then add to it the tree-building code.

Problem 2

The next problem is to create a parser for the same language, but using the automated tool JavaCUP and the LR grammar provided below. The source code (in Java) and documentation for this tool is available at http://www.cs.princeton.edu/~appel/modern/java/CUP/. The CS-160 home directory has a pre-compiled version of this tool.

Decaf LALR grammar

High-level program structures

ClassDecl ::= class Id "{" MethodDecls "}" 

MethodDecls ::= MethodDecls MethodDecl
             |  /* empty */

MethodDecl ::= static Type Id "(" FormalParams ")" Block

FormalParams ::= ProperFormalParams
              | /* empty */ 

ProperFormalParams ::= ProperFormalParams "," FormalParam
                    |  FormalParam
   
FormalParam ::= Type Id
      
Type ::= int
      |  boolean

Statements

Block ::= "{" Statements "}"

Statements ::= Statements Statement
            |  /* empty */

Statement ::= Block
           |  LocalVarDecl
           |  AssignStmt
           |  IfStmt
           |  WhileStmt
           |  ReturnStmt

LocalVarDecl ::= Type Id ";"

AssignStmt ::= Id "=" Expression ";"

IfStmt ::= if "(" Expression ")" Statement
        |  if "(" Expression ")" Statement else Statement

(Your parser must disambiguate the dangling else!)

WhileStmt ::= while "(" Expression ")" Statement

ReturnStmt ::= return Expression ";"

Expressions

Expression ::= ConditionalAndExpr
            |  Expression "||" ConditionalAndExpr

ConditionalAndExpr ::= EqualityExpr
                    |  ConditionalAndExpr "&&" EqualityExpr

EqualityExpr ::= AdditiveExpr
              |  EqualityExpr "==" AdditiveExpr
              |  EqualityExpr "!=" AdditiveExpr

AdditiveExpr ::= MultiplicativeExpr
              |  AdditiveExpr "+" MultiplicativeExpr
              |  AdditiveExpr "-" MultiplicativeExpr

MultiplicativeExpr ::= PrimaryExpr
                    |  MultiplicativeExpr "*" PrimaryExpr
                    |  MultiplicativeExpr "/" PrimaryExpr
                    |  MultiplicativeExpr "%" PrimaryExpr

PrimaryExpr ::= Num
             |  Bool
             |  Id
             |  CallExpr
             |  "(" Expression ")"

CallExpr ::= Id "(" ActualParams ")"

ActualParams ::= ProperActualParams
              | /* empty */ 

ProperActualParams ::= ProperActualParams "," Expression
                    |  Expression

Getting Started

Download the files and issue the following commands at the Unix prompt:

% setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
% java java_cup.Main -parser Parser -symbols Sym < P2.cup
  [You should get 17 warnings, but no errors at this step.]
% java JLex.Main Decaf.jlx
% javac Decaf.jlx.java
% javac Nodes.java
% javac Main.java
% java Main ../Empty.nocaf

You should get the same syntax tree output you did at the beginning of Problem 1. The order you do these things in is important. Here's what's being done:

java java_cup.Main... The compiler-generating tool JavaCUP is reading the specification file p2.cup and writing the code for the parser, which it puts in a file called Parser.java because of the command line directive -parser. It also creates a class called Sym that defines the integer constants used to describe tokens. You have to do this before generating and compiling your lexer. You get the warnings because the specification file describes the entire Decaf lexicon, but it only compiles Nocaf.
Note that other parts of this project depend on the names of the parser and symbol classes being Parser and Sym respectively. It would be nice if we could put those directives in the p2.cup file, as we do with the %-directives in the Decaf.jlx file. Since we can't, you have to be careful to use those command line options each time you run JavaCUP.
java JLex.Main Decaf.jlx - Now you create the Lexer's code.
javac Decaf.jlx.java - Compile the lexer. This automatically compiles Sym.java also.
javac Nodes.java - Since this file contains classes needed by the parser, and they are not split up into separate files, one per class, with the file and class names matching, the Java compiler will not find them when it compiles the parser. So you have to do it first. This is the identical file to the one you used in the first problem; it's just duplicated here for your convenience.
javac Main.java - Compile the main program. This automatically compiles the parser also.

The documentation for JavaCUP is pretty good; here's a brief guide to what you will find already set up.

`P2.cup`

This file starts with a few directives containing Java code that is to be incorporated verbatim into the generated parser. We tell it to add a private member of type Lexer to its parser, and to call that lexer's next() method when it wants a new token. The parser itself doesn't want to know about the lexer's token type; it wants an instance of its own type Symbol. You have to do some poking around in the JavaCUP source to find out what this is, but it's a simple structure with integer fields for type and range, and a generic object field which we use for the spellings:

class Symbol
  {
  public int sym, left, right;
  public Object value;
  }

So we create a new instance of this class and pass into its constructor the fields from our token. You won't have to change any of this code.

Next you tell the parser-generator what kinds of terminal symbols you want. These names are put into the file Sym.java as constants, and your lexer will use them. You also won't have to change these - they already specify the full Decaf lexicon.

Next comes the list of non-terminals, that is, left-hand sides of productions from the Decaf grammar. You will certainly have to add to this list. Notice that listing them here is just to make the JavaCUP's own parser simpler - it could instead make an extra pass to gather this information.

And finally, the grammar itself, put in proper LR CFG form. With each production are associated some variables and some Java code. For example, look at the highest-level production

ClassDecl ::= CLASS ID:name LBRACE MethodDecls:mds RBRACE 
                {: RESULT = new ClassDecl(new Id(name), (MethodDecls)mds); :};

We describe the syntax of the class declaration, and give names to some of its members. Those names represent results of things already parsed. For example, elsewhere in the grammar there is a production rule for MethodDecls which assigns a MethodDecls node to the variable RESULT. The variable mds gets that value when we get to the production shown here. The variable name represents the String containing the name of the class; it's a String because of the directive

terminal String NUM, ID;

earlier in the file. The compiler created by JavaCUP knows this, and you can use the variable name directly. JavaCUP does not know that mds is an instance of MethodDecls so you have to cast it before using it.

The value we assign to the RESULT value (predefined by JavaCUP) gets assigned to the non-terminal that appears on the left-hand side of the production (in this case, ClassDecl). It will be available in other production rules in which ClassDecl appears.

The grammar provided gives examples of how to parse simple structures, and how to create lists. You have to extend it to parse all of Decaf. You will be adding more productions to P2.cup and recompiling using the following steps:

% java java_cup.Main -parser Parser -symbols Sym < P2.cup
% javac Parser.java
% java Main <your source file>

The syntax error messages provided by JavaCUP don't have line numbers, only character numbers counting from the start. If you don't use a text editor that can locate text by its character number, it could be very inconvenient to track down the offending text. A utility program called findit is provided for you. Here is an example of using it to find what the problem is at character 273 of a file called test.decaf:

% findit test.decaf 273
Line 22: static ant f()
                ^