CS 160 - Project 4

Project 4

Abstract Syntax Tree

This project is designed to allow you to become familiar with how an Abstract Syntax Tree (AST) is built. You will build ASTs for programs in our language using syntax-directed translation with Bison actions. ASTs are a form of intermediate representation which are extremely useful in the compilation process. They abstract the program and remove useless information, while still preserving the complete and exact program structure.

Specification and Description Changelog

A list of all the changes made to this project description is below:

No changes so far.

Files

You can find the provided code files required for the project in p4.zip.

All files present in the Project 3 code file return in this project. You will need to copy your lexer and parser definitions from your completed Project 3 to the new lexer.l and parser.y files, which have been updated since the last version.

The main.cpp and Makefile files have also been updated to support the AST code and the print visitor. There is one new file called genast.py, which is a script that generates the AST and print visitor code. You can run it by typing make genast, and it will be run any time a normal make is run.

The expected output for all provided test cases is included as the output.txt file.

Description

This project has two main steps: first understand the AST classes that you will use to build the AST, then write Bison actions to actually build the AST.

Step 1: Understand AST Classes

First run make genast to generate the AST classes without building the project. Then look at the file ast.hpp, which is the header file that declares all the AST classes as well as other important information that will allow you to complete the project. The AST class forward declarations and the YYSTYPE union are near the top of the file, and the actual AST class declarations (including the constructor prototypes) are in the middle of the file. The file ast.cpp provides the implementation of all the functions, and you shouldn't have to examine this file too closely (however, you may find it interesting!).

You will need to determine the appropriate position in the AST for each node type, as well as what part of the language that node represents. An example is the PlusNode AST node class, which will represent an addition expression. You will also want to note what kinds of children each node has, as well as the ordering of the children in the constructor. Every attempt was made to keep the ordering of the children as similar to the language description as possible.

Also note that there are some abstract types of AST nodes. An example of this is the ExpressionNode type: all concrete expression AST nodes (like PlusNode, MinusNode, etc.) will inherit from this type. This allows us to expect an expression node in the appropriate place in the grammar, but to actually construct the appropriate concrete node. The other abstract nodes (apart from Expression) are statement and type (there are multiple kinds of statements and multiple kinds of types).
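As a rough illustration of this inheritance relationship, the sketch below defines simplified stand-ins for ExpressionNode, PlusNode, and MinusNode. The LeafNode class, the describe() method, and the member names are assumptions made for this example only; the generated classes in ast.hpp will differ in detail.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the generated AST classes. The real declarations
// live in ast.hpp; LeafNode and describe() are hypothetical, for illustration.
struct ExpressionNode {
    virtual ~ExpressionNode() = default;
    virtual std::string describe() const = 0;
};

// Hypothetical leaf expression holding an integer value.
struct LeafNode : ExpressionNode {
    int value;
    explicit LeafNode(int v) : value(v) {}
    std::string describe() const override { return std::to_string(value); }
};

// A concrete binary expression accepts any ExpressionNode as a child.
struct PlusNode : ExpressionNode {
    ExpressionNode* left;
    ExpressionNode* right;
    PlusNode(ExpressionNode* l, ExpressionNode* r) : left(l), right(r) {
        assert(l != nullptr && r != nullptr);  // both operands are required
    }
    ~PlusNode() override { delete left; delete right; }
    std::string describe() const override {
        return "(" + left->describe() + " + " + right->describe() + ")";
    }
};

struct MinusNode : ExpressionNode {
    ExpressionNode* left;
    ExpressionNode* right;
    MinusNode(ExpressionNode* l, ExpressionNode* r) : left(l), right(r) {
        assert(l != nullptr && r != nullptr);  // both operands are required
    }
    ~MinusNode() override { delete left; delete right; }
    std::string describe() const override {
        return "(" + left->describe() + " - " + right->describe() + ")";
    }
};

// Builds the tree for 1 + (5 - 2); the caller owns the returned node.
ExpressionNode* build_expression_example() {
    return new PlusNode(new LeafNode(1),
                        new MinusNode(new LeafNode(5), new LeafNode(2)));
}
```

Because both children are declared as ExpressionNode pointers, a PlusNode can hold a MinusNode (or any other concrete expression) without knowing its exact type.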

The last part of the header file that will be useful to you is the YYSTYPE union definition. This is the type of all Bison pseudo-variables in our parser, which will be important when writing %type specifiers and Bison actions (this will be discussed in the next section of this description).

Step 2: Write Appropriate Bison Actions

The next step is to actually implement the Bison actions that will be used in syntax-directed translation to build the AST while parsing. Each grammar production rule will need an appropriate action, which will construct some part of the AST. Actions may need to get information from child production rules (rules called by the current rule) and will need to return information to the parent rule.

The AST will be built from the bottom upwards. This means that the leaves of the tree will be constructed first, and then used by the parent rules and combined into a higher-level node. This will continue all the way to the root of the tree, which will be created by the starting rule of the grammar. This explains the necessity of getting information from child rules and returning information to the parent rule.

To pass this information, we will use Bison pseudo-variables. These are special variables that represent parts of a production rule in the grammar. Specifically, the $$ variable represents the left-hand side of the rule, and the $1 through $N variables represent the symbols on the right-hand side of the rule, in order. An example is shown below:

 E :    E    +    E
$$     $1   $2   $3

Information will be "returned" to the parent rule by setting the $$ variable, which can then be accessed using the appropriate $N variable in the parent rule (the child rule will appear somewhere on the right-hand side of the parent rule). Similarly, information will be "retrieved" from the child rules using the $1 through $N variables. Therefore a sample action for addition expressions might look like:

 E : E + E     { $$ = new PlusNode($1, $3); }

Some things to note about the above example are the use of the new operator to create a new AST node, the assignment of a pointer to the new node to $$, and the fact that $2 is not used anywhere in the action. Most terminal tokens will not be used in actions, except identifier and integer tokens.
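To make the bottom-up order concrete, here is a minimal sketch of how a parser's semantic value stack behaves when the addition rule is reduced. It is a simplified model, not Bison's actual generated code (which also tracks parser states and lookahead). The Node, ValueStack, show, and parse_sum_example names are hypothetical, and a single generic Node type stands in for the real AST classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical generic node standing in for the real AST classes.
struct Node {
    std::string label;
    Node* left;
    Node* right;
    Node(const std::string& l, Node* a = nullptr, Node* b = nullptr)
        : label(l), left(a), right(b) {}
    ~Node() { delete left; delete right; }
};

// Renders a node as text so the resulting tree shape can be inspected.
std::string show(const Node* n) {
    if (n->left == nullptr) return n->label;
    return n->label + "(" + show(n->left) + ", " + show(n->right) + ")";
}

// Simplified model of the semantic value stack.
struct ValueStack {
    std::vector<Node*> slots;

    // Shifting a token pushes its semantic value (nullptr for '+').
    void shift(Node* v) { slots.push_back(v); }

    // Models the reduction  E : E + E  { $$ = new PlusNode($1, $3); }
    void reduce_plus() {
        assert(slots.size() >= 3);
        std::size_t n = slots.size();
        Node* d1 = slots[n - 3];               // $1: left operand
        Node* d3 = slots[n - 1];               // $3: right operand ($2 is '+', unused)
        Node* result = new Node("Plus", d1, d3);  // $$
        slots.resize(n - 3);                   // pop the right-hand side
        slots.push_back(result);               // push $$ for the parent rule
    }
};

// Models parsing "1 + 2 + 3" as (1 + 2) + 3; the caller owns the root.
Node* parse_sum_example() {
    ValueStack s;
    s.shift(new Node("1"));
    s.shift(nullptr);
    s.shift(new Node("2"));
    s.reduce_plus();          // the inner node Plus(1, 2) is built first
    s.shift(nullptr);
    s.shift(new Node("3"));
    s.reduce_plus();          // the parent combines it with the leaf 3
    return s.slots.back();
}
```

The inner Plus node exists before the outer one is created, which is why each action can rely on $1 through $N already holding finished subtrees.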

Some AST nodes have special children which are more than just a pointer to another node. The two cases for these special children are optional children, which are children which may or may not be present in the program (like super class names) and may be set to NULL, and list children, which represent lists of nodes (for example: lists of statements inside a method). For optional children, you will set the child to point to an appropriate node if the corresponding part of the program is present or you will set the child to NULL if it is not present. Our visit children functions only visit an optional child if it is non-null. The three optional children in the AST are the second identifiers for both Class nodes and MethodCall nodes, and the return statement for MethodBody nodes.
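The handling of optional children can be sketched roughly as follows. The ClassNode and IdentifierNode definitions here are simplified stand-ins, and visit_children is a hypothetical helper that only mirrors the idea described above; the generated code in ast.hpp and ast.cpp is organized differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the generated node classes.
struct IdentifierNode {
    std::string name;
    explicit IdentifierNode(const std::string& n) : name(n) {}
};

struct ClassNode {
    IdentifierNode* name;        // required child
    IdentifierNode* superclass;  // optional child: null when absent
    ClassNode(IdentifierNode* n, IdentifierNode* s) : name(n), superclass(s) {
        assert(n != nullptr);    // only the optional child may be null
    }
    ~ClassNode() { delete name; delete superclass; }
};

// Mirrors the "visit children" idea: required children are always
// visited, optional children only when they are non-null.
std::vector<std::string> visit_children(const ClassNode& c) {
    std::vector<std::string> visited;
    visited.push_back(c.name->name);
    if (c.superclass != nullptr) {
        visited.push_back(c.superclass->name);
    }
    return visited;
}
```

In a Bison action, the version of a rule without the optional part would pass NULL in the corresponding constructor argument, and the version with it would pass the child's pseudo-variable.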

List children are implemented using a pointer to a std::list container. These children will appear when you have a recursive rule which can build any number of a given node. Two examples are the lists of statements inside methods (which may be zero or more statements) and the list of classes in the program (which may be one or more classes). The most intuitive way to build the lists necessary for these recursive rules is to create a new list in the terminating rule and then add a new element to the list in each action for the recursive rule. If the terminating rule is an epsilon rule, it constructs only a new, empty list; if it is non-epsilon, it constructs the list and adds its element. An example of building these lists might look like:

S : S , n  { $$ = $1; $$->push_back($3); }
  | n      { $$ = new std::list<Number*>(); $$->push_back($1); }

This constructs a list of nodes of the type Number, using push_back to add each element to the end of the list when the recursive rule's action is executed. When the parsing of a list is completed, the full list will be available to whichever rule called the recursive rule. The list can then be used as a child of an AST node.
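The order in which these two actions run can be traced with a small C++ sketch. The functions start_list, extend_list, and parse_number_list are hypothetical helpers that mirror the bodies of the two actions above, and Number is a minimal stand-in for the node type in the example grammar.

```cpp
#include <cassert>
#include <list>

// Minimal stand-in for the Number node used in the example grammar.
struct Number {
    int value;
    explicit Number(int v) : value(v) {}
};

typedef std::list<Number*> NumberList;

// Mirrors:  S : n      { $$ = new std::list<Number*>(); $$->push_back($1); }
NumberList* start_list(Number* first) {
    NumberList* result = new NumberList();
    result->push_back(first);
    return result;
}

// Mirrors:  S : S , n  { $$ = $1; $$->push_back($3); }
NumberList* extend_list(NumberList* so_far, Number* next) {
    assert(so_far != nullptr);  // the terminating rule must have run first
    so_far->push_back(next);
    return so_far;
}

// Order in which the actions fire for the input "1 , 2 , 3".
// The caller owns the list and the nodes in it.
NumberList* parse_number_list() {
    NumberList* s = start_list(new Number(1));  // terminating rule reduces first
    s = extend_list(s, new Number(2));          // then the recursive rule, once
    s = extend_list(s, new Number(3));          // per additional element
    return s;
}
```

Because the rule is left-recursive, the terminating alternative is reduced first, so the same list pointer is passed upward and elements end up in source order.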

All of the pseudo-variables in the parser are of the type YYSTYPE. To allow us to pass many different types of nodes (or lists) around the parse tree, we have defined this YYSTYPE as a union of pointers to all possible types of AST nodes and all possible lists of AST nodes. You can find this union definition near the top of the ast.hpp file. When using the pseudo-variables, you may use any member of the union for a given pseudo-variable. This means that you can pass a pointer to an expression node when using some pseudo-variables and pass a pointer to a list of statements for some other pseudo-variables.

To specify which member of the union is appropriate for each non-terminal or terminal symbol in the grammar, you will use %type specifiers in Bison. A %type specifier tells Bison that every pseudo-variable that corresponds to that non-terminal or terminal will use the given member of the union. An example of this is:

%type <declaration_list_ptr> Members Declarations

This is a specifier which tells Bison that we are going to use the declaration_list_ptr union member for pseudo-variables for the Members and Declarations non-terminal symbols. We can then directly refer to those two non-terminals without specifying which member of the union we would like to access. This means that instead of writing an action looking like:

{ $$.declaration_list_ptr = $1.declaration_list_ptr; }

We can write a much simpler action looking like:

{ $$ = $1; }
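The effect of a %type specifier can be pictured with a cut-down union in plain C++. SketchValue below is a hypothetical, much smaller version of the real YYSTYPE in ast.hpp; the expression_ptr member name and the two placeholder node structs are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

struct DeclarationNode {};  // hypothetical placeholder node
struct ExpressionNode {};   // hypothetical placeholder node

// Cut-down sketch of the union. The real YYSTYPE has one member for
// each node type and each list type; expression_ptr is an assumed name.
union SketchValue {
    std::list<DeclarationNode*>* declaration_list_ptr;
    ExpressionNode* expression_ptr;
    char* base_char_ptr;
    int base_int;
};

// With %type <declaration_list_ptr> in effect, the action { $$ = $1; }
// behaves like this explicit member-to-member assignment.
void copy_declaration_list(SketchValue& lhs, const SketchValue& rhs1) {
    lhs.declaration_list_ptr = rhs1.declaration_list_ptr;
}

// Typed access, as an action sees the value once the member is selected.
std::size_t declaration_count(const SketchValue& v) {
    assert(v.declaration_list_ptr != nullptr);
    return v.declaration_list_ptr->size();
}
```

All members share the same storage, so a value must be read through the same member it was written through; declaring the member once with %type keeps every action for that symbol consistent.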

We strongly recommend that you specify types for all non-terminal symbols as well as both terminal tokens that will have attached information (identifiers and integers). This will make the project much easier to write and will provide type checking as you write it, which will allow you to catch errors much more quickly.

The last step is to get the names of identifiers and values of integers from the lexer to the parser. This will be done using the yylval special variable inside specific Flex actions. Setting yylval will set the corresponding Bison pseudo-variable for that token in the parser. In this way, we can get the information from the lexer and use it in the parser. The yylval variable is also of the type YYSTYPE, so it can be used to pass any type listed as a member of the union. However, the %type specifiers you used for Bison do not apply to the yylval variable in the scanner. Therefore you must explicitly specify which member of the union you would like to use each time you assign yylval. You can, however, apply the %type specifier to tokens on the parser side. The union has two members which are C base types, which you can use to pass the names of identifiers and the values of integers. They are the last two members of the union, called base_char_ptr (represents a character pointer, or C string) and base_int (represents an int).
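What the two Flex actions need to do can be sketched in plain C++ as below. SketchValue, sketch_yylval, on_identifier, and on_integer are hypothetical stand-ins for the real union, the scanner's yylval, and the bodies of your identifier and integer actions. The sketch copies the matched text because Flex reuses the buffer behind yytext for the next token.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Cut-down sketch holding only the union's two base-type members.
union SketchValue {
    char* base_char_ptr;
    int base_int;
};

// Stand-in for the scanner's yylval variable.
SketchValue sketch_yylval;

// Models an identifier action: copy the text, then store the copy.
// yytext points into the scanner's buffer, which the next match
// overwrites, so storing the raw pointer would not be safe.
void on_identifier(const char* yytext) {
    assert(yytext != nullptr);
    char* copy = static_cast<char*>(std::malloc(std::strlen(yytext) + 1));
    assert(copy != nullptr);
    std::strcpy(copy, yytext);
    sketch_yylval.base_char_ptr = copy;  // the member is named explicitly
}

// Models an integer action: convert the text and store the value.
void on_integer(const char* yytext) {
    assert(yytext != nullptr);
    sketch_yylval.base_int = std::atoi(yytext);
}
```

In the actual lexer.l, the same assignments would appear inside the rule's action, before the return statement for the token.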

Remember to set the root of the AST, which is called astRoot and is an extern variable that is a pointer to an AST node. This will need to be set to the root of the tree once the building of the tree is finished. It should only be set to point to a ProgramNode, as the root of every tree must be of that type.

Project Tips!

A few tips that may help you while completing the project are:

Specify the type of every non-terminal with a %type specifier.
Specify the type of identifier and integer tokens with %type as well.
Set yylval before you return a token in Flex actions, otherwise the code used to set it will never be executed.
It may be easiest to start writing the actions for the bottom-most rules in the grammar, then move up to the higher-level rules. This also allows incremental running and debugging, as everything below the current level will have already been written.
You can cause your program to print out a portion of the tree by setting the astRoot to the highest-level node that you have constructed. For example, you might set the astRoot pointer to point at a single method to allow you to check that part of the tree before you fully implement the action for classes. This allows for a lot more incremental debugging.
Remember to use the debugger when you get a segmentation or other memory fault. You can refer to Project 1 for more information about this.
You may still make changes or fixes to your grammar since the version you turned in for Project 3. Make sure to fix all your shift/reduce or reduce/reduce conflicts if you haven't already.

Resources

Some resources which you may find extremely useful for this project (more may be added as the project progresses):

Bison Manual
Specific sections of the above manual, linked in the project description section
The Wikipedia page for Abstract Syntax Trees
The Wikipedia page for Syntax-Directed Translation

Requirements

To obtain full credit for this project, your solution will need to:

Satisfy all Project 3 requirements.
Construct a complete AST for all valid input programs.
Execute on any input without segmentation faults or any kind of crashes.

We highly recommend verifying your solution's output against the provided test cases. We have provided the expected output for all tests in a file called tests-output.txt. More information about how to do this can be found in the Grading section.

Make sure that you compile, run, and test your program on the CSIL server, especially if you write your program on your own machine. You may have a newer version of Flex and/or Bison installed than the versions on CSIL, so it is extremely important to make sure that all of the above requirements are satisfied when building on CSIL.

Deliverables

You should submit the files lexer.l and parser.y to Gradescope, with complete definitions of the scanner and parser and the Bison actions that will build the complete AST.

You also must include a README file which includes your name, perm number, email address, and any issues with your solution and explanations.

You do not need to submit the test cases, makefile, or the unmodified source files.

Grading

Your grade will be based on the proportion of test cases for which your program produces correct output. We have provided some test cases and their expected output along with the code files.

All test cases should parse successfully with no errors and generate and print a complete AST. The first few test cases are designed to be smaller and more human-readable, which should aid in debugging your AST generation.

To run your solution against all the test cases, you can type make run. This will run all test cases in the tests folder. You can compare this output against the expected output by typing make diff.