CS 160 - Project 3

Project 3

Flex and Bison: Scanner and Parser Generators

The purpose of this project is to introduce scanner and parser generators, and to build familiarity with regular expressions and bottom-up parsing. You will use the Flex scanner generator and the Bison parser generator.

Specification and Description Changelog

A list of all the changes made to this project description is below:

No changes yet.

Files

You can find the provided code files required for the project in p3.zip.

The scanner definition is contained in the lexer.l file and the parser definition is contained in the parser.y file. The provided files also include a Makefile, which builds the project and can run all provided test cases, and a main.cpp file, which is the entry point to the program and simply calls the parser.

Description

This project consists of writing a specification for a scanner and a bottom-up parser for a given programming language. The specifications will then be used by the scanner generator Flex and parser generator Bison to automatically generate a C scanner and parser.

The language that will be parsed is a simplified object-oriented programming language created for this course. The language has features such as classes, methods, members, statements, and expressions. The full specification of the language is given below in the Language section of this page.

Lexer Specification

You will need to write a Flex specification for all the tokens in the language, which will be used by the parser. A Flex specification is made up of rules. Each rule pairs a regular expression to match with a block of C code that should return a token for that expression. An example Flex rule is:

[ab]*[c-e]        { return T_ABCDETOKEN; }

This rule will match zero or more a or b characters followed by a single c, d, or e character. It will return the token called T_ABCDETOKEN. Possible patterns (regular expressions) available for use in Flex are described in the Flex manual Patterns page.

As our language has many kinds of tokens, you will need to write many Flex rules to match them all. Each rule should return an appropriate token to be used by the parser. The specification should also ignore and discard comments without returning any token or taking any action in the parser. Finally, it should catch any invalid characters and output an "invalid character" error. The provided lexer.l file already contains a rule which matches any character and outputs the appropriate error. All you need to ensure is that this rule is only matched as a last resort (all other rules should be preferred).

Flex will always match the longest possible substring. For example, given the input abc, it matches the rule [a-z]* once against all three characters rather than matching [a-z] three times. When multiple rules have the same match length, Flex always prefers the rule listed first in the specification. Therefore, list more specific rules (such as keywords) before more general rules (such as identifiers) that can match the same text. If a given rule can never be matched, Flex will output a warning when you build.
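The matching policy described above can be sketched outside of Flex. The following Python simulation is only an illustration of "longest match, then first rule listed"; it is not how Flex is implemented, and the rule list and token names (T_IF, T_ID, T_WS) are hypothetical choices made for this example.

```python
import re

# Hypothetical rules in priority order: (token name, regular expression).
RULES = [
    ("T_IF", r"if"),
    ("T_ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("T_WS", r"[ \t\n]+"),
]


def next_token(text, pos):
    """Return (token, lexeme) for the longest match at pos.

    The strict '>' comparison means that on a tie in length the
    rule listed first is kept.
    """
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos:
            if best is None or len(m.group()) > len(best[1]):
                best = (name, m.group())
    return best


def tokenize(text):
    """Split text into tokens, discarding whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"invalid character {text[pos]!r}")
        if tok[0] != "T_WS":
            tokens.append(tok)
        pos += len(tok[1])
    return tokens
```

With these rules, the input "if" is a tie between T_IF and T_ID and goes to T_IF because it is listed first, while "iffy" goes to T_ID because that match is longer.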

There are other features of Flex which you may find useful for this project. Two specific features that we suggest you investigate are definitions, which allow you to name a regular expression for later use as a subexpression, and start conditions, which allow you to move between different states and match only certain rules while in certain states.

Parser Specification

You will need to write a Bison specification for the language grammar, which will be used to parse the input. A Bison specification is made of a list of tokens, precedence/associativity specifiers, and a grammar.

The token listing looks like:

%token T_PLUS T_MINUS T_MULTIPLY
%token T_FOR T_IF

There may be multiple tokens per line and there may be multiple lines of tokens.

The precedence/associativity specifiers look like:

%left T_PLUS T_MINUS
%right T_UNARYMINUS

Tokens which are listed on the same line have the same precedence. The first line listed is the lowest precedence level and precedence increases until the last line. In the above example plus and minus have the same precedence and unary minus has higher precedence. The associativity of operators is also specified: %left meaning left associative, %right meaning right associative, and %nonassoc meaning no associativity.

The grammar section closely resembles the standard format for CFGs, but must be specified with a particular syntax. An example of two non-terminals and associated rules is:

Parens : T_OPENPAREN Parens T_CLOSEPAREN
       | Inner
       ;

Inner  : T_NUMBER
       |
       ;

Here the colon signifies the start of the first rule option (like the arrow in normal grammar format), the vertical bar separates multiple rule options, and the semicolon marks the end of the rule options for a given non-terminal. A rule may consist of tokens and non-terminals. If a rule has no tokens or non-terminals (as in the last rule shown above), then it is an epsilon rule, which matches the empty string.
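To see which token sequences the example grammar accepts, here is a small hand-written recognizer in Python. It is a top-down sketch for illustration only (Bison generates a bottom-up LR parser instead), and the function names are hypothetical. Tokens are represented as plain strings.

```python
def parse_inner(tokens, i):
    """Inner : T_NUMBER | (empty). Return the index after the match."""
    if i < len(tokens) and tokens[i] == "T_NUMBER":
        return i + 1
    return i  # epsilon rule: consume nothing


def parse_parens(tokens, i=0):
    """Parens : T_OPENPAREN Parens T_CLOSEPAREN | Inner."""
    if i < len(tokens) and tokens[i] == "T_OPENPAREN":
        i = parse_parens(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != "T_CLOSEPAREN":
            raise SyntaxError("expected T_CLOSEPAREN")
        return i + 1
    return parse_inner(tokens, i)


def accepts(tokens):
    """True if the whole token list is derivable from Parens."""
    try:
        return parse_parens(tokens) == len(tokens)
    except SyntaxError:
        return False
```

Because of the epsilon rule, both the empty input and a pair of parentheses with nothing inside are accepted.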

Bison will automatically resolve the grammar according to the declared precedence and associativity, and therefore the grammar should be left ambiguous for operations which have specified precedence and associativity. Because Bison generates LR parsers, left and right recursion can both be used in the grammar. However, left recursion is preferable for LR parsers: a left-recursive rule can be reduced as each element is read, whereas a right-recursive rule must keep every element on the parser stack until the end of the sequence.

If any state of the LR automaton has the option to do a shift action or a reduce action when seeing the same lookahead token, this will cause a shift/reduce conflict. Similarly, if the automaton has the option to do multiple different reductions this will cause a reduce/reduce conflict. Bison will output a warning noting any shift/reduce or reduce/reduce conflicts when you build. It also generates a file called parser.output, which contains a verbose description of all rules, tokens, nonterminals, and states of the parser. This file will contain a list of the states with conflicts, which you can then find farther down the file to see exactly why the conflict is occurring.

Language

A program consists of a sequence of classes. There must be at least one class, and there is no maximum number of classes. A single class has one of two structures, which are identical except for an optional superclass, which the class inherits from. The body of the class consists of members followed by methods, in that exact order. The two structures can be seen below:

ClassName extends SuperclassName {
  Members
  Methods
}

ClassName {
  Members
  Methods
}

There may be zero or more members inside a class, and each member must be declared in its own declaration. A member declaration has the following form:

Type membername;

Methods

There may be zero or more methods inside a class. Each method has a name, a list of zero or more parameters, a return type, and a body. The method declaration has the following form:

MethodName(Parameters) -> ReturnType {
  Body
}

The parameter list consists of zero or more parameters, which are separated by commas if there are two or more. A parameter has a type and a name, and has the following form:

Type identifier

The body of each method consists of zero or more declarations, followed by zero or more statements, and then finally followed by an optional return statement. The declarations must appear before statements, and both must appear before the return statement.

A declaration defines one or more variables of a single type. The form of a declaration is:

Type identifier, identifier, ..., identifier;

If there is more than one identifier listed, the identifiers are separated by commas.

The optional return statement has the following form:

return Expression;

Statements

A statement has six possible forms: assignment, method call, if-else, while, do-while, and print.

An assignment statement consists of a destination variable and an expression, and has one of the following forms:

identifier = Expression;
identifier.identifier = Expression;

A method call statement consists of a single method call expression followed by a semicolon.

An if-else statement consists of a condition expression, a block of statements to execute if the condition holds, and an optional block of statements to execute if the condition does not hold. The two possible structures can be seen below:

if Expression {
  Block
}

if Expression {
  Block
} else {
  Block
}

A while statement consists of a condition to check before each loop iteration and a body, which is a block of statements. The structure can be seen below:

while Expression {
  Block
}

A do-while statement consists of a body (which is a block of statements) and a condition to check before each loop iteration after the first iteration. The body is executed again and again as long as the given condition is True. The structure can be seen below:

do {
  Block
} while (Expression);

The Blocks of statements used in if-else, while, and do-while statements consist of one or more statements.

A print statement consists of a single expression to print as output, and has the following form:

print Expression;

Expressions

An expression has many different forms. A method call has two forms: a self-call, where a method is called on the same object, or a method call on a specified object. A grammar specifying all possible expressions (including method call expressions) is below:

Expression  → Expression + Expression
            | Expression - Expression
            | Expression * Expression
            | Expression / Expression
            | Expression > Expression
            | Expression >= Expression
            | Expression == Expression
            | Expression and Expression
            | Expression or Expression
            | not Expression
            | Expression ? Expression : Expression
            | - Expression
            | identifier
            | identifier . identifier
            | MethodCall
            | ( Expression )
            | int literal
            | True
            | False
            | new ClassName
            | new ClassName ( Arguments )
MethodCall  → identifier ( Arguments )
            | identifier . identifier ( Arguments )
Arguments   → Arguments'
            | ε
Arguments'  → Arguments' , Expression
            | Expression

In this grammar, Expression, MethodCall, Arguments, and Arguments' are non-terminals. All other symbols are terminals (lexemes), and ε denotes the empty string.

The precedence of operators is as follows (from lowest to highest): the ternary operator (?:); then or; then and; then greater-than (>), greater-than-or-equal (>=), equal-to (==); then plus (+), minus (-); then multiply (*), divide (/); then not, unary minus. All operators are left-associative, except not, unary minus, and the ternary operator, which are right-associative.
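To check your understanding of how these precedence and associativity rules group an expression, the following Python sketch fully parenthesizes a list of tokens. It uses precedence climbing, a top-down technique chosen here only for illustration; it is not what Bison does internally, where precedence and associativity declarations resolve the conflicts instead. The numeric levels and function names are assumptions of this sketch, and the input is assumed to be well-formed.

```python
# Precedence levels from the list above (higher binds tighter).
TERNARY_PREC = 1
BINARY = {"or": 2, "and": 3, ">": 4, ">=": 4, "==": 4,
          "+": 5, "-": 5, "*": 6, "/": 6}
UNARY_PREC = 7


def parenthesize(tokens):
    """Return the expression as a fully parenthesized string."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr(min_prec):
        tok = advance()
        if tok in ("not", "-"):          # prefix operators
            left = f"({tok} {expr(UNARY_PREC)})"
        elif tok == "(":
            left = expr(TERNARY_PREC)
            advance()                    # consume ")"
        else:
            left = tok                   # identifier or literal
        while True:
            op = peek()
            if op in BINARY and BINARY[op] >= min_prec:
                advance()
                # prec + 1 on the right side gives left associativity
                right = expr(BINARY[op] + 1)
                left = f"({left} {op} {right})"
            elif op == "?" and TERNARY_PREC >= min_prec:
                advance()
                middle = expr(TERNARY_PREC)
                advance()                # consume ":"
                # same level on the right side gives right associativity
                right = expr(TERNARY_PREC)
                left = f"({left} ? {middle} : {right})"
            else:
                return left

    return expr(TERNARY_PREC)
```

For example, the tokens 1 - 2 - 3 group as ((1 - 2) - 3) because minus is left-associative, while a ? b : c ? d : e groups as (a ? b : (c ? d : e)) because the ternary operator is right-associative.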

Types

A type is either int, boolean, or a name of a class (meaning that the variable is an object of that class). The keyword int specifies an integer and boolean specifies a boolean. A function return type may be any of the three allowed types or none. None signifies that the method does not have a return statement and therefore does not return anything, and is specified by the none keyword.

Lexemes

Class names, member names, method names, and identifiers all have the same form. They consist of a single letter a-z (upper or lower case), followed by zero or more letters and digits. They may not be the same as any of the reserved keywords (print, return, if, else, while, new, int, boolean, none, ==, and, or, not, True, False, extends, do).
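The identifier rule above can be checked with a short Python sketch. The regular expression is a direct reading of the description, and the function name classify is hypothetical. The == entry from the list above is left out of the keyword set because it is an operator and can never match the identifier pattern anyway.

```python
import re

# Reserved words from the list above ("==" omitted: it cannot look like a name).
KEYWORDS = {
    "print", "return", "if", "else", "while", "new", "int", "boolean",
    "none", "and", "or", "not", "True", "False", "extends", "do",
}

# One letter, then zero or more letters and digits.
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def classify(lexeme):
    """Classify a lexeme as 'keyword', 'identifier', or 'invalid'."""
    if lexeme in KEYWORDS:
        return "keyword"
    if IDENTIFIER.fullmatch(lexeme):
        return "identifier"
    return "invalid"
```

Note that the comparison is case-sensitive, so While is an identifier even though while is a keyword, and names containing an underscore are not valid identifiers.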

Integer literals consist of either a single zero or a non-zero digit followed by zero or more digits.

Language keywords are case-sensitive and must be written exactly as listed above: entirely in lower-case letters, except for True and False, which are capitalized.
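As a quick check of the integer literal rule, here is a Python sketch with a regular expression that follows the description directly. The helper names are hypothetical.

```python
import re

# Either a single zero, or a non-zero digit followed by any digits.
INT_LITERAL = re.compile(r"0|[1-9][0-9]*")


def is_int_literal(text):
    """True if the whole string is a single integer literal."""
    return INT_LITERAL.fullmatch(text) is not None


def longest_int_prefix(text):
    """Return the integer literal at the start of text, or '' if none."""
    m = INT_LITERAL.match(text)
    return m.group() if m else ""
```

One consequence worth noticing: a string with leading zeros such as 007 is not a single literal, since only the first 0 matches as a literal on its own.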

Comments

Comments in the language are the same as multi-line C comments. A comment starts with /* and ends with */. The first closing comment sequence immediately closes the comments (they are not longest match and they are not nested). Any characters may appear inside the comments and should be completely ignored by the compiler. Comments must be closed, and unclosed comments at the end of the program should cause a "dangling comment" error to be output.
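The comment rules can be sketched in Python as follows. This is only an illustration of the intended behavior (first closing sequence wins, no nesting, unclosed comments are an error, and newlines inside comments still count toward line numbers); in your solution this logic belongs in the Flex specification. The function name and the exact wording of the error message are assumptions.

```python
def strip_comments(source):
    """Return (source without comments, line number reached at the end)."""
    out = []
    line = 1
    i = 0
    while i < len(source):
        if source.startswith("/*", i):
            # The first "*/" after the opener closes the comment.
            end = source.find("*/", i + 2)
            if end == -1:
                raise SyntaxError(f"dangling comment at line {line}")
            # Newlines inside the comment still advance the line count.
            line += source.count("\n", i, end)
            i = end + 2
        else:
            if source[i] == "\n":
                line += 1
            out.append(source[i])
            i += 1
    return "".join(out), line
```

Since comments do not nest, the input /* /* */ x leaves x outside the comment, and a stray */ after a closed comment is ordinary text that the scanner must handle on its own.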

Whitespace

Whitespace between valid tokens should be ignored (however you need to keep track of the line numbers for error reporting). Any amount of whitespace is allowed (including no whitespace).

Resources

Some resources which you may find extremely useful for this project (more may be added as the project progresses):

Bison Manual
Flex Manual
Specific sections of the above manuals, linked in the project description section

Requirements

To obtain full credit for this project, your solution will need to:

Parse all valid input programs in the language with no errors.
Reject all invalid input programs with a single error message.
Include correct line numbers in error messages.
Build with no shift/reduce or reduce/reduce conflicts in Bison.
Build with no "rule cannot be matched" or other warnings in Flex.

It is highly recommended to verify your solution's output for the provided test cases (a single error for invalid programs, the NN.bad.lang files, and no output for valid programs, the NN.good.lang files). More information about how to do this can be found in the Grading section.

Make sure that you compile, run, and test your program on the CSIL server, especially if you write your program on your own machine. You may have a newer version of Flex and/or Bison installed than the versions on CSIL, so it is extremely important to make sure that all of the above requirements are satisfied when building on CSIL.

Deliverables

You should submit the files lexer.l and parser.y, with complete definitions of the scanner and parser, to Gradescope.

You also must include a README file which includes your name, perm number, email address, and any issues with your solution and explanations.

You do not need to submit the test cases, makefile, or the unmodified source files.

Grading

Your grade will be based on the proportion of test cases for which your program produces correct output. We have provided some of these test cases along with the code so that you can test your program on most of the cases. However, we will run your program on additional tests, so you should write test cases beyond the ones we gave you to test your program thoroughly.

Test cases named NN.good.lang should parse successfully with no errors, and test cases named NN.bad.lang should fail to parse with a single error.

To run your solution against all the test cases, you can type make run. This will run all test cases in the tests folder. The expected output when running the runtests.py script is provided in the output.txt file. Be aware that make run produces output in verbose mode: you will see "No output." for good programs and the full error message, such as "syntax error, unexpected T_RETURN, expecting ';' at line 6", for bad programs. The make diff command does not use verbose mode: you will see "No errors." for good programs and a message such as "Error produced on line 6." for bad programs. The expected output in output.txt is in the latter format, so that make diff can check the difference against it. To compare your solution's output to the expected output, simply type make diff.