Project 6

Code Generation

In this project you generate executable x86 assembly code which will implement the input programs to the compiler. This will complete the process of compilation.

Specification and Description Changelog

A list of all the changes made to this project description is below:

  • No changes so far.

Files

You can find the provided code files required for the project in p6.zip.

All files from Project 5 will again be present in this project. You will need to copy your completed lexer.l, parser.y and typecheck.cpp files into the code folder. You do not have to selectively copy your rules, as the lexer.l and parser.y files have not changed.

There are three new files in the code folder. The codegeneration.cpp and codegeneration.hpp files define the CodeGenerator visitor which will visit the AST and generate x86 Assembly code. You will need to write the implementation of the CodeGenerator visitor inside codegeneration.cpp. The tester.c file is a C entry point that you can link with your assembly code to generate an executable.

For this project you will complete the implementation of the codegeneration.cpp file so that Code Generation is correct for all valid input programs. You may also need to further modify your typecheck.cpp file, depending on what level of completion was achieved in Project 5 and how you choose to implement member inheritance. The symbol table is automatically accessible from the CodeGenerator visitor via magic (look at main).

Description

After the symbol table and type checking are complete, the next step is to generate x86 Assembly code to implement the input program. This will be done in the CodeGenerator visitor functions and will consist of printing the assembly to standard output.

The code generation for expressions will implement a stack machine. This means that the format for most binary operation expressions might look like:

pop operand2
pop operand1
operation operand2, operand1
push operand1

The important x86 instructions for this component are the push, pop, and arithmetic instructions. Arithmetic instructions which are recommended are add, sub, imul, idiv, and neg. You may also want to make use of select logical instructions such as xor, or, and and for some expressions.

In x86 Assembly, there are 8 general purpose registers. In this project you will mostly want to make use of up to 6 of them: eax, ebx, ecx, and edx are general purpose registers which can be used for any operation (arithmetic, memory access, etc.); esp, the stack pointer, which should always point to the top of the stack; and ebp, the base pointer (or frame pointer) which will be used to point to a specific point in the current stack frame. To refer to a register in an instruction, prefix it with %. For example: pop %eax.

Now, using the general purpose registers, we could implement the stack machine code for a plus expression as:

pop %edx
pop %eax
add %edx, %eax
push %eax

This will pop the second operand into the edx register, pop the first operand into the eax register, add the second operand to the first (meaning that the sum is in the eax register), and then push that result back to the stack for the next stack machine operation to use. Code for other expressions may have a very similar format.

Since we are using a visitor pattern to generate the code, we will output the appropriate code for a given node inside the visitor function for that node. We also usually will want to call the current node's visit_children method at some point to have the children generate and output their code. Thus, the complete code for a PlusNode may look like:

void CodeGenerator::visitPlusNode(PlusNode* node) {
    node->visit_children(this);
    std::cout << "  # Plus" << std::endl;
    std::cout << "  pop %edx" << std::endl;
    std::cout << "  pop %eax" << std::endl;
    std::cout << "  add %edx, %eax" << std::endl;
    std::cout << "  push %eax" << std::endl;
}

All code will be output to the standard output stream, which can be done using std::cout. You can also emit whitespace before the instructions to allow yourself to read the generated code more easily. Comments are also possible, and have the form # comment where everything on the line following the hash is commented. These can be on their own line, or can follow an instruction.

You might also notice in the above example that the edx register is popped into first before eax. The order that you pop the child expression results is important, as in operations like subtraction and division, the order of the operands matters. You will get a very different result if you use sub %edx, %eax versus sub %eax, %edx. Remember that the instruction takes the first register operand and subtracts its value in place from the second register operand. So if you have an expression that looks like 2 - 1, you must set up your stack machine code so that the sub instruction looks like (in pseudocode): sub registerholding1, registerholding2, and then push registerholding2. Also remember that since the stack is LIFO (last in, first out), the first thing you pop will be the result of the right or second operand.
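The effect of operand order can be sketched with a small C++ helper that builds the stack machine code for a binary operation. The name emitBinaryOp is hypothetical and not part of the provided code; in the real visitor the same lines would be printed from the matching visit function after the children are visited.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): builds stack
// machine code for "left OP right", where the left operand value was
// pushed first and the right operand value second.
std::string emitBinaryOp(const std::string& instr) {
    // Only two-operand instructions of the form "instr src, dst" fit here.
    assert(instr == "add" || instr == "sub" || instr == "imul");
    std::string code;
    code += "  pop %edx\n";                  // right operand (pushed last)
    code += "  pop %eax\n";                  // left operand
    code += "  " + instr + " %edx, %eax\n";  // eax = eax OP edx
    code += "  push %eax\n";                 // result for the next operation
    return code;
}
```

For 2 - 1, the value 1 ends up in edx and 2 in eax, so sub %edx, %eax leaves 2 - 1 in eax.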

Remember that we are using the GNU assembler (gas), which has a specific syntax. gas syntax has the following format: instr source, destination. This means that add %edx, %eax will do eax += edx and mov %edx, %eax will do eax = edx (eax gets value from edx).

The next step after expressions is setting up stack frames.

When we start executing our function code, the first thing we will do is set up the base/frame pointer. Since we do not want to lose the old base pointer from the previous stack frame, we will save it. This is done by pushing ebp onto the stack. We will then set a new base pointer for our current frame. Specifically, we will want this to point to the old base pointer. Since the stack pointer will be always pointing at the top of the stack, it will point at the old base pointer right after we push it. Therefore we will want to set the new base pointer to the current stack pointer (after we push the old base pointer). The important instructions for this part are push and mov.

We then need to allocate space for the local variables. This will be done by moving the stack pointer. As we move the stack pointer, we are essentially skipping memory that will not be used when pushing and popping to the stack. All of those pushes and pops will happen after that skipped memory. This is how we allocate memory on the stack. Since the stack grows downwards, we will want to move the stack pointer some distance down the address space. This part will make use of the local variable size of the current method and will most likely use the sub instruction.

At this point we can now execute the body of our function. In the visitor pattern, this means we want to visit the children.

Once the body of the function is finished executing, we need to return. To return from a function we need to: save the return value, deallocate the local variable space, restore the old base pointer for the previous function, and then jump back to the return address. The return value will be placed into the eax register (this register is always used for the return value in the __cdecl style of function calls). Because our expressions are implemented using a stack machine, we will need to pop the result of the expression that is part of the return statement into eax.

Next, we need to deallocate the local variable space. Conveniently, the current base pointer will always be pointing at the location immediately before the local variable space. Therefore to deallocate, we can just move the stack pointer to point at the same place as the current base pointer. Moving the stack pointer up will do the opposite of moving the stack pointer down, and will perform a deallocation of memory on the stack. Since the old base pointer was pushed to the stack immediately before the local variable space, our next step of restoring the old base pointer can just be a pop into ebp.

Finally we need to return to the caller function by jumping back to the return address. This is accomplished with the ret instruction, which pops the return address from the stack and then jumps to it. This will be the last instruction generated for any function. Since our programs are entirely composed of methods, ret will also be the last instruction generated overall.

The code that we execute before our function body, but inside the callee function is called the function prologue. This code will need to be generated before the children (inside) of a method are visited. The code that is executed at the end of the callee function to cleanup and return is called the function epilogue. This code will need to be generated after the method body, and will therefore need to come after the children are all visited.

The last piece of the puzzle for functions (not including calling them) is a label. This label signifies the start of the function and must be present for every function. In this class, we will always use the following format for function labels: classname_methodname. Since all functions are methods, they will all be inside a class and have a class name. An example is Main_main, which is the label of the entry point function. The format for a label in assembly is label:. Your visitor will need to generate appropriate labels for each function and output them. To do this, you will need the name of the current class as well as the name of the method you are generating a label for. The name of the method will be accessible as a child of the MethodNode, and there is a member of the CodeGenerator visitor called currentClassName which can be used to store the current class name. There is also a member called currentMethodName which can be used to store the name of the current method, if you need this information elsewhere.
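The label, prologue, and epilogue described above can be sketched as two C++ helpers that build the text to print. The names emitPrologue and emitEpilogue and the localBytes parameter are hypothetical; in the real visitor the class name, method name, and local variable size would come from currentClassName, the MethodNode, and the symbol table. Saving the callee-save registers is left out of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code). localBytes stands
// for the local variable size of the method, taken from the symbol table.
std::string emitPrologue(const std::string& className,
                         const std::string& methodName, int localBytes) {
    assert(localBytes >= 0 && localBytes % 4 == 0);
    std::string code;
    code += className + "_" + methodName + ":\n";  // function label
    code += "  push %ebp\n";                       // save the old base pointer
    code += "  mov %esp, %ebp\n";                  // set the new base pointer
    code += "  sub $" + std::to_string(localBytes) + ", %esp\n";  // allocate locals
    return code;
}

// Assumes the return statement has already popped its value into eax.
std::string emitEpilogue() {
    std::string code;
    code += "  mov %ebp, %esp\n";  // deallocate the local variable space
    code += "  pop %ebp\n";        // restore the old base pointer
    code += "  ret\n";             // pop the return address and jump to it
    return code;
}
```

The prologue text would be printed before visiting the children of the method, and the epilogue text after them.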

The caller and callee are both responsible for saving (preserving) a specific set of registers. You can do this by pushing them to and popping them from the stack strategically. The caller save registers are: %eax, %ecx, and %edx. The callee save registers are: %ebx, %esi, and %edi.

Next, you need to handle variables. Each assignment statement will store the right-hand-side expression value to the correct location in the local variable space, and each variable access expression (an expression consisting of a single identifier) will load a value from the correct location in the local variable space. Both of these can be accomplished with the mov instruction. This instruction will move data from a source to a destination, and the source and destination can either be a memory address or a register.

Memory addressing in gas syntax has multiple different syntax options, which can be found on this Wikibook page. The syntax for local variables will usually have the form offset(pointer), where the pointer will be the base pointer ebp. So a memory load from 4 below the base pointer (first local variable) to the eax register might look like: mov -4(%ebp), %eax. A move from the eax register to 4 below the base pointer would look like: mov %eax, -4(%ebp).

For each local variable, you will need to output an instruction that uses the correct offset for that variable. This information will come from the symbol table when you look up the variable in the current variable table. The returned VariableInfo will contain the offset that you set while constructing the symbol table. Note that the offset for the same variable will always be the same, which is how we ensure that different stores and loads to the same variable will access the exact same location. To allow you to access the current variable table, there is a member of the CodeGenerator visitor called currentMethodInfo. You can set this to be the MethodInfo for the current method whenever you enter a method and then use the variables member of MethodInfo to access the current local variable table. Similarly there is a currentClassInfo member of the visitor, which will be used in a similar fashion once we introduce objects.
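As a rough sketch of the two mov forms, the helpers below build the code for a variable access expression and for an assignment to a local variable. The names emitLoadLocal and emitStoreLocal are hypothetical, and the offset parameter stands for the value read from the VariableInfo of the variable.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): variable access.
std::string emitLoadLocal(int offset) {
    // Offset 0 holds the saved base pointer, so it is never a variable.
    assert(offset != 0 && offset % 4 == 0);
    std::string code;
    code += "  mov " + std::to_string(offset) + "(%ebp), %eax\n";  // load
    code += "  push %eax\n";  // leave the value on the stack
    return code;
}

// Hypothetical helper: assignment, assuming the right-hand-side value
// was already pushed by the expression code.
std::string emitStoreLocal(int offset) {
    assert(offset != 0 && offset % 4 == 0);
    std::string code;
    code += "  pop %eax\n";  // right-hand-side value
    code += "  mov %eax, " + std::to_string(offset) + "(%ebp)\n";  // store
    return code;
}
```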

Once all your code is generated for the input program, you need to make the entry point function, Main_main, callable from C code. To do this, use the .globl directive in assembly (note that there is no a, and it is not "global"). You will want to emit a .globl directive that tells the linker that the Main_main label is a callable function, which will look like:

.globl Main_main

This will come at the top of the assembly file, meaning that you want to generate it at the very beginning of your visitation.

At this point you should have a complete and callable assembly program! The only thing left is to assemble it and link it with the provided C entry point in the tester.c file. To do this you could use the following command (assuming that your assembly code is in a file called code.s):

gcc -m32 -o executable tester.c code.s

This will create a real executable that you can run, and will call the Main->main function in the input program passed in to your compiler. A few things to note about this command: try to always give your assembly files the .s file extension, this will make sure that gcc knows that it is an assembly file and should be assembled; and make sure to use the -m32 flag, which will tell the assembler and gcc that we are in 32-bit mode. The flag is important, and your assembly will not assemble without it.

Now that we have code generation done for simple programs, we are going to extend it to all valid input programs. This means implementing more functions in the CodeGenerator visitor. Not all functions must have an implementation; the only requirement is that enough are implemented to produce valid code that implements any input program.

The main program components that still need to be implemented are control flow, objects, method calls, and inheritance. In addition to these main features, any other unimplemented statements and expressions need to be implemented as well.

Control flow will need to be implemented using jump instructions such as jmp, je, and others. These instructions jump to another location in the code, and can jump unconditionally or based on some condition. The condition will be checked on a cmp instruction which immediately precedes the jump. In gas syntax the comparison tests the second operand against the first. For example, to jump to the label called label_1 if %eax is less than or equal to %ebx, you might do:

cmp %ebx, %eax
jle label_1

If-Else statements should execute the first list of statements if the expression evaluates to true. If the else branch is present, it should be executed if the expression evaluates to false. Only one branch should be executed at any specific program point. To achieve this, you may need to visit the children in a specific order with generated assembly in between some of the children. If you visit the children manually, make sure you don't call visit_children as well, or each child will be visited twice. To visit a specific ASTNode, call its accept function and pass in the visitor that you want to visit it. In most cases this will look like nodeToVisit->accept(this). All ASTNodes have the accept function, so you can call it on any child.
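One possible layout for an If-Else statement is sketched here as a C++ helper. The name emitIfElse, the label names, and the id counter are hypothetical. The three code parameters stand for the assembly that each child prints when you call accept on it, and the sketch assumes that the condition code pushes 0 for false and a nonzero value for true.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code). id is a counter
// that keeps the labels of different If-Else statements unique.
std::string emitIfElse(const std::string& condCode, const std::string& thenCode,
                       const std::string& elseCode, int id) {
    assert(id >= 0);
    std::string n = std::to_string(id);
    std::string code = condCode;        // pushes the condition value
    code += "  pop %eax\n";
    code += "  cmp $0, %eax\n";
    code += "  je else_" + n + "\n";    // condition false: skip the first branch
    code += thenCode;
    code += "  jmp endif_" + n + "\n";  // first branch done: skip the else branch
    code += "else_" + n + ":\n";
    code += elseCode;                   // empty when no else branch is present
    code += "endif_" + n + ":\n";
    return code;
}
```

In the visitor, the same effect comes from printing the jumps and labels between the accept calls on the condition and the two statement lists.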

While statements should check the guard expression, and execute the loop body if the expression is true. If the guard is false, then the loop execution terminates and the loop body should not be executed. If the body executes, it should then check the guard expression and repeat the process. Do while statements are similar, except that the guard expression is evaluated after the loop body is executed.
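The two loop forms differ only in where the guard code is placed, which the following sketch tries to show. The helper names, label names, and id counter are hypothetical, and the guard code is assumed to push 0 for false and a nonzero value for true.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): guard first.
std::string emitWhile(const std::string& condCode, const std::string& bodyCode,
                      int id) {
    assert(id >= 0);
    std::string n = std::to_string(id);
    std::string code = "while_" + n + ":\n";
    code += condCode;                      // evaluate the guard
    code += "  pop %eax\n";
    code += "  cmp $0, %eax\n";
    code += "  je endwhile_" + n + "\n";   // guard false: leave the loop
    code += bodyCode;
    code += "  jmp while_" + n + "\n";     // check the guard again
    code += "endwhile_" + n + ":\n";
    return code;
}

// Hypothetical helper: body first, guard afterwards.
std::string emitDoWhile(const std::string& bodyCode, const std::string& condCode,
                        int id) {
    assert(id >= 0);
    std::string n = std::to_string(id);
    std::string code = "dowhile_" + n + ":\n";
    code += bodyCode;                      // the body always runs once
    code += condCode;                      // evaluate the guard
    code += "  pop %eax\n";
    code += "  cmp $0, %eax\n";
    code += "  jne dowhile_" + n + "\n";   // guard true: repeat the body
    return code;
}
```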

Objects will be allocated on the heap using the malloc standard library function. The allocation happens when a new expression is executed. To call malloc, you will push one parameter to the stack which represents the size of the object to allocate. malloc returns a pointer to the allocated space, which should be pushed to the stack as the result of the new expression. An example of calling malloc to allocate an object of size 12 (with 3 members) is:

push $12
call malloc
add $4, %esp
push %eax

All objects will be allocated before they are used for member access, and all test programs follow this rule. Make sure to push the correct argument for malloc, and also make sure to remove the argument after malloc returns. It is not necessary to free objects when they go out of scope, and it is not necessary to do any kind of garbage collection or memory cleanup.
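A minimal sketch of the allocation part of a new expression follows. The name emitNew is hypothetical, and the sketch assumes that every member takes 4 bytes, which matches the 12 byte object with 3 members in the example. The constructor call is not shown.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code). memberCount
// stands for the number of members of the class, taken from the symbol table.
std::string emitNew(int memberCount) {
    assert(memberCount >= 0);
    int bytes = 4 * memberCount;  // assumes 4 bytes per member
    std::string code;
    code += "  push $" + std::to_string(bytes) + "\n";  // argument for malloc
    code += "  call malloc\n";
    code += "  add $4, %esp\n";   // remove the argument
    code += "  push %eax\n";      // object pointer is the expression result
    return code;
}
```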

After allocating an object, its constructor will be called. If the new statement has arguments, then these will be passed in to the constructor. Constructors are methods which have the exact same name as the class (including capitalization) and return none. If no constructor is present, then nothing needs to be called after malloc. However this should be a type error if there are arguments in the new statement and no constructor. Constructors will have the self pointer available, as with all other methods, and can therefore do initialization of the new object.

The destination (left-hand side) of assignment statements may also be object members. You will need to use the receiver object pointer to handle this kind of assignment. Remember that all variables referenced without receivers may either be local variables or members of the current object. This means that you need to determine which case applies and then either access the local variable as described above, or use the self pointer and the member offset to access the variable. This can also apply to the left-hand side of an assignment, even where it only has a single identifier.

The self pointer is a pointer to the current object. It will always be the first parameter to every method, and you can always access it at offset 8 above the base pointer. The Main_main method (the entry method) will not have a self pointer, but this won't cause issues since the Main class cannot have any members to access.
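Member access through the self pointer might be sketched as follows. The helper names are hypothetical, and the sketch assumes that member offsets are counted from the start of the object, with the first member at offset 0. It uses ecx for the object pointer because ecx is a caller-save register.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): read a member
// of the current object and push its value.
std::string emitLoadMember(int offset) {
    assert(offset >= 0 && offset % 4 == 0);
    std::string code;
    code += "  mov 8(%ebp), %ecx\n";  // self pointer (first parameter)
    code += "  mov " + std::to_string(offset) + "(%ecx), %eax\n";
    code += "  push %eax\n";
    return code;
}

// Hypothetical helper: store an already pushed value into a member
// of the current object.
std::string emitStoreMember(int offset) {
    assert(offset >= 0 && offset % 4 == 0);
    std::string code;
    code += "  pop %eax\n";           // right-hand-side value
    code += "  mov 8(%ebp), %ecx\n";  // self pointer (first parameter)
    code += "  mov %eax, " + std::to_string(offset) + "(%ecx)\n";
    return code;
}
```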

Method calls will call the appropriate ClassName_MethodName label after pushing the appropriate arguments. Make sure to push the arguments to the stack in reverse order (part of the __cdecl calling convention). The first argument to every method in the program except for Main_main will be the self pointer, which is a pointer to the receiver object that the method is being called on. This pointer is necessary to allow the called method to access members on the current object. Therefore, you need to pass the pointer to the receiver object as the first parameter; because arguments are pushed in reverse order, it is the last one pushed. If the method has no explicit receiver (has the form method(args)), then the receiver and the self pointer will be the current object self pointer.

After a method returns, you need to make sure to clean up the arguments from the stack. These two components are called the pre-call sequence and the post-return sequence and should happen in any method call expressions/statements. Make sure that you also save and restore the caller-save registers.
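The pre-call and post-return sequences might be sketched like this. The name emitMethodCall is hypothetical. The receiverCode and argCodes parameters stand for the assembly that pushes the receiver pointer and each argument value, with the arguments listed in source order. Saving and restoring the caller-save registers is left out of the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not provided code).
std::string emitMethodCall(const std::string& label,
                           const std::string& receiverCode,
                           const std::vector<std::string>& argCodes) {
    assert(!label.empty() && !receiverCode.empty());
    std::string code;
    // Pre-call sequence: explicit arguments in reverse order...
    for (auto it = argCodes.rbegin(); it != argCodes.rend(); ++it) {
        code += *it;
    }
    // ...then the receiver last, so the callee finds it at 8(%ebp).
    code += receiverCode;
    code += "  call " + label + "\n";
    // Post-return sequence: remove the arguments and the receiver.
    int bytes = 4 * (static_cast<int>(argCodes.size()) + 1);
    code += "  add $" + std::to_string(bytes) + ", %esp\n";
    code += "  push %eax\n";  // return value is the expression result
    return code;
}
```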

You also need to handle inheritance of objects. If a method does not exist for the class of an object, you need to continue examining parent objects until the method is found. Remember that a method can exist on the immediate parent, or any subsequent parent in the hierarchy. You will need to call the appropriate label for the function that is found. There is no method overloading, so a method with a given name will only appear once in the hierarchy for a given object.

Methods in the language are called with static dispatch, and we will select the appropriate method to call at compile time.
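Selecting the label at compile time amounts to walking up the class hierarchy until the method is found. The sketch below uses a simplified, hypothetical ClassEntry table rather than the real symbol table types, whose fields may differ.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the symbol table (an assumption).
struct ClassEntry {
    std::string superClassName;     // empty when the class has no parent
    std::set<std::string> methods;  // methods defined directly in the class
};
using ClassTable = std::map<std::string, ClassEntry>;

// Returns the label to call, or an empty string if no class in the
// hierarchy defines the method.
std::string resolveMethodLabel(const ClassTable& table, std::string className,
                               const std::string& methodName) {
    assert(!methodName.empty());
    while (!className.empty()) {
        auto it = table.find(className);
        if (it == table.end()) break;           // unknown class: give up
        if (it->second.methods.count(methodName) > 0) {
            return className + "_" + methodName;
        }
        className = it->second.superClassName;  // continue with the parent
    }
    return "";
}
```

The class to start from is the static type of the receiver expression, or the current class when the call has no explicit receiver.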

Objects also inherit members from their parent classes. When allocating the object, and figuring out object member offsets, this must be taken into consideration and handled correctly. One possible solution is to add the parent class members to the symbol table for the child class. However, you are not required to do this; you are only required to solve the problem somehow.
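One way to keep offsets consistent is to place the parent members first, so that a member has the same offset in the parent and in every child. The sketch below illustrates that choice with a simplified, hypothetical ClassLayout table and assumes 4 bytes per member. It is only one possible approach.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the symbol table (an assumption).
struct ClassLayout {
    std::string superClassName;        // empty when the class has no parent
    std::vector<std::string> members;  // members declared in the class itself
};
using LayoutTable = std::map<std::string, ClassLayout>;

// Size in bytes of an object, including all inherited members.
int objectSize(const LayoutTable& table, std::string className) {
    int bytes = 0;
    while (!className.empty()) {
        auto it = table.find(className);
        assert(it != table.end());
        bytes += 4 * static_cast<int>(it->second.members.size());
        className = it->second.superClassName;
    }
    return bytes;
}

// Offset of a member from the object pointer, or -1 if it is not found.
// Inherited members come first, followed by the members of the class itself.
int memberOffset(const LayoutTable& table, std::string className,
                 const std::string& member) {
    while (!className.empty()) {
        auto it = table.find(className);
        assert(it != table.end());
        const ClassLayout& cls = it->second;
        for (std::size_t i = 0; i < cls.members.size(); ++i) {
            if (cls.members[i] == member) {
                return objectSize(table, cls.superClassName) +
                       4 * static_cast<int>(i);
            }
        }
        className = cls.superClassName;
    }
    return -1;
}
```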

In addition to the above, you need to implement any other visitor functions which need to emit code for valid test cases. This includes incomplete expressions, other statements (like print), and any other applicable functions. To implement the print statement, you will call the standard library printf function.

Calling printf is done in a similar fashion to malloc, but has different arguments. printf will take two arguments, namely a format string and an integer to print. This can be thought of as the same as the equivalent call in C: printf("%d\n", i). This means that you need to push the expression value that is to be printed, and then push the format string. The way that we will handle the format string is by storing a string constant in the .data segment of the program, then referring to that address. The data segment part might look like this (at the very top of the program):

.data
printstr: .asciz "%d\n"

We then need to specify that we would like to change back to the .text or code segment. This can be done as follows:

.data
printstr: .asciz "%d\n"

.text
.globl Main_main
... (your assembly code for the input program here)

Now that we have created a format string, we can push its address to the stack as the first argument of printf. This is done by referring to it with a dollar sign, like this: push $printstr. Therefore a full call to printf might look like:

... (push second argument here, which is the expression result)
push $printstr
call printf
# remove both arguments after printf returns
add $8, %esp
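Put together, the file header and a print statement might be generated as in the following sketch. The names emitHeader and emitPrint are hypothetical, and exprCode stands for the assembly that pushes the value to print. The final add removes both arguments, in the same way as the malloc example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): the text that
// goes at the very top of the generated assembly file.
std::string emitHeader() {
    std::string code;
    code += ".data\n";
    code += "printstr: .asciz \"%d\\n\"\n";  // format string constant
    code += "\n";
    code += ".text\n";
    code += ".globl Main_main\n";
    return code;
}

// Hypothetical helper: code for a print statement.
std::string emitPrint(const std::string& exprCode) {
    assert(!exprCode.empty());      // the expression must push a value
    std::string code = exprCode;    // second argument: value to print
    code += "  push $printstr\n";   // first argument: format string address
    code += "  call printf\n";
    code += "  add $8, %esp\n";     // remove both arguments
    return code;
}
```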

Running the Code

Once you have implemented all the necessary visitor functions for valid programs, you should test your outputted assembly code! The commands to generate and assemble the code are the same as listed above:

./lang < test.lang > code.s
gcc -m32 -o test tester.c code.s

Then you can run the file called test to execute your generated assembly code. All print statements should print their values.

The makefile includes a target test that will run the above compilation for you. Just type make test, and it will compile and run the test.lang through your compiler then through gcc. You can then type ./test to run that executable. You can put whatever program you like into the test.lang file, including copying one of the provided test cases.

Other Project Notes and Tips

The registers all start with the "e" character, which denotes that it is an "extended register". This means that it is 32-bit (4 byte) instead of 16-bit (2 byte). Make sure to use the e prefix for all the registers. The general purpose instructions (instructions without any suffix) will automatically determine the correct width based on the arguments. So if you use an extended register, they will automatically go to the long version of the instruction. If you are having trouble with one of your instructions not being able to determine the operand size, or you are doing some kind of operation which does not use any extended registers, you can suffix the operation with the "l" character to tell the assembler to use the long instruction. For example: pushl $1.

For this project, you may find it helpful to write some small C code and then have gcc generate assembly for that code. This can often give insight into how the calling convention and function calls work. To do this, write a very small C program and save it to a file (I'll assume it's called small.c). Then run gcc with a few flags to have it compile it to assembly:

gcc -S -O0 -m32 -o small.s small.c

The -S flag tells gcc to output assembly, the -O0 flag turns off optimizations to make the assembly more readable, and -m32 specifies 32-bit mode which is what we are using for the project. The -o small.s tells gcc to save it as a .s file (the normal extension for assembly) named small.s -- this is where you should look for the assembly.

To use an immediate (constant) value as an operand, prefix it with a dollar sign. For example, to set eax to 1 you could do: mov $1, %eax. You do not use this when specifying offsets. The format -4(%ebp) without the dollar sign is correct.

Mac OS X

You must add special handling of certain code generation components if you want to compile and run your assembly on Mac OS X. If you do not want to run on Mac OS X, you do not need to do this (it is not part of the grade).

First, you need to be careful about your Main_main function's label name. On Linux, the label name for the int Main_main() function in the tester file is simply Main_main (as long as it has a .globl specifier). However, on Mac OS X the label name starts with an underscore: _Main_main. This means that your Main_main function label must be prefixed with an underscore, and the .globl specifier must also have the leading underscore. Since Linux uses the label without the underscore, you may want to conditionally generate it based on what platform your compiler is being built on. One easy way to do this is using the #if, #else, and #endif preprocessor directives inside your codegeneration.cpp file. The g++/gcc and llvm compiler suites define the macro __APPLE__ if the code is being compiled on Mac OS X. You could utilize this in your CodeGenerator visitor like this:

#if __APPLE__
// Generate label with leading underscore
#else
// Generate without leading underscore
#endif

This will need to be done wherever you generate the Main_main function label and also where you generate the .globl specifier.
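One way to avoid repeating the preprocessor check is a single helper that decorates a symbol name. The name platformSymbol is hypothetical. The same helper could be used for the label, the .globl specifier, and any other external symbol that needs the underscore.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): returns the
// symbol name as the platform's assembler and linker expect it.
std::string platformSymbol(const std::string& name) {
    assert(!name.empty());
#if defined(__APPLE__)
    return "_" + name;  // Mac OS X: leading underscore
#else
    return name;        // Linux: name unchanged
#endif
}
```

For example, the .globl line could be printed as ".globl " followed by platformSymbol("Main_main").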

As with the Main_main function, the calls to malloc and printf need to use the label prefixed with an underscore. So instead of calling malloc or printf, you should call _malloc and _printf if the architecture is Mac OS X. You can do this with a preprocessor if-else. For example:

#if __APPLE__
std::cout << "  call _malloc" << std::endl;
#else
std::cout << "  call malloc" << std::endl;
#endif

Another requirement of calling standard library functions on Mac OS X is that the stack pointer must be aligned to a multiple of 16 bytes, or the program will crash with a "misaligned stack error". This means that right before you output your call instruction, the %esp pointer must be an exact multiple of 16. One possible way to do this is to mask the last 4 bits and set them to zero. This will move the stack pointer directly down to the next 16 byte boundary. However, you should save the old stack pointer so that you can restore it after the standard library function call.

Next you will need to push the arguments, as before. However, since we are not pushing 4 arguments (we are pushing 1 for malloc and 2 for printf), you need to pad the arguments as well. This means that instead of just pushing the two arguments, you might need to subtract from the stack pointer as well to make the entire stack operation a multiple of 16. A commented example of this complete assembly code is below. Remember that this must only be done when assembling and running on Mac OS X.

# save the old stack pointer into
# a register to push later
mov %esp, %eax
# mask the stack pointer to make the last
# 4 bits zero (force to a multiple of 16)
and $0xFFFFFFF0, %esp
# subtract extra space to make parameter
# and stack pointer push a multiple of 16
sub $4, %esp
# push old stack pointer to restore later
push %eax
# push parameters
push $1
push $printstr
# call printf with stack pointer aligned to
# 16 byte boundary (no misaligned error)
call _printf
# remove parameters
add $8, %esp
# pop old stack pointer (this also removes
# all other extra allocated space, since
# it directly sets the stack pointer to
# what it was before)
pop %esp

A very similar thing can be done for malloc, except with slightly different parameters (there is one parameter instead of two). This must be done for both functions to avoid a runtime error whenever they are called.
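For malloc, the padding changes because only one parameter is pushed: the saved stack pointer and the parameter take 8 bytes, so 8 more bytes are subtracted to reach a multiple of 16. The sketch below builds that sequence. The name emitAlignedMalloc is hypothetical, and the code is only meant for Mac OS X.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not provided code): Mac OS X only.
std::string emitAlignedMalloc(int bytes) {
    assert(bytes >= 0);
    std::string code;
    code += "  mov %esp, %eax\n";          // save the old stack pointer
    code += "  and $0xFFFFFFF0, %esp\n";   // force a multiple of 16
    code += "  sub $8, %esp\n";            // padding: 8 + 4 + 4 = 16 bytes
    code += "  push %eax\n";               // old stack pointer, restored later
    code += "  push $" + std::to_string(bytes) + "\n";  // size parameter
    code += "  call _malloc\n";            // stack pointer is aligned here
    code += "  add $4, %esp\n";            // remove the parameter
    code += "  pop %esp\n";                // restore the old stack pointer
    code += "  push %eax\n";               // object pointer is the result
    return code;
}
```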

Additionally, the compiler on Mac OS X by default generates "position independent executables" (or PIE), which means that absolute addressing should not be used. However, directly referring to the address of printstr is an absolute address. To force the compiler to generate non-PIE output, pass the -Wl,-no_pie flag to the compiler (this passes the no_pie option to the linker). A modified compilation line for Mac OS X might look like:

gcc -Wl,-no_pie -m32 -o executable tester.c code.s

Resources

Some resources which you may find extremely useful for this project:

The caller save registers are: %eax, %ecx, and %edx. The callee save registers are: %ebx, %esi, and %edi.

Requirements

To obtain full credit for this project, your solution will need to:

  • Satisfy all Project 3, 4, and 5 requirements.
  • Note: You don't have to keep all the same offsets in your symbol table as were used in Project 5.
  • Generate code for valid and semantically correct input programs.
  • Generate code which assembles successfully and runs without crashing.
  • Execute on any input without segmentation faults or any kind of crashes.

It is highly recommended to verify your solution output for the provided test cases. We have included a script which generates expected output for the provided test cases. More information about how to verify your output against the expected output can be found in the Grading section.

Make sure that you compile, run, and test your program on the CSIL server, especially if you write your program on your own machine. Your assembly code may run completely differently on a different architecture, so it is extremely important to make sure that all above requirements are satisfied when building on CSIL.

Deliverables

You should submit to GradeScope the files lexer.l, parser.y, typecheck.cpp, and codegeneration.cpp with complete lexer and parser implementations, complete AST building code, complete symbol table building and type checking code, and complete assembly code generation.

You also must include a README file which includes your name, perm number, email address, code buddy (if you work with one), and any issues with your solution and explanations.

You do not need to submit the test cases, makefile, or the unmodified source files.

Grading

Your grade will be based on the proportion of test cases for which your program produces assembly code which correctly implements the input program. We have provided some test cases and their expected output along with the code files.

All test cases should parse successfully with no errors and output valid x86 Assembly code.

To run your solution against the test cases, you can type make run. This will run your solution against all the test cases in the tests folder, then will compile each assembly file generated and run it using the tester.c driver code. This will print the return value of the main in each test. The print statements in the input program should also print the correct values.

We have also included a make diff target, which will compare the expected output against what your assembly code returns for each test. If you run this command and you see no errors and no output from diff, then your generated assembly returned and printed the correct values for each test.