CS 160 Lecture Notes – Fall '20

\( \newcommand\set[1]{\left\{#1\right\}} \)

1 Introduction

1.1 How will the quarter work?

1.1.1 Lectures

In order to provide maximum flexibility for student schedules and to minimize possible infrastructure problems, the lectures are going to be asynchronous: there will be ~2.5 hours of videos to watch every week, which is roughly the same amount of time as regular lectures. We are aiming to release the lectures for each week on the Monday of that week.

1.1.2 Discussion Sections

Discussion sections are going to complement the lectures: The sections will be online via Zoom and recorded. The sections are going to mainly cover examples relevant to the current lecture and/or the current assignment.

The Zoom link for the discussion sections:

https://ucsb.zoom.us/j/84143344381?pwd=VTNtN08rK1ZWb29OSEd2UGFyem9XQT09

Meeting ID: 841 4334 4381 Passcode: 135207

1.1.3 Office Hours

We are going to cover all weekdays with office hours, spreading them out to accommodate different time zones. The office hours are (all in Pacific Time):

Day Time Person Meeting Link/Info
Monday 12:30-1:30 Mehmet https://meet.google.com/etu-edoi-ovz
Tuesday 11-12 Sean Zoom Meeting ID: 889 5281 3200, Password: 815434
Wednesday 12:30-1:30 Mehmet https://meet.google.com/etu-edoi-ovz
Thursday 3-4 Lawrence Zoom Meeting ID: 319 727 3015
Friday 11-12 Pingyuan Zoom Meeting Link

1.1.4 Slack

We are going to use Slack as the primary communication medium. You can ask questions on Slack (please use one thread per question), create private Slack groups for discussion, and direct message the instructors. Here are some Slack tips for this class.

1.1.5 Assessment

The assessment is based on programming projects and weekly quizzes. There are no exams or attendance requirements.

  1. Quizzes

    We deliberately do not have attendance in this course. However, in order to make sure that you are on top of the covered material, there are weekly quizzes. Each quiz is based on the lectures for that week and the quizzes are intended to be simple, and to measure basic knowledge about the material. The quiz for each week is going to be posted by Friday, and it will be due Saturday of the same week. The quizzes are graded in a pass/fail manner. In order to pass a quiz, you need to get at least half of the problems correct (rounded up).

  2. Programming Projects

    The primary method of assessment for this class is the programming projects. We are aiming for 7 programming projects, with an average of 10 days per assignment. Since there is no final for this class, the last programming project will be due at the end of finals week.

    The programming projects are going to be graded out of 2 points (only integer points will be awarded for programming projects).

  3. Late (Slip) Day Policy

    There are other things going on in all of our lives. To accommodate emergencies and conflicting deadlines, we give you 6 slip days for the class. Here are the rules for the slip days:

    • The slip days are applied in discrete increments. So, for example, submitting an assignment late by 2 minutes, and by 23 hours both count as using 1 slip day.
    • You can use at most 3 slip days for one assignment. The main reason for this restriction is that the assignments usually depend on the solution of the previous assignment, and we want to make our solutions available as soon as possible so that your progress is not impeded by missing a working version of an earlier part.
    • We do not accept late submissions to the last assignment because of the university deadline regarding grade submission.
    • We apply slip days automatically; if you submitted an assignment late and do not have enough slip days, we grade the latest submission that you do have enough slip days for (if there is any).

    If you have a situation that prevents you from submitting the coursework in time (such as a medical emergency), you can contact the instructor. We also drop the lowest quiz grade and the lowest assignment grade to accommodate such situations.

  4. Grading

    Here is the overall rubric for the class, assuming we have all 7 programming assignments and 9 weekly quizzes. In the case that we need to reduce the number of assignments because of some unexpected scheduling problems, we are going to let you know about the changes to the main rubric.

    Each assignment is graded out of 2, and we discard the lowest assignment score. So, the assignments are worth 12 points. Each quiz is worth 0.5 points and we discard the quiz with the lowest score. So, the quizzes are worth 4 points, and the total points for the class is 16.

    In order to pass the class, you have to:

    • Pass (get 1 out of 2) in at least 4 assignments.
    • Pass at least 5 quizzes.

    The grades are assigned according to the following table:

    Points Letter Grade
    15 A
    14.5 A-
    14 B+
    13.5 B
    13 B-
    12.5 C+
    12 C
    11.5 C-
    11 D+
    10.5 D
    9 D-

    Note that there is a gap between D and D- in the table above.
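
    For example, a hypothetical student who scores 2 on five assignments and 1 on the other two has an assignment total of 2 × 5 + 1 = 11 after the lowest score is dropped; if the same student also passes 7 of the 9 quizzes, one failed quiz is dropped and the quiz total is 7 × 0.5 = 3.5, for 14.5 points overall, which is an A- according to the table.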

  5. Academic Integrity

    Unfortunately, this is not a pleasant topic to bring up. We need to remind you of the university policies regarding academic dishonesty, and that you must submit your own original work. You can look up the website of the Office of Student Conduct for more details on this topic.

    An important reminder: do not put your work in a public repository. There have been cases in the past where some students found and copied other students' code in public repositories on GitHub. Unfortunately, in such situations, it can be tricky to figure out who copied from whom, and whether both parties participated willingly. Using version control systems is good software engineering practice, and we do not discourage it. However, if you are using common source code hosting sites like GitHub and GitLab, both allow you to create private repositories for free (GitHub allows private repositories at least when you join the UCSB GitHub organization), and you should use private repositories for this class.

1.1.6 Prerequisites

From the prerequisites CS 130A and CS 138, you should know:

  • C++ programming
  • Tree data structure, and tree traversals
  • Finite automata (DFA, NFA) and regular expressions
  • Pushdown automata, and context-free grammars

Also, CS 64 is listed as a prerequisite for this class. Brushing up on your assembly knowledge would be helpful since we are going to generate assembly code in this class.

1.2 Course Outline

Compilers is a vast field. We could spend the entire quarter on just a single part of it. For example, there are graduate-level classes on only type systems, or on my main research area, program analysis. Since this is an introductory course, we are instead going to cover the parts of a compiler for a simple C-like language, and dip our toes into an assortment of topics.

We are going to start with a simple language \(C\flat\) and a compiler for it, C1. Our simple language is going to have these features (we are going to talk about them in more detail later):

  • There is only one data type: 32-bit integers.
  • We have arithmetic and relational operations over these integers. We treat them as Booleans for the purpose of logical operations like ∨ (or), ∧ (and), ¬ (not).
  • We also have variables and assignment.
  • Finally, we have some basic control structures like conditionals, loops, and functions.

You can think of this language as C without pointers, memory allocation, or structs.

Our initial compiler for \(C\flat\) is going to have 3 stages (lexing, parsing, and code generation). We are going to go over what these stages are in later lectures. With these 3 stages, we are going to have a fully working compiler that turns our programs into machine code!

Later on, we are going to extend our compiler. We are going to have:

  • A simpler intermediate representation of the program.
  • Compiler optimizations over this intermediate representation to make the programs faster.

Depending on the pace of the course, we may also explore some topics like:

  • Register allocation
  • AST-level optimizations
  • Other parsing strategies

Since \(C\flat\) is a pretty simple language, we are going to extend \(C\flat\) to create a new language Cortado. Cortado is going to have structs, and automatic memory management. To support these new features we are going to extend the compiler, and we are going to implement a garbage collector.

Finally, having structs in our language means that now we have different data types. To make sure that our programs do not have type errors, we are going to define a type system, and a type checker in our compiler.

Also, I am going to try including some interesting and rather recent developments in each topic as resources to check out. You are not expected to delve into these topics. However, if any of these topics is interesting to you, you can contact me to discuss it or to ask for further resources.

2 Parsing

The tokens we get from the lexer are the terminal symbols of our programming language. We describe the syntax of the programming language as a context-free grammar over those terminals. Note that we now use tokens rather than ASCII characters to represent terminals in our concrete syntax description. Although the concrete syntax description uses lexemes for readability (\(\mathit{whileLoop} ::= \texttt{while } \mathit{rexp} \texttt{ \{ } \mathit{block} \texttt{ \}} \) is much more readable than \( \mathit{whileLoop} ::= \texttt{< WHILE > } \mathit{rexp} \texttt{ < LBRACE > } \mathit{block} \texttt{ < RBRACE >} \)), our terminals for the parser are tokens like <ID,foo>, <ASSIGN>, <IF>.

2.1 CFG Review

Let's review some CFG concepts from CS 138. By investigating how to use CFGs to recognize programming languages, we are going to describe parsing.

2.1.1 CFG as a generator

Here is an example CFG for arithmetic expressions, with numbered rules (we will use these numbers in our derivations to show which rules are applied when):

\begin{align*} \mathit{exp} \in Expression & ::= [1] id \;|\; [2] \mathit{exp} \, \mathit{op} \, \mathit{exp} \;|\; [3] (\mathit{exp}) \\ \mathit{op} \in Operator & ::= [4] + \;|\; [5] - \;|\; [6] \times \;|\; [7] \div \end{align*}

We can think of this CFG as a generator of strings (by applying the derivation rules, starting from \(\mathit{exp}\)). Here \( \to_n \) means we are applying derivation rule \( n \):

\begin{align*} \mathit{\color{blue}{exp}} & \to_{2} \mathit{\color{blue}{exp}} \, \mathit{op} \, \mathit{exp} \\ & \to_{2} \mathit{\color{blue}{exp}} \, \mathit{op} \, \mathit{exp} \, \mathit{op} \, \mathit{exp} \\ & \to_{1} x \, \mathit{\color{blue}{op}} \, \mathit{exp} \, \mathit{op} \, \mathit{exp} \\ & \to_{5} x - \mathit{\color{blue}{exp}} \, \mathit{op} \, \mathit{exp} \\ & \to_{1} x - y \, \mathit{\color{blue}{op}} \, \mathit{exp} \\ & \to_{6} x - y \times \mathit{\color{blue}{exp}} \\ & \to_{1} x - y \times z \end{align*}

Here, the nonterminal we expand at each stage is in blue. After applying these derivation rules, we say that \( \mathit{exp} \) derives the string "x - y × z". We could also represent the derivation another way, as a tree (where we insert a node for each nonterminal we expand):

exp-parse-tree.png

Note that the start symbol where we begin the derivation is at the root of the tree, and the derived string (sequence of terminals) is read off of the leaves. A derivation tree is also called a parse tree. It shows the syntactic structure of the string being derived. A parse tree is not quite the same as an AST: parse trees encode some information that is abstracted away in an AST. We will talk about the difference between them later.

The idea of CFGs as generators is used a lot in automated test generation to generate valid test inputs according to a grammar. It is also one of the techniques we use to generate test programs to test your compilers in this class.

  1. Exercise

    For the expression grammar above, give a parse tree for the following input. Also notice that there are multiple possible parse trees for this input:

    x + y - x * z + z

2.1.2 Leftmost and rightmost derivations

The derivation above is called a leftmost derivation because we always expanded the leftmost nonterminal. We could choose to expand any nonterminal in any order. For example, we could do a rightmost derivation to derive the same string:

\begin{align*} \mathit{\color{blue}{exp}} & \to_{2} \mathit{exp} \, \mathit{op} \, \mathit{\color{blue}{exp}} \\ & \to_{1} \mathit{exp} \, \mathit{\color{blue}{op}} \, z \\ & \to_{6} \mathit{\color{blue}{exp}} \, \times \, z \\ & \to_{2} \mathit{exp} \, \mathit{op} \, \mathit{\color{blue}{exp}} \times \, z \\ & \to_{1} \mathit{exp} \, \mathit{\color{blue}{op}} \, y \times \, z \\ & \to_{5} \mathit{\color{blue}{exp}} \, - \, y \times \, z \\ & \to_{1} x - y \times z \end{align*}
exp-parse-tree.png

Notice that we still get the same parse tree. So, the parse tree is independent of the order of derivation.

2.1.3 CFG as a recognizer

We can also think of a CFG as a recognizer, rather than a generator: Given an input string, is there a derivation from the start symbol to that string? For example:

Input string: x - y × z \[ \mathit{exp} \to^{*?} x - y \times z \]

For this case, the answer is yes and we have the parse tree as the evidence. This idea is the essence of parsing: Proving that a string is recognized by a grammar by producing a parse tree showing the derivation.

2.1.4 Ambiguity

There is a fly in the ointment, the same one we ran into with lexing: ambiguity. A grammar is ambiguous if there exists at least one string that has more than one parse tree.

Let's take the first example grammar and the string "x - y × z":

exp-parse-tree.png
exp-parse-tree2.png

This is a problem for programming languages: The two parse trees imply different orders of operations, so they may evaluate to different values. For example, let's evaluate both parse trees under the environment \( x \mapsto 3, y \mapsto 4, z \mapsto 5 \):

For the first parse tree,

exp-parse-tree-eval.png

which evaluates to \( -1 \times 5 = -5 \). For the second parse tree,

exp-parse-tree-eval2.png

which evaluates to \( 3 - 20 = -17 \). So, the syntactic structure of the program influences its meaning. This is similar to ambiguous sentences in English. Consider "She saw the man with a telescope": depending on how we parse this sentence, we can get either of these meanings:

  • She used a telescope to see the man.
  • She saw the man, who was holding a telescope.

This is an important problem to resolve: we do not want our programs' results to be undetermined and depend on the whim of the compiler. This problem may come up in many places. A famous example is the "dangling else" problem. Consider the following grammar for if statements with an optional else part:

\begin{align*} \; \mathit{stmt} \; & ::= \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{stmt} \; \texttt{else} \; \mathit{stmt} \; \\ & | \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{stmt} \; \\ & | \; \mathit{assignment} \\ \end{align*}

Now consider the following program:

a := 1;
if x then if y then a := 2; else a := 3;

Suppose x is false; what is the value of a?

Answer: it depends on how we parse the program, and which if statement the else block belongs to. If we parse the second line like the following, then the a := 3 statement will be executed (because x is false, the else part of the blue if statement is executed). So, the value of a will be 3 after running the program:

dangling-else-tree-1.png

Note that I did not draw the subtrees for the assignments for the sake of brevity. Another valid parse tree is the one below; note that now the else block belongs to the red if statement:

dangling-else-tree-2.png

Now, since x is false, nothing in the red if statement is executed, so the final value of a is 1. So, we got different results for the same program! Moreover, the parts of the program that are executed are different. This is highly unintuitive and unreliable, so we are going to work on how to circumvent ambiguity in the next section.

  1. Exercise

    For the expression grammar above, and the input from the previous exercise (which is also below), give two more parse trees.

    x + y - x * z + z

    For each of the three parse trees, give the result assuming x = 1, y = 2, z = 3.

2.1.5 Determinism

So, ambiguity prevents us from giving one exact meaning to our programs. Maybe we can try preventing ambiguity. Unfortunately, it turns out that determining whether a grammar is ambiguous is undecidable. However, there is a characteristic that guarantees unambiguity and is efficiently decidable: determinism.

Determinism is easiest to explain with pushdown automata (PDA). Recall that a PDA is a finite automaton with a stack; the transitions between states can depend on both the current input symbol and the symbol on top of the stack. Also recall that, unlike DFA and NFA, which are equivalent, deterministic PDA (DPDA) and nondeterministic PDA are not equivalent: PDA are strictly more powerful than DPDA. A PDA is deterministic if the finite automaton part can make at most one transition from a state for a given input symbol and top of stack.

We know from 138 that PDA are equivalent to CFGs: We can transform PDA → CFG and CFG → PDA while recognizing the same language. Since DPDA are weaker, there must be some CFGs that cannot be expressed as DPDA. It turns out that this includes all ambiguous CFGs (along with some unambiguous ones as well).

A CFG that can be expressed as a DPDA is called a DCFG. DCFGs are guaranteed to be unambiguous, and there is a well-defined algorithm to check whether a grammar is a DCFG or not. Here is a Venn diagram showing the relationship between the languages recognized by these:

Lecture 8 - Venn Diagram.jpeg

2.1.6 Dealing with ambiguity

Remember that ambiguity is a characteristic of the grammar, not the language: the same language can have multiple grammars that describe it, and some of them may be ambiguous while others are not. For example, the grammar for the "dangling else" problem above is ambiguous, but there are unambiguous grammars for that language. There are languages such that all grammars describing them are ambiguous; however, this does not happen for programming language grammars in practice.

What this means is that even if we start with an ambiguous grammar for our language, such as the concrete grammar given in the handout (which is definitely ambiguous, as it contains the arithmetic expression grammar above that we showed is ambiguous), we can transform that grammar into a new, deterministic (and thus unambiguous) grammar that describes the same language. Of course, this cannot work for all context-free languages because DCFGs are strictly weaker than CFGs, but again it works for practical programming languages.

Also, sometimes (as in the case of our classroom programming languages) programming language designers deliberately choose a syntax that is parser-friendly in order to generate better error messages. So, our general strategy will be to take the original, potentially ambiguous grammar and transform it to be deterministic while still describing the same language. If it turns out that the original language cannot be described with a DCFG, then we will have to change the language so that it can be.

Let's look at the "dangling else" grammar again:

\begin{align*} \; \mathit{stmt} \; & ::= \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{stmt} \; \texttt{else} \; \mathit{stmt} \; \\ & | \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{stmt} \; \\ & | \; \mathit{assignment} \\ \end{align*}

the ambiguous input with two parse trees was:

a := 1;
if x then if y then a := 2; else a := 3;

We can transform the grammar as follows:

\begin{align*} \; \mathit{stmt} \; & ::= \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{withElse} \; \texttt{else} \; \mathit{stmt} \; \\ & | \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{stmt} \; \\ & | \; \mathit{assignment} \\ & \\ \; \mathit{withElse} & ::= \; \texttt{if} \; \mathit{exp} \; \texttt{then} \; \mathit{withElse} \; \texttt{else} \; \mathit{withElse} \; \\ & | \; \mathit{assignment} \end{align*}

Notice that this grammar recognizes the same language as the naïve grammar above. However, this grammar does not have a way to associate the else block with the outermost (the first) if statement (if we have an else block bound to an if statement, then the then block inside can contain only if statements with else blocks). So, only the first way of parsing the program is valid, and this grammar does not have the ambiguity the first one has (the technique in this grammar is what C uses to resolve this ambiguity). However, this grammar limits the programmer: we cannot specify that we want the else to be bound to the outer if. A more complete grammar resolves this problem by adding braces to delimit the blocks; indeed, both our language C♭ and C use braces to overcome this restriction.

  1. Exercise

    Given the following input, show that it has at least two parse trees using the original conditional grammar and that it has a single parse tree using the new conditional grammar.

    if x then if y then if z then a := 1; else a := 2; else a := 3;
    

2.1.7 Another solution

There is another solution to resolve ambiguity: Changing the language itself (i.e., the concrete syntax) to not have the ambiguity in the first place rather than trying to find an unambiguous grammar:

  • Change the arithmetic expressions to require explicit parentheses
  • Delimit if statements with a terminating token like endif
  • etc.

Some languages like the Lisp family carry this idea to an extreme and use parentheses everywhere to make everything unambiguous. This makes writing a parser trivial, and removes considerations like operator precedence. For example, the statement if x < y then a := 2; could be represented like the following in a Lisp:

(if (< x y)
    (:= a 2))

However, programmers commonly dislike this solution because it makes them type more, and humans are more familiar with the standard mathematical notation. Since programmers are the main consumers of programming languages, this means that most languages have to work at dealing with ambiguity without messing with the language syntax too much.

2.2 Parsing Strategies

So now we know what parsing is, but how do we do it? Let's survey the landscape before we drill down on LL(k), the specific strategy we will be discussing (and implementing) for this class. There are many different strategies, but they can all be divided into two basic approaches: top-down and bottom-up. Let's look at the example expression grammar from before:

\begin{align*} \mathit{exp} \in Expression & ::= [1] id \;|\; [2] \mathit{exp} \, \mathit{op} \, \mathit{exp} \;|\; [3] (\mathit{exp}) \\ \mathit{op} \in Operator & ::= [4] + \;|\; [5] - \;|\; [6] \times \;|\; [7] \div \end{align*}

and some input like "x - y * z". Our goal is to create a parse tree (for now do not worry about the fact that this grammar is ambiguous).

2.2.1 Top-down parsing

We start from the root of the parse tree and work our way down to the leaves, selecting productions such that when we get to the leaves they will match the string. This seems like it involves a lot of guess-work, but clever versions of this approach won't need to guess anything.

See the lecture 9 video for top-down parsing in action. We are going to use a top-down parsing strategy in this class. However, let's discuss other parsing strategies first.

2.2.2 Bottom-up parsing

Start from the leaves (i.e., the input string) and work our way up the tree, picking productions in reverse.

See the lecture 9 video for bottom-up parsing in action.

2.2.3 Strategies based on algorithms

The parsing strategies can also be divided based on which grammars they actually work for. There are parsing algorithms that will handle any CFG: CYK, Earley, GLL, GLR, …

These algorithms are all \( O(n^3) \) in the worst case, which is way too expensive to be desirable for compilers that operate on large programs (millions of lines of code). They also produce parse forests instead of parse trees because of ambiguity, which means that after parsing we need some way to deterministically select which parse tree we actually want.

GLL and GLR are somewhat recent additions that have the nice property that their complexity depends on the grammar: if the grammar is unambiguous they are \( O(n) \); the more ambiguous the grammar is, the more the asymptotic complexity grows, until it hits \( O(n^3) \). This means that we can have grammars that are only a little bit ambiguous (which can be convenient in some cases) without a big penalty. However, they are fairly complicated algorithms, and even when \( O(n) \) the constant factor is pretty high. Still, there are commercial frontends for languages like C++ that use GLL or GLR. Also, ANTLR (a popular parser generator) generates parsers from a given grammar based on a variant of LL(*) that incorporates techniques from GLR.

Classically, compilers focus on parsing algorithms for deterministic grammars (DCFGs). There is a classic parsing algorithm called LR (the name stands for Left-to-right, Rightmost derivation in reverse), a bottom-up algorithm, that can work for any deterministic grammar.

There is another classic parsing algorithm called LL(\(k\)) (the name stands for Left-to-right, Leftmost derivation; \( k \) is called the lookahead and determines how far ahead the parser needs to peek), a top-down algorithm, that works for a large subset of deterministic grammars. The difference in expressiveness between LL(\(k\)) and LR is exactly due to the fact that LR is bottom-up and LL(\(k\)) is top-down. Specifically, we have the following relationship between the classes of languages recognized by these algorithms:

\begin{align*} LL(1) \subsetneq LL(2) \subsetneq \ldots \subsetneq LR \subsetneq CFL \\ GLL = GLR = CFL \end{align*}

Even though LL(k) is not as expressive as LR, it's still a popular choice and one that can work for most programming languages. For example, both GCC and Clang/LLVM use parsers based on LL(k) for their C/C++ frontends. We will see why it is a popular choice once we describe how it works.

2.3 Recursive Descent and LL(1)

We are going to build an "almost LL(1)" parser for C♭ (there is only one part of the grammar that is not LL(1), and it turns out to be LL(2), so our parser is still an LL(k) parser). Before attempting to do so, let's understand how LL parsing works. Then, we can discuss how to transform a grammar (such as the one in the handout) into something suitable for LL(1).

A common implementation strategy for top-down parsers in general is called recursive descent. There are other ways to implement them, but recursive descent is used a lot. I will describe a naive recursive descent implementation first, then show how LL(1) can make it efficient.

To illustrate the concepts, let's use the following CFG:

\[ S \; ::= \; a \, S \, a \; | \; b \, S \, b \; | \; c \]

We know that we can convert any CFG into an equivalent PDA (there are, in fact, a number of possible transformations). We will use the transformation that yields a top-down strategy (here we add the derivation rules as transitions, and the stack holds the nodes of the parse tree (i.e., the nonterminals) whose children we have not filled in yet):

top-down-pda1.png

Here, \( q_0 \) is the starting state, $ is the end-of-stack marker, \( q_2 \) is the accepting state, and the transition labeled "x/y→z" means "on input x, pop y from the top of the stack and push z to the stack".

Note that this is not a DPDA because there are multiple possible transitions we can have from \( q_1 \) when there is an S on top of the stack. For now, we will assume that we have an oracle which tells us the right choice to make. Let's see what happens with the input abcba. We will go from \( q_0 \) to \( q_1 \), consume the input, then go to the accepting state. The interesting part is what happens to the stack while we are in \( q_1 \) (here "[xyz]abc" means we consumed "xyz" from the input and the stack has the nodes "abc"):

\begin{align*} S & \to aSa \\ & \to [a]Sa \\ & \to [a]bSba \\ & \to [ab]Sba \\ & \to [ab]cba \\ & \to [abc]ba \\ & \to [abcb]a \\ & \to [abcba] \end{align*}

The stack is going through a derivation of the string: We start by pushing the start symbol, then one of two things happens:

  1. the top symbol is a terminal, so we try to match it against the input
  2. the top symbol is a nonterminal, so we expand it

This builds a derivation starting from the start symbol and working towards the input string, i.e., a top-down strategy. In fact, because of the way we're pushing things on the stack we're always going to expand the leftmost nonterminal, so we're getting a leftmost derivation.

What if we didn't have the oracle to tell us the right choice? Then we would need to guess, and if we guess wrong we would have to backtrack: Roll everything back to a previous guess and guess something different.

\begin{align*} S & \to aSa \\ & \to [a]Sa \\ & \to \textbf{failure: backtrack!} \end{align*}

Notice that the PDA stack is what's keeping track of the derivation and what we need to match with the input. It turns out that we do not actually need to explicitly translate the CFG into a PDA to parse the input; instead we can take advantage of the implicit call stack (a.k.a. function stack) that programming languages use to enable function calls and recursion. Every time we recursively call a function we get a new instance of all the parameters and local variables. This happens because the parameters and locals are stored on the call stack: whenever a function is called a new "stack frame" is pushed onto the stack, and whenever a function returns its stack frame is popped off of the stack.

We can use the call stack as the PDA stack so we can track the derivation without creating an explicit PDA. This is the core idea behind recursive descent:

Create a set of mutually recursive functions, one per nonterminal. Each function A() will have a case for each rule \( A ::= \alpha\beta\gamma \ldots \). When the function is called it will try each case in turn until successful (or it runs out of cases and signals failure).

Suppose we're in a case for the rule \( A ::= \alpha_1 \alpha_2 \ldots \alpha_n \), where each \( \alpha \) may be a terminal or nonterminal. Loop starting from \( i = 1 \):

  1. if \( \alpha_i \) is a terminal, try to match it to the current input character.
    1. if successful, consume the character and set \( i := i+1 \).
    2. if failed, backtrack to the state when the function was entered and try the next case.
  2. if \( \alpha_i \) is a nonterminal, call the corresponding function.
    1. if the function returns successfully, set \( i := i+1 \).
    2. if the function signals an error, backtrack to the state when the function was entered and try the next case.

Since \( \alpha_1 \ldots \alpha_n \) is usually short, we write out the iterations of the loop above explicitly in most cases. Let's look at our example again:

\[ S \; ::= \; a \, S \, a \; | \; b \, S \, b \; | \; c \]

We have only one nonterminal \( S \), so we will have only one recursive function S() (note that this is just pseudocode):

S() {
   old_input_pos = curr_input_pos;

   try { // case: aSa
       match(a);
       S();
       match(a);
   } else try { // case: bSb
       curr_input_pos = old_input_pos;
       match(b);
       S();
       match(b);
   } else try { // case: c
       curr_input_pos = old_input_pos;
       match(c);
   } else {
       throw failure;
   }
}

match(token) {
  if (token == input[curr_input_pos]) { curr_input_pos++; }
  else throw failure;
}

Here, try { ... } else { ... } means "try the first block; if it throws a failure, try the else block".

Let's see what happens with input abcba (here each indentation shows code inside a new stack frame):

call S()
  enter case 1
  match(a): success
  call S()
    enter case 1
    match(a): failure
    enter case 2
    match(b): success
    call S()
      enter case 1
      match(a): failure
      enter case 2
      match(b): failure
      enter case 3
      match(c): success
    return from S()
    match(b): success
  return from S()
  match(a): success
return from S()
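
To make the pseudocode concrete, here is one way it could be written as a small, self-contained C++ program. This is only an illustrative sketch (not the code we will use for the projects): the input is a plain string over the characters a, b, c, and failure is signaled with an exception so that backtracking is just "catch the failure and restore the input position".

#include <cstddef>
#include <iostream>
#include <string>

// A tiny backtracking recursive descent recognizer for S ::= aSa | bSb | c.
struct ParseFailure {};

std::string input;
std::size_t pos = 0;

void match(char token) {
    if (pos < input.size() && input[pos] == token) { ++pos; }
    else { throw ParseFailure{}; }
}

void S() {
    std::size_t old_pos = pos;
    try {                                   // case: aSa
        match('a'); S(); match('a');
        return;
    } catch (ParseFailure&) { pos = old_pos; }
    try {                                   // case: bSb
        match('b'); S(); match('b');
        return;
    } catch (ParseFailure&) { pos = old_pos; }
    match('c');                             // case: c (on failure, propagate to the caller)
}

int main() {
    input = "abcba";
    try {
        S();
        std::cout << (pos == input.size() ? "accepted" : "rejected (leftover input)") << "\n";
    } catch (ParseFailure&) {
        std::cout << "rejected\n";
    }
}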

The obvious problem with this approach is that it's extremely inefficient: it is exponential in the size of the input, because we create choices, and the possibility of backtracking, at potentially every input position (in fact, for some grammars, like \( E ::= E + E \;|\; x \), it never terminates at all). The solution we are going to use for the exponential time complexity is LL(1) parsing:

2.3.1 LL(1) recursive descent

Let's go back to our PDA for our example grammar. Note that we can make it a DPDA by adding something called lookahead. First, recall our original PDA:

top-down-pda1.png

We are going to add extra states that will apply the appropriate transition after reading a lookahead symbol:

top-down-pda2.png

Here, the transitions from \( q_1 \) to \( q_a, q_b, q_c \) read the lookahead symbol, then we transition to the only state that is valid based on the lookahead symbol we read and update the stack. Then, when we go back to \( q_1 \), we remove the lookahead symbol from the stack. Try running the PDA above on the input "abcba". It's completely deterministic! What does that mean for the recursive descent version? It means no backtracking:

S() {
   if (next_token is a) { // case: aSa
       match(a);
       S();
       match(a);
   } else if (next_token is b) { // case: bSb
       match(b);
       S();
       match(b);
   } else if (next_token is c) { // case: c
       match(c);
   } else {
       throw failure;
   }
}

match(token) {
  if (token == input[curr_input_pos]) { curr_input_pos++; }
  else throw failure;
}

Let's run the code above and see what happens on input "abcba":

call S()
  enter case 1
  match(a): success
  call S()
    enter case 2
    match(b): success
    call S()
      enter case 3
      match(c): success
    return from S()
    match(b): success
  return from S()
  match(a): success
return from S()

As we can see, there are no failures or backtracking.
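
The predictive version translates to real code just as directly. Here is a matching illustrative C++ sketch, where next_token is realized by peeking at the current input character; note that there is no saved position and no backtracking anywhere:

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

// An LL(1) recursive descent recognizer for S ::= aSa | bSb | c.
// One character of lookahead (peek) decides which case to take.
std::string input;
std::size_t pos = 0;

char peek() { return pos < input.size() ? input[pos] : '\0'; }  // '\0' = end of input

void match(char token) {
    if (peek() == token) { ++pos; }
    else { throw std::runtime_error("parse error"); }
}

void S() {
    switch (peek()) {
        case 'a': match('a'); S(); match('a'); break;   // case: aSa
        case 'b': match('b'); S(); match('b'); break;   // case: bSb
        case 'c': match('c'); break;                    // case: c
        default:  throw std::runtime_error("parse error");
    }
}

int main() {
    input = "abcba";
    try {
        S();
        std::cout << (pos == input.size() ? "accepted" : "rejected (leftover input)") << "\n";
    } catch (const std::runtime_error&) {
        std::cout << "rejected\n";
    }
}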

  1. Special case: ε

    When we have an ε production, there is no token for that production to match. So, we just treat it as the default case (i.e., the thing to do if none of the other cases apply). Because of the structure of an LL(1) grammar, this is guaranteed to be safe. Example:

    \begin{align*} A & \; ::= \; xy \; | \; Bz\\ B & \; ::= \; wy \; | \; ε\\ \end{align*}

    Here is the implementation:

    A() {
      if (next_token is x) { match(x); match(y); }
      else { B(); match(z); }
    }
    
    B() {
      if (next_token is w) { match(w); match(y); }
    }
    
  2. Wrap-up

    A grammar with the property that we can use 1 token of lookahead to make it completely deterministic is an LL(1) grammar. We will formalize this property soon and show how we can try to transform a grammar so that it has this property. It doesn't work for all grammars (even all deterministic grammars), but it works often enough.

  3. Exercise

    Given the following grammar:

    \begin{align*} S & \; ::= \; aPb \; | \; Qc \; | \; cRd \; | \; TcP\\ P & \; ::= \; QR \; | \; TR \; | \; ε\\ Q & \; ::= \; fR \; | \; b\\ R & \; ::= \; d \; | \; gbc\\ T & \; ::= \; ea \; | \; Ra \end{align*}

    Write pseudocode for a recursive descent LL(1) parser, then trace it for the inputs "afgbcdb" and "daceagbc".

2.3.2 Why is it called "LL(1)"?

If we look at the operation of the LL(1) parser, we can observe the following:

  1. it reads the input from left to right.
  2. it tracks a leftmost derivation.
  3. it uses 1 token of lookahead.

Thus, it is "[L]eft-to-right, [L]eftmost derivation, [1] token of lookahead", or LL(1). We can actually use any constant number of lookahead tokens, though the more we use the more expensive things get; the general class is called LL(k). There are more general algorithms (like the one used by earlier versions of ANTLR) called LL(*) that use a DFA for lookahead instead of a constant number of tokens.

2.4 Transforming to LL(1)

We know now how to implement a suitable grammar using a recursive descent LL(1) parser; now we will talk about how to make a grammar suitable. There are 3 problems that we are going to tackle in this section:

  1. Resolving some ambiguity by encoding operator precedence in our grammar.
  2. Making sure that the parser terminates by preventing infinite recursion.
  3. Making sure that the parser can choose which production to use after reading one token by making the grammar predictive.

Note that so far we have not covered how to build an actual parse tree or AST; we have just returned a Boolean (or thrown a failure). We will ignore building the AST for now. Once we have a suitable grammar, we are going to tweak it to build the AST.

2.4.1 Establishing precedence

The initial grammar we picked was a grammar for arithmetic expressions. Here is a shorter version of it that has only addition and multiplication (for the sake of keeping the transformations later short):

\begin{align*} E ::= \mathit{id} \; | \; E + E \; | \; E \times E \; | \; (E) \end{align*}

The cause of ambiguity with that grammar was that the grammar did not specify which operators take precedence over which, so it allowed parsing a string such as x + y × z in two ways:

  • (x + y) × z
  • x + (y × z)

We need to enforce only one of these two interpretations to disambiguate the input. We do this by specifying the relative precedence of the operators. For example, conventionally, multiplication has higher precedence than addition, so we would parse the input string as x + (y × z). Several strategies have been developed to enforce precedence in a grammar (for example, specifying it alongside the grammar and handling precedence separately in the parser; one example of this is the Shunting Yard algorithm). We are going to opt for a classical solution that involves modifying the grammar itself, encoding the precedence into the structure of the grammar.

Firstly, we are going to decide the precedence of operators. In our case, let's choose the standard order of precedence (higher to lower): {()} > {×,÷} > {+,-}.

Then, we are going to create a nonterminal for each level of precedence. We can re-use our original nonterminal for the lowest level of precedence. Let's use nonterminals with some mnemonics from arithmetic: E (expression) for the lowest precedence operator +, T (term) for the next one ×, and F (factor) for the highest one, the parentheses. Now we are going to factor out the operations into the appropriate nonterminal for their level of precedence:

\begin{align*} E ::= E + T \; | \; T \\ T ::= T \times F \; | \; F \\ F ::= (E) \; | \; id \end{align*}

Notice the following properties:

  1. Each nonterminal rule that applies an operator has one operand that is the same nonterminal again, while the other operand is the nonterminal for the next-highest precedence level.

    This allows the expression to have multiple of the same precedence level operators in a row. Choosing which side is which controls the associativity of the operator: Having the same nonterminal on the left makes the operator left associative; having it on the right makes the operator right associative.

  2. Each nonterminal (except the highest precedence, F) allows either applying an operator at that level of precedence or directly falling through to the next level.

    This allows the expression to not have any operators at the lower precedence level, e.g., an expression that doesn't have + in it.

  3. The base cases for expressions (identifiers, constants, etc.) are always at the highest level of precedence.

    This allows the expression to just be an identifier or constant without any operators at all.

Let's look at some examples:

  1. x + y × z
  2. x + y + z
  3. x × y + z
  4. (x + y) × z

To be added after lectures: Draw the only valid parse tree for each
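
For instance, the first input has exactly one parse tree under this grammar; writing the identifiers directly in place of id as before, its leftmost derivation shows that × ends up below + in the tree, i.e., the input is grouped as x + (y × z):

\begin{align*} E & \to E + T \to T + T \to F + T \to x + T \\ & \to x + T \times F \to x + F \times F \to x + y \times F \to x + y \times z \end{align*}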

We have looked at this using arithmetic operators as our examples, but it applies in many other situations, including the following:

  1. relational and logical operators: !(x < y) && (z < y)
  2. subscript operators: a + b[i]
  3. type casts: (double)a / b
  1. Exercise

    Here is a toy grammar:

    A ::= A R A | p A | a
    R ::= w | x | y | z
    

    Suppose we define the following precedence levels: {p} > {w,x} > {y,z}, and say that the operators are left associative. Rewrite the grammar to enforce the correct precedence, then verify that the following example inputs are handled correctly:

      p a y a w a z a x a ==> (((p a) y (a w a)) z (a x a))
    a z p a w p a x a y a ==> ((a z (((p a) w (p a)) x a)) y a)
    

    Solution:

    A ::= A y B | A z B | B
    B ::= B w C | B x C | C
    C ::= p C | a
    

2.4.2 Dealing with left recursion

Once we have established precedence, we have another problem to worry about: nontermination. This problem is an artifact of the way we are implementing the parser using recursive descent, i.e., implementing the PDA implicitly using recursive function calls. Consider the factored grammar from before:

\begin{align*} E ::= E + T \; | \; T \\ T ::= T \times F \; | \; F \\ F ::= (E) \; | \; id \end{align*}

Assume that we implement this using a recursive descent parser, as explained in a previous lecture, and let's see what happens on an example input "x + y × z".

call E()
  enter case E + T
  call E()
    enter case E + T
    call E()
    ...

Notice that we just keep recursively calling E() forever (or until we get a stack overflow). Why does this happen?

Answer: Because there is a recursive cycle in the grammar \( E \to^{*} E \) that does not consume any input tokens, i.e., there is no call to match() between invocations of E().

This is an example of what's called left recursion. A grammar is left recursive if there exists a derivation \( A \to^{*} A \alpha \) for some nonterminal \( A \). Any recursive descent parser may fail to terminate for a left recursive grammar.

  1. Direct, indirect, and hidden left recursion

    We can define three different kinds of left recursion (though some texts lump the second two into a single category):

    Direct left recursion
    there is a production of the form \( A ::= A \alpha \). There are two examples of direct left recursion in the previous grammar example.
    Indirect left recursion
    there is a set of mutually recursive productions that allow a left-recursive derivation.
    • Example:

      \begin{align*} A ::= B \; | \; \texttt{alice} \\ B ::= C \; | \; \texttt{bob} \\ C ::= A \, \texttt{charlie} \end{align*}

      Consider the derivation \( A \to B \to C \to A \texttt{charlie} \). It is left-recursive.

    Hidden left recursion
    like indirect except there is an epsilon rule that hides the left recursion.
    • Example:

      \begin{align*} A ::= B \; | \; \texttt{alice} \\ B ::= C \; | \; \texttt{bob} \\ C ::= DA \texttt{charlie} \\ D ::= \texttt{dave} \; | \; \varepsilon \end{align*}

      Consider the derivation \( A \to B \to C \to DA \texttt{charlie} \to A \texttt{charlie} \).

    In order to create an LL(1) recursive descent parser for a grammar, we must transform the grammar to remove all left recursion. We will cover how to remove each kind of left recursion below.

  2. Removing direct left recursion

    Direct left recursion is the easiest to fix: any left-recursive production can be changed to an equivalent right-recursive set of rules as follows:

    Given:

    \begin{align*} A ::= Aα \; | \; β \; | \; γ \end{align*}

    where the Greek letters are sequences of terminals and nonterminals not starting with A. This grammar specifies "βα* or γα*". We can transform it into a non-left-recursive grammar that specifies the same language:

    \begin{align*} A & ::= \, βA' \; | \; γA' \\ A' & ::= \, αA' \; | \; ε \end{align*}

    We are expressing the same strings in different ways. A left recursive rule must have some non-recursive base case, like β and γ above (otherwise the nonterminal would be non-productive, as the derivation could never terminate). The left recursion is saying that we can repeat the recursive part as many times as we want (α above) and then finish with one of the base cases. The rewritten rules say the same thing, but put the recursion on the right instead of the left. This means that a recursive descent parser must consume a terminal from the input before making the recursive call. So, since the input is finite, the parser must terminate.
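
    To see that the two forms generate the same strings, here is the string βαα derived both ways:

    \begin{align*} A & \to Aα \to Aαα \to βαα & \textrm{(left recursive rules)} \\ A & \to βA' \to βαA' \to βααA' \to βαα & \textrm{(right recursive rules)} \end{align*}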

    Note that we assume α is not ε. If \( \alpha = \varepsilon \), then we have the rule \( A ::= A \), which we can trivially delete.

    What if we have multiple left-recursive rules for A? We are going to show how to deal with 2 left-recursive rules; the idea generalizes to an arbitrary number of left-recursive rules. Given:

    \begin{align*} A ::= Aα \; | \; Aβ \; | \; γ \end{align*}

    Then:

    \begin{align*} A & ::= \, γA' \\ A' & ::= \, αA' \; | \; βA' \; | \; ε \end{align*}

    Again, there always has to be at least one rule that is not left recursive or we can trivially delete all the rules because the non-terminal is non-productive.

    1. Example 1
      E ::= E + T | E - T | T
      T ::= (E) | id
      

      Becomes

      E  ::= T E'
      E' ::= + T E' | - T E' | ε
      T  ::= (E) | id
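
      As a sanity check that the transformed grammar is suitable for recursive descent, here is an illustrative C++ sketch of it as mutually recursive functions (single lowercase letters stand in for id tokens, an assumption made only for this sketch). Every cycle of recursive calls consumes a token before recursing, so the parser must terminate:

      #include <cctype>
      #include <cstddef>
      #include <iostream>
      #include <stdexcept>
      #include <string>

      // Recursive descent recognizer for:
      //   E  ::= T E'
      //   E' ::= + T E' | - T E' | ε
      //   T  ::= (E) | id
      std::string input;
      std::size_t pos = 0;

      char peek() { return pos < input.size() ? input[pos] : '\0'; }
      void match(char t) {
          if (peek() == t) { ++pos; }
          else { throw std::runtime_error("parse error"); }
      }

      void E();  // forward declaration for the E/T cycle

      void T() {
          if (peek() == '(') { match('('); E(); match(')'); }
          else if (std::islower(static_cast<unsigned char>(peek()))) { ++pos; }  // id
          else { throw std::runtime_error("parse error"); }
      }

      void Eprime() {                       // E'
          if (peek() == '+') { match('+'); T(); Eprime(); }
          else if (peek() == '-') { match('-'); T(); Eprime(); }
          // otherwise: ε, consume nothing
      }

      void E() { T(); Eprime(); }

      int main() {
          input = "x+y-z";
          try {
              E();
              std::cout << (pos == input.size() ? "accepted" : "rejected (leftover input)") << "\n";
          } catch (const std::runtime_error&) {
              std::cout << "rejected\n";
          }
      }

      Note how Eprime() consumes + and - from left to right but recurses to the right; this is what the exercise below means by the operators becoming right-associative after the transformation.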
      
      1. Exercise

        Go through parsing the input x + y - z. Notice that the operators are right-associative after the transformation.

    2. Example 2
      E ::= E + F | F
      F ::= F * G | G
      G ::= (E) | id
      

      Becomes

      E  ::= FE'
      E' ::= + FE' | ε
      F  ::= GF'
      F' ::= * GF' | ε
      G  ::= (E) | id
      
      1. Exercise

        Go through parsing the input x + y * z.

  3. Removing indirect left recursion

    There are several possible strategies for removing indirect recursion. We are going to look at a classical strategy. The basic idea is (1) to inline productions to turn the indirect recursion into direct recursion, then (2) apply the transformation that removes direct recursion.

    Let's see an example of that in action:

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, A \texttt{charlie} \end{align*}

    We can start from \( A \), \(B \), or \( C \) to expose the indirect left recursion:

    \begin{align*} A \to B \to C \to A \texttt{charlie}\\ B \to C \to A \texttt{charlie} \to B \texttt{charlie}\\ C \to A \texttt{charlie} \to B \texttt{charlie} \to C \texttt{charlie} \end{align*}

    We are going to inline productions for one of the nonterminals. It doesn't matter which one we choose, as long as we break the left recursive cycle. Let's pick A and inline once:

    \begin{align*} A & ::= \, C \; | \; \texttt{bob} \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, A \texttt{charlie} \end{align*}

    Then inline again:

    \begin{align*} A & ::= \, A \texttt{charlie} \; | \; \texttt{bob} \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, A \texttt{charlie} \end{align*}

    Then remove the direct left recursion:

    \begin{align*} A & ::= \, \texttt{bob} A' \; | \; \texttt{alice} A'\\ A' & ::= \, \texttt{charlie} A' \; | \; ε\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, A \texttt{charlie} \end{align*}

    Notice that \(B \) and \( C \) are no longer reachable from the starting nonterminal A; in general this may or may not be true.

    Let's see what happens if we had picked \( C \) instead of \(A\) to start with:

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, B \texttt{charlie} \; | \; \texttt{alice} \texttt{charlie} \end{align*}

    and inline again:

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, C \texttt{charlie} \; | \; \texttt{bob} \texttt{charlie} \; | \; \texttt{alice} \texttt{charlie} \end{align*}

    Now remove the direct left recursion:

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, \texttt{bob} \texttt{charlie} C' \; | \; \texttt{alice} \texttt{charlie} C'\\ C' & ::= \, \texttt{charlie} C' \; | \; ε \end{align*}

    It can be tedious and error-prone to manually try to find left recursive cycles. We can apply an algorithm to preventatively transform the grammar so that left recursive cycles cannot possibly happen. This gives us a larger and more complex grammar than we might get if we did it manually because it transforms rules even when they are not left recursive.

    The key insight of the algorithm is to order the nonterminals arbitrarily. Then a left recursive cycle can only possibly happen if a nonterminal with order \(i\) directly derives a nonterminal with order \(j\) s.t. \(j < i\). We then inline the lower-ordered nonterminal into this rule. Continue this process for all rules, then remove any direct left recursion from the final rule set:

    Let the nonterminals be arbitrarily labeled \(A_1 \ldots A_n\). Then:

    For i = 1 to n:

    1. for each rule of the form \( A_i ::= \, A_j α \) with \( j < i \), inline \(A_j\) into the right-hand side of \(A_i\)
    2. transform any direct left recursive rules for \(A_i\)

    Let's look at the following example again, assuming that \(A < B < C\):

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, A \texttt{charlie} \end{align*}

    Here are the steps of the algorithm:

    1. rule \( A ::= \, B \) → CHECK
    2. rule \( A ::= \, \texttt{alice} \) → CHECK
    3. rule \( B ::= \, C \) → CHECK
    4. rule \( B ::= \, \texttt{bob} \) → CHECK
    5. rule \( C ::= \, A \texttt{charlie} \) → \(A < C\), inline: \(C ::= B \texttt{charlie} \; | \; \texttt{alice} \texttt{charlie} \)
    6. rule \( C ::= \, \texttt{alice} \texttt{charlie} \) → CHECK
    7. rule \( C ::= \, B \texttt{charlie} \) → \( B < C\), inline: \(C ::= C \texttt{charlie} \; | \; \texttt{bob} \texttt{charlie} \)
    8. rule \( C ::= \, \texttt{bob} \texttt{charlie} \) → CHECK
    9. rule \( C ::= \, C \texttt{charlie} \) → CHECK
    10. done

    Now there is one direct left recursive rule: \(C ::= \, C \texttt{charlie}\). Transform that and we're finished.

  4. Removing hidden left recursion

    Note that the algorithm for preventing indirect left recursion doesn't work if there is hidden left recursion. For example, given the following grammar:

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, DA \texttt{charlie}\\ D & ::= \, \texttt{dave} \; | \; ε \end{align*}

    We can apply the algorithm above, and it will go through each rule and find that they all CHECK, meaning that it cannot identify the hidden left recursion.

    There are two basic ways to handle hidden left recursion, which end up being pretty similar. The first is to transform the grammar to eliminate ε rules. There is a standard algorithm to do so, which you learned in 138 when transforming a grammar into Chomsky normal form. The second is to (1) compute which nonterminals are nullable, i.e., which nonterminals can derive ε (directly or indirectly), then (2) modify the above algorithm to take this information into account. This ends up looking a lot like what you would do to remove ε rules altogether. The result of applying ε elimination to the grammar above is:

    \begin{align*} A & ::= \, B \; | \; \texttt{alice}\\ B & ::= \, C \; | \; \texttt{bob}\\ C & ::= \, DA \texttt{charlie} \; | \; A \texttt{charlie}\\ D & ::= \, \texttt{dave} \end{align*}
  5. Exercise

    Remove all left recursion in the following grammar:

    A ::= BC
    B ::= CA | b
    C ::= AA | a
    

    Solution:

    [after inlining]
        A ::= BC
        B ::= CA | b
        C ::= CACA | bCA | a
    
    [after direct recursion elimination]
        A  ::= BC
        B  ::= CA | b
        C  ::= bCAC' | aC'
        C' ::= ACAC' | ε
    

2.4.3 Lookahead, formalized

Now that we have fixed precedence to ensure that we get the correct AST and we have removed left recursion to ensure termination, the remaining thing to worry about is making sure that the grammar is deterministic via lookahead (thus avoiding the need for backtracking).

Remember that lookahead means peeking ahead in the input to the next token(s) in order to determine which production rule to use next. We are specifically going to allow a constant amount of lookahead (in our case, 1 token, except when disambiguating assignments and function calls in C♭), which means that we will not be able to handle some grammars. For any fixed amount of lookahead we can come up with a grammar that requires more lookahead to become deterministic. However, 1-token lookahead turns out to be good enough in most cases we care about.

Our intuition is that the property we are looking for in a grammar is that, for any given nonterminal A that has multiple rules, looking at the next token in the input is sufficient to determine which rule we need to pick. A simple example is our grammar from the LL(1) parsing section:

\[ S \; ::= \; a \, S \, a \; | \; b \, S \, b \; | \; c \]

If we have an \(S\) symbol and need to expand it with one of these rules, by looking at the next token in the input we can tell which rule to use: a: \(aSa\), b: \(bSb\), c: \(c\). For example, on the input abcba:

\begin{align*} S & \to aSa & \textrm{Lookahead: \texttt{a}, pick } S \to aSa \\ & \to abSba & \textrm{Lookahead: \texttt{b}, pick } S \to bSb \\ & \to abcba & \textrm{Lookahead: \texttt{c}, pick } S \to c \end{align*}

We say that a grammar with this property is a predictive grammar because it lets us predict which rule to apply. Let's formalize this property. First, assume that there are no ε rules; then we will expand the formalization to handle ε rules.

  1. formal property without ε

    Let α, β be strings of grammar symbols including both terminals (T) and nonterminals (NT). We are going to define a set of terminals relevant to prediction:

    \[ \textsf{FIRST}(α) = \set{ t ∈ T | α \to^{*} tγ } \]

    In other words, \(\textsf{FIRST}(α)\) is the set of the first terminals that can be derived from α with 0 or more applications of the grammar's production rules.

    There is an algorithm to compute \(\textsf{FIRST}\) sets (see the top-down parsing chapter in Cooper & Torczon, or slide 3 of this slide deck), but for simple grammars we can do it by inspection; a small sketch of the computation appears at the end of this subsection.

    • Example

      The grammar:

      \begin{align*} S & ::= \, AB \\ A & ::= \, xBw \; | \; yBz \; | \; Bwz \\ B & ::= \, 0 \; | \; 1 \end{align*}

      The first sets:

      \begin{align*} \textsf{FIRST}(S) & = \set{x,y,0,1} \\ \textsf{FIRST}(A) & = \set{x,y,0,1} \\ \textsf{FIRST}(B) & = \set{0,1} \\ \textsf{FIRST}(xBw) & = \set{x} \\ \textsf{FIRST}(yBz) & = \set{y} \\ \textsf{FIRST}(Bwz) & = \set{0,1} \end{align*}

    A grammar can be parsed with no backtracking using a lookahead of 1 if the following holds: For any nonterminal A such that \(A ::= \, α \; | \; β \), this must be true: \( \textsf{FIRST}(α) \cap \textsf{FIRST}(β) = \emptyset \).

    During parsing when we are expanding an A node, we just need to look at the next token \( t \) in the input to determine which production to use (i.e., the one that has \(t\) in its \(\textsf{FIRST}\) set). Since the formal property says that the \(\textsf{FIRST}\) sets are disjoint, we know there is only one production that we can take.
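
    The fixpoint computation behind the \(\textsf{FIRST}\) algorithm is short enough to sketch. The C++ code below is only an illustrative sketch for grammars without ε rules (the representation, with uppercase letters as nonterminals and everything else as terminals, is an assumption made for the example): it keeps propagating first-terminals until nothing changes, after which the disjointness check above can be applied to each pair of alternatives.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // FIRST-set fixpoint for grammars WITHOUT ε rules.
    // The grammar below is the S/A/B example from above.
    using Grammar = std::map<char, std::vector<std::string>>;

    bool isNonterminal(char c) { return c >= 'A' && c <= 'Z'; }

    std::map<char, std::set<char>> firstSets(const Grammar& g) {
        std::map<char, std::set<char>> first;
        bool changed = true;
        while (changed) {                     // iterate until a fixpoint is reached
            changed = false;
            for (const auto& [lhs, alts] : g) {
                for (const std::string& alt : alts) {
                    char x = alt[0];          // no ε, so only the first symbol matters
                    std::set<char> add = isNonterminal(x) ? first[x] : std::set<char>{x};
                    for (char t : add) {
                        changed |= first[lhs].insert(t).second;
                    }
                }
            }
        }
        return first;
    }

    int main() {
        Grammar g = {
            {'S', {"AB"}},
            {'A', {"xBw", "yBz", "Bwz"}},
            {'B', {"0", "1"}},
        };
        for (const auto& [nt, f] : firstSets(g)) {
            std::cout << "FIRST(" << nt << ") = { ";
            for (char t : f) { std::cout << t << ' '; }
            std::cout << "}\n";
        }
    }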

  2. formal property with ε

    Now, suppose that we do have ε productions (which is likely if we needed to remove left recursion).

    1. Example 1
    \begin{align*} S & ::= \, AB \\ A & ::= \, xBw \; | \; yBz \; | \; Bwz \\ B & ::= \, 0 \; | \; 1 \; | \; ε \end{align*} \begin{align*} \textsf{FIRST}(S) & = \set{x,y,0,1,w} \\ \textsf{FIRST}(A) & = \set{x,y,0,1,w} \\ \textsf{FIRST}(B) & = \set{0,1} \\ \textsf{FIRST}(xBw) & = \set{x} \\ \textsf{FIRST}(yBz) & = \set{y} \\ \textsf{FIRST}(Bwz) & = \set{0,1,w} \end{align*}

    Mostly we can just "read past" the ε to fill in the \(\textsf{FIRST}\) sets, but sometimes this isn't sufficient.

    2. Example 2

    \begin{align*} A & ::= \, xBA \; | \; f \\ B & ::= \, xwB \; | \; ε \end{align*} \begin{align*} \textsf{FIRST}(A) & = \set{x,f} \\ \textsf{FIRST}(B) & = \set{x} \\ \textsf{FIRST}(xBA) & = \set{x} \\ \textsf{FIRST}(f) & = \set{f} \\ \textsf{FIRST}(xwB) & = \set{x} \\ \textsf{FIRST}(ε) & = \set{} \end{align*}

    At a glance, this looks predictive: for both A and B productions, the \(\textsf{FIRST}\) sets of the alternative productions are disjoint. But, consider parsing the input xxf, which is in the language:

    1. A expands to xBA
    2. consume x from input
    3. B expands to xwB
    4. consume x from input
    5. ERROR: terminal w doesn't match token f

    The correct way to parse this would be to expand B to ε, then A to xBA, then B to ε, then A to f. But we cannot figure this out just by looking at the next token in the input. The problem is that the ε allows us to "throw away" the nonterminal at any time, and we do not know whether we should do that or not.

    Enter \(\textsf{FOLLOW}\) sets. For any nonterminal \(A\), \(\textsf{FOLLOW}(A)\) is the set of terminals that can immediately follow any expansion of A. There is also an algorithm for computing this (see the same resources as for the \(\textsf{FIRST}\) sets, if you are curious), but again for simple grammars we can do it by hand. For the example above:

    \begin{align*} \textsf{FOLLOW}(A) & = \set{} \\ \textsf{FOLLOW}(B) & = \set{x,f} \end{align*}

    Now we revisit the criterion for being a predictive grammar. In addition to the original requirement for \(A ::= \, α \; | \; β\), we add the following:

    If \(α \to^* ε\) then \(\textsf{FIRST}(β) \cap \textsf{FOLLOW}(A) = \emptyset \)

    In other words, we can tell by looking at the next token whether we should expand A to something or use the ε to throw A away.

    Consider our previous example and note that \( \textsf{FIRST}(xwB) \cap \textsf{FOLLOW}(B) = \set{x} \cap \set{x,f} = \set{x} \ne \emptyset \). Thus, this grammar is not predictive.
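    As a small sketch, the check itself is just set disjointness. Here the FIRST and FOLLOW sets are assumed to have been computed already (e.g., by the fixpoint iteration sketched earlier); in practice you would run this check for every pair of alternatives of every nonterminal.

    #include <set>
    #include <string>

    using TokSet = std::set<std::string>;

    bool disjoint(const TokSet& a, const TokSet& b) {
      for (const auto& t : a)
        if (b.count(t)) return false;
      return true;
    }

    // Predictiveness conditions from this section for one pair of alternatives α, β of A.
    bool predictive_pair(const TokSet& first_alpha, const TokSet& first_beta,
                         const TokSet& follow_A, bool alpha_derives_eps) {
      if (!disjoint(first_alpha, first_beta)) return false;  // FIRST sets must be disjoint
      if (alpha_derives_eps && !disjoint(first_beta, follow_A))
        return false;                                        // ε case uses FOLLOW(A)
      return true;
    }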

  3. Exercise

    Compute the \(\textsf{FIRST}\) and \(\textsf{FOLLOW}\) sets of the following grammar and confirm that it is not predictive:

    \begin{align*} A & ::= \, xBy \; | \; Bx \; | \; zCx \\ B & ::= \, wB \; | \; ε \\ C & ::= \, wC \; | \; DB \\ D & ::= \, yD \; | \; ε \end{align*}
    1. Solution
      Nonterminal  FIRST  FOLLOW
      A            w,x,z  ∅
      B            w      x,y
      C            w,y    x
      D            y      w,x
      • for A, \(\textsf{FIRST}(xBy)\) and \(\textsf{FIRST}(Bx)\) are not disjoint (both have x)
      • for C, \(\textsf{FIRST}(wC)\) and \(\textsf{FIRST}(DB)\) are not disjoint (both have w)

2.4.4 Left factoring

If a grammar is not predictive, sometimes we can transform it to make it predictive. One heuristic we can use is a transformation called left factoring.

Suppose we add arrays and functions to expressions (here \(\mathit{Factor}\) is the nonterminal for arguments of multiplication after establishing precedence in our grammar):

\begin{align*} Factor & ::= \, Id \; | \; Id [ ExpList ] \; | \; Id ( ExpList ) \\ ExpList & ::= \, Exp, ExpList \; | \; Exp \end{align*}

This is not predictive, but we can make it predictive:

\begin{align*} Factor & ::= \, Id Args \\ Args & ::= \, [ ExpList ] \; | \; ( ExpList ) \; | \; ε \\ ExpList & ::= \, Exp, ExpList \; | \; Exp \end{align*}

This transformation is called left factoring: we are "factoring out" the leftmost common parts of the rules for a nonterminal, so that all the remaining choices are moved to the beginning of a new nonterminal. In general, suppose we have:

\begin{align*} A & ::= \, αβ_1 \; | \; αβ_2 \; | \; αβ_3 \end{align*}

We are going to factor out the common prefix α, and create a new nonterminal \(Z\) for the suffixes:

\begin{align*} A & ::= \, αZ \\ Z & ::= \, β_1 \; | \; β_2 \; | \; β_3 \end{align*}

The overall left factoring algorithm applies the transformation above to all nonterminals repeatedly until we reach a fixpoint (there are no more common prefixes to factor out); a simplified sketch of this iteration is given below. Note that not all grammars can be made predictive (otherwise LL(1) parsers could parse any context-free language, including ambiguous ones), and it is undecidable whether a given grammar can be.
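Here is a simplified sketch of that iteration. The grammar representation is the same hypothetical one used earlier (nonterminal → list of productions); this version only factors out a single common first symbol per pass and just appends a prime for the fresh nonterminal name, so it is a sketch of the idea rather than a full implementation.

#include <map>
#include <string>
#include <vector>

using Production = std::vector<std::string>;            // empty vector represents ε
using Grammar = std::map<std::string, std::vector<Production>>;

// Factor one common first symbol out of one nonterminal; returns true if anything changed.
bool left_factor_once(Grammar& g) {
  for (auto& [nt, prods] : g) {
    std::map<std::string, std::vector<Production>> by_first;
    for (const auto& p : prods)
      if (!p.empty()) by_first[p.front()].push_back(p);  // group productions by first symbol

    for (auto& [first, group] : by_first) {
      if (group.size() < 2) continue;                    // nothing to factor for this symbol
      std::string z = nt + "'";                          // fresh nonterminal (assumes the name is unused)
      std::vector<Production> rest;
      for (const auto& p : prods)                        // keep productions not starting with `first`
        if (p.empty() || p.front() != first) rest.push_back(p);
      rest.push_back({first, z});                        // A ::= first Z
      for (auto p : group) {                             // Z ::= the suffixes (ε if a suffix is empty)
        p.erase(p.begin());
        g[z].push_back(p);
      }
      prods = rest;
      return true;                                       // changed; caller repeats until fixpoint
    }
  }
  return false;
}

void left_factor(Grammar& g) {
  while (left_factor_once(g)) {}                         // iterate to a fixpoint
}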

  1. Note about ε elimination

    One might think that we could get rid of \(\textsf{FOLLOW}\) sets by eliminating ε from the grammar as we said was possible when we discussed dealing with left recursion. The problem is that this elimination can render the grammar non-predictive, and applying left factoring to make it predictive again re-introduces the ε.

    Example:

    \begin{align*} T & ::= \, F T' \\ T' & ::= \, × F T' \; | \; ÷ F T' \; | \; ε \end{align*}

    Remove ε:

    \begin{align*} T & ::= \, F T' \; | \; F \\ T' & ::= \, × F T' \; | \; × F \; | \; ÷ F T' \; | \; ÷ F \end{align*}

    Note that the grammar is no longer predictive ( \(F\) is a common prefix of the rules for producing \(T\)) and we need to use left factoring, which reintroduces ε.

  2. Exercise

    Left-factor the following grammar:

    \begin{align*} A & ::= \, Bxy \; | \; Bxz \; | \; Bw \; | \; xyz \\ B & ::= \, p \; | \; q \end{align*}

    Solution:

    \begin{align*} A & ::= \, BC \; | \; xyz \\ B & ::= \, p \; | \; q \\ C & ::= \, xD \; | \; w \\ D & ::= \, y \; | \; z \end{align*}

2.5 Building an AST

A parse tree of a C♭ program contains more information than we really need or want; it shows exactly how a program was parsed using the grammar (and it contains information coming from the transformed grammar), but all we want is the underlying structure of the program. (The additional information in the parse tree may be useful for other applications, such as code formatting tools that need to keep track of parentheses and other parsing-related information.) This structure is represented in a data structure called the abstract syntax tree (AST).

Let's look at one of our arithmetic expression grammars:

\begin{align*} E & ::= \, F \, E' \\ E' & ::= \, + F \, E' \; | \; ε \\ F & ::= \, G \, F' \\ F' & ::= \, \times G \, F' \; | \; ε \\ G & ::= \, (E) \; | \; id \end{align*}

and compute the parse tree for "x + y × z":

exp-lre-parse-tree.png

This contains a lot of information that we do not care about, like exactly how we derived an identifier or an empty string. All we really need to know is the following:

exp-ast.png

This is why we call it an AST; it abstracts out the unimportant information from the parse tree and just gives us what we need. Notice that the representation above is the same as how we represented expressions in assignment 1. The full AST for C♭ is defined in ast.h of the assignment template, which you should read.

How do we compute the AST during parsing? We just insert the appropriate logic into the parsing functions: Given the exact input characters they recognized, they create the appropriate AST nodes and return the root of the tree they created.

So, what we do is (1) define the AST data structure; and (2) determine how to insert the appropriate logic into the parsing functions. Let's look at this in action on the grammar above.

2.5.1 Example AST for arithmetic expressions

We can define the AST in a way that looks like a grammar, but we will interpret it as describing a tree instead. For the example grammar, we can define the following AST:

\begin{align*} E & ::= \, id \; | \; \texttt{add} \, E \, E \; | \; \texttt{mul} \, E \, E \end{align*}

Notice that we do not care about ambiguity at all for the AST definition, because the concrete syntax grammar already took care of that for us. We are just describing the data structure that will hold the result of the parse. Think of E as the abstract base class of the AST and each right-hand side as a different type of node in the AST data structure (we are using plain old raw pointers here for the sake of brevity; the code for the assignments should use smart pointers):

class E { /* abstract base class for all expression nodes */ };
class Id  : public E { string name; };
class Add : public E { E* left_child; E* right_child; };
class Mul : public E { E* left_child; E* right_child; };

Now, the recursive descent parser for the grammar above would be:

void E() { F(); E'(); }
void E'() {
  if (next == '+') { match(+); F(); E'(); }
}
void F() { G(); F'(); }
void F'() {
  if (next == '*') { match(*); G(); F'(); }
}
void G() {
  if (next == '(') { match('('); E(); match(')'); }
  else { match(id); }
}

To create the AST we modify the functions appropriately, and produce AST nodes for the part of the input consumed by each parsing function:

E* E() {
  a = F();
  b = E'();
  if (b != nullptr) return new Add(a, b);
  else return a;
}
E* E'() {
  if (next == '+') {
    match(+);
    a = F();
    b = E'();
    if (b != nullptr) return new Add(a, b);
    else return a;
  }
  return nullptr;
}
E* F() {
  a = G();
  b = F'();
  if (b != nullptr) return new Mul(a, b);
  else return a;
}
E* F'() {
  if (next == '*') {
    match(*);
    a = G();
    b = F'();
    if (b != nullptr) return new Mul(a, b);
    else return a;
  }
  return nullptr;
}
E* G() {
  if (next == '(') {
    match('(');
    a = E();
    match(')');
    return a;
  }
  else {
    a = match(id);
    return new Id(a);
  }
}

Let's go through the expression x + y × z:

call E():
| a = F()
|   call F():
|   | a = G()
|   |   call G():
|   |     next ≠ '(' // it is x
|   |     a = match(id); // a = 'x'
|   |   return Id(a) // returns Id(x)
|   | // now a = Id(x) in F()
|   | b = F'()
|   |   call F'():
|   |     next ≠ '*' // it is +
|   |   return nullptr
|   | // now b = nullptr
|   return a // F returns Id(x)
| // a = Id(x) in E
| b = E'()
|   call E'():
|   | next == '+'
|   | match(+)
|   | a = F()
|   |   call F():
|   |   | a = G()
|   |   |   call G():
|   |   |     next ≠ '(' // it is y
|   |   |     a = match(id); // a = 'y'
|   |   |   return Id(a) // returns Id(y)
|   |   | // now a = Id(y) in F()
|   |   | b = F'()
|   |   |   call F'():
|   |   |   | next == '*'
|   |   |   | match(*)
|   |   |   | a = G()
|   |   |   |   call G():
|   |   |   |     next ≠ '(' // it is z
|   |   |   |     a = match(id); // a = 'z'
|   |   |   |   return Id(a) // returns Id(z)
|   |   |   | b = F'()
|   |   |   |   call F'():
|   |   |   |     next ≠ '*' // end of input
|   |   |   |   return nullptr
|   |   |   | b == nullptr
|   |   |   return a // returns Id(z)
|   |   | // b == Id(z)
|   |   return Mul(a, b) // returns Mul(y, z)
|   | b = E'()
|   |   call E'():
|   |     next ≠ '+' // end of input
|   |   return nullptr
|   | // b == nullptr
|   return a // returns Mul(y, z)
| // b = Mul(y, z)
return Add(a, b) // returns Add(x, Mul(y, z))

We built the AST we want. Note that this makes all the operators right associative because of right recursion; if we want them to be left associative we need to modify the way we build the AST (in the parsers of E' and F').
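For example, one common way to get left associativity is to pass the already-built left operand down into E' and let E' wrap it as it consumes each '+'. The sketch below is in the same pseudocode style as the parser above (it reuses F(), match, next, and Add, and EPrime plays the role of E'); it is one possible approach, not the required one.

E* E() {
  E* left = F();
  return EPrime(left);                    // EPrime plays the role of E' above
}
E* EPrime(E* left) {
  if (next == '+') {
    match(+);
    E* right = F();
    return EPrime(new Add(left, right));  // builds ((a + b) + c)..., i.e., left associative
  }
  return left;                            // no more '+': return what we have built so far
}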

2.5.2 Exercise

Given the following grammar:


A ::= A y B | A z B | B
B ::= B w C | B x C | C
C ::= p C | a

and the following AST definition:

AST ::= a | wOp AST AST | xOp AST AST | yOp AST AST | zOp AST AST | pOp AST

Apply left recursion elimination to make the grammar predictive, then write a recursive descent parser that produces an appropriate AST.

Verify that the input p a y a w a z a x a produces the AST:

exercise-ast.png

2.6 Validating the AST

Once we have the AST, we are almost finished. We still need to validate the AST to make sure it isn't nonsense (i.e., syntactically correct but semantically meaningless like the example English sentence "Colorless green ideas sleep furiously"). For our language there are only a few trivial checks that we need to perform:

  1. Is every variable declared before it's used?
  2. Is every called function defined?
  3. Is every parameter name and defined function name unique?

For (1) we just write a checker to do a recursive traversal of the AST with a set DECLARED that contains the variables declared in the current scope. It starts empty; each time we enter a block of code, we add in that block's declared variables. For each statement, we look at all variables it refers to and verify that they are in the current DECLARED set.

The only thing we need to be a little careful of is to distinguish between different scopes. Consider:

int x;
int y;
if (x < y) {
  int x;
  int z;
  y := x + z;
} else {
  y := x - z;
}

There is an error here because 'z' is not in scope in the false branch of the conditional. There's an easy way to ensure we handle this correctly: given a DECLARED set D, when the checker makes a recursive call we always pass in a copy of D rather than D itself. So in the above example, the initial D would be \(\set{x,y}\). We pass in a copy of D when making a recursive checker call to the true branch and modify the copy to contain \(\set{x,y,z}\). When the recursive call returns we throw away the copy and pass the original D when making a recursive checker call to the false branch. This keeps the branches separate and everything works correctly.
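To make the copy-the-set idea concrete, here is a minimal C++ sketch of check (1). The Block and Stmt types and their fields are stand-ins invented for this sketch; the real AST in ast.h looks different.

#include <set>
#include <string>
#include <vector>

struct Block;

struct Stmt {
  std::vector<std::string> used_vars;        // variables this statement reads or writes
  std::vector<const Block*> nested_blocks;   // bodies of nested if/while statements, if any
};

struct Block {
  std::vector<std::string> declared_vars;    // variables declared at the top of this block
  std::vector<Stmt> stmts;
};

// DECLARED is passed by value, so each nested block works on a copy and its
// declarations cannot leak back into the enclosing scope.
bool check_block(const Block& b, std::set<std::string> declared) {
  for (const auto& v : b.declared_vars) declared.insert(v);
  for (const auto& s : b.stmts) {
    for (const auto& v : s.used_vars)
      if (!declared.count(v)) return false;            // use of an undeclared variable
    for (const Block* nested : s.nested_blocks)
      if (!check_block(*nested, declared)) return false; // recursive call gets a copy
  }
  return true;
}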

For (2) it's even easier: Just collect the set of defined functions, then go through all call statements and check that they only call one of the defined functions. We also need to check that the call uses the correct number of arguments.

For (3) all we do is go through the defined functions and check that their names are unique, then for every function's parameter list go through and check that they are all distinct from each other. Note that many languages do not require unique function names as long as the function signatures are different; we will keep things simple and require uniqueness (it will also help keep the code generation phase simple).
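A minimal sketch of check (3), again using a made-up FuncDef type rather than the real AST node:

#include <set>
#include <string>
#include <vector>

struct FuncDef {
  std::string name;
  std::vector<std::string> params;
};

bool all_unique(const std::vector<std::string>& names) {
  std::set<std::string> seen;
  for (const auto& n : names)
    if (!seen.insert(n).second) return false;  // insert fails => duplicate name
  return true;
}

bool check_definitions(const std::vector<FuncDef>& funcs) {
  std::vector<std::string> func_names;
  for (const auto& f : funcs) {
    func_names.push_back(f.name);
    if (!all_unique(f.params)) return false;   // duplicate parameter name
  }
  return all_unique(func_names);               // duplicate function name
}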

In general the AST validation phase can be much more complicated. Type checking makes up a big piece, but since we only have one type it doesn't really matter to us. So, we are going to defer these checks to the backend of our compiler for now. We are going to do all of the checks above and more when we extend our language and introduce a type system later in the class.

2.6.1 Example

Consider the following program; it has several errors:

def foo(int a, int a) : int { x := bar(42); return x; }
def foo() : int { return 42; }
output foo(1, 2);

Let's build the DECLARED sets for the functions and the variables:

For variables (check 1):

Inside foo in line 1, there is only one scope and all parameters are declared inside that block. When we are building the DECLARED set for foo's body, we have a set containing \(\set{a}\). The assignment and the return expression both use the variable x, but it is not in the declared set, so those are errors of the first kind.

Inside foo in line 2, the DECLARED set is ∅. Since no variables are used inside the body of that function there are no errors there.

For the global scope, there are no variables declared either so the DECLARED set is ∅. There are no statements, and the output expression does not use variables so there are no errors there either.

Now, let's check if every called function is defined (check 2):

The only declared function is foo (although it is declared twice). On line 1, there is a call to bar but bar is not defined, so that is an error.

There is a call to foo on line 3, with 2 arguments and there is a definition of foo with 2 arguments so that call is valid.

For our last check (3):

In implementation, this check will occur before checks 1 and 2 because we would make this check when building the DECLARED sets. First, for functions: There are 2 definitions of foo so that is an error. Then, for the parameters of each function: the parameter a is defined twice on line 1, that is also an error.

3 Front-end wrap-up

We have gone through a lot of steps to get through the frontend; let's recap what's going on and put it all together.

given: grammar for concrete syntax of the language

  1. decide on precedence levels
  2. factor grammar to enforce desired precedence
  3. eliminate left recursion
  4. check if grammar is predictive
    • if not, apply left factoring
    • recheck if grammar is predictive now
    • if not, the concrete grammar needs to change somehow
  5. define AST data structure
  6. translate grammar to recursive descent parser
  7. install AST-building logic into parsing functions

given: a sequence of characters representing the source code

  1. lexer: convert to a sequence of tokens using a DFA/NFA
  2. parser: convert sequence of tokens into AST
  3. check AST for validity

Note that we have simplified some issues that happen in real-world languages like C, C++, Java, etc. These often require some creative thinking to get them to fit into the frontend framework we have covered here, mostly because of ambiguity in the lexemes and/or grammar or because lexemes are not regular. We carefully designed our language to get around these issues so the programs in our language can be parsed using the techniques we covered so far.

To give an example, in C++ the statement A * B; can be parsed in 2 ways and the parser needs to build contextual information to resolve it (so C++ is not context-free):

  • If A is a type like a class name, then it means "declare a variable B of type A *".
  • If A is a variable, then it means "call operator * method of A with argument B".

So, the parser needs to figure out whether A is a type name or not before parsing this statement. C++ compilers use hand-written parsers and build semantic information while parsing to resolve this ambiguity.

3.1 More on parsing

There are several topics related to parsing (and lexing) that we have not covered. I will list them here (none of them are required for the class), so if you want to check them out, you can reach out to me for resources (or search for them on the web):

Error recovery
Usually error recovery requires explicitly designing the parser to build error nodes, or using a parser generator with good error recovery capabilities. Another step that helps with error messages is to add position information to each token and AST node so the error messages can tell the programmer where the error occurs.
Parser generators
We have not covered parser generators in detail. They are tools that implement algorithms that transform a given grammar to produce a parser. You can look into ANTLR, bison, LALRPOP for some examples.
Parser combinators
Parser combinators are an idea that is popularized by the functional programming community. The idea is to build parsers, and combinators that operate over parsers so we can algebraically combine parsers to build more complicated ones. Parser combinators may be slow for non-predictive grammars (the performance depends a lot on the specific grammar). They can get more performant using memoization techniques like packrat parsers.
Parsing expression grammar (PEG)
PEGs are an alternative to CFGs for describing grammars. They are similar to CFGs, however they are unambiguous by design (specifically, "A | B" means "try A; if it fails, try B" in PEGs, so it is not the same as "B | A", whereas the two are interchangeable in CFGs).
Parsing with derivatives
This is a cool idea that generalizes derivatives of regular expressions to CFGs to build parsers for arbitrary CFGs. The naive algorithm is quite slow (it takes exponential time), and the optimized version takes cubic time. However, it is still a useful tool for experimenting with grammars in the language design phase.

4 Code generation, naïvely

Given an AST, the code generator's job is to translate it into a sequence of assembly instructions for our target architecture. For the purposes of these lectures I will use a generic assembly language; the assignments will target 32-bit x86 assembly instructions.

4.1 Where code generation fits in

Recall from earlier in class that the compiler is only one part of the pipeline from source code to executable:

source1 --[compiler]-> assembly --[assembler]-> object file 1 ---
source2 --[compiler]-> assembly --[assembler]-> object file 2 ---\
...                                                               |--[linker]-> executable
sourceN --[compiler]-> assembly --[assembler]-> object file N .../

To give an example, let's compile a simple C program:

~/lab/cs160 $ cat square.c
int square(int x) {
  return x * x;
}
~/lab/cs160 $ gcc -c square.c 
~/lab/cs160 $ objdump -t square.o

square.o:	file format Mach-O 64-bit x86-64

SYMBOL TABLE:
0000000000000000 g     F __TEXT,__text _square

Here, the object file defines a single symbol for the linker to find: _square for our square function. So, the linker can find where square is defined when linking this file with other object files. You can use objdump to inspect object files yourselves as well. Let's look at another part of the program that uses the square function:

~/lab/cs160 $ cat main.c
int square(int x);  // note that this is just a declaration

int main(int argc, char **argv) {
  return square(argc);
}
~/lab/cs160 $ gcc -c main.c
~/lab/cs160 $ objdump -t main.o

main.o:	file format Mach-O 64-bit x86-64

SYMBOL TABLE:
0000000000000000 g     F __TEXT,__text _main
0000000000000000         *UND* _square

Here, the symbol table for the object file (this is different from the symbol tables we will build, those will also contain local variables) lists main as a function defined in this object file, and square as an undefined object (that is what *UND* means in this output). So, when we call the linker to link the two object files, the linker will match the square function defined in square.o with the square function declared in main.o:

~/lab/cs160 $ gcc square.o main.o -o example-program
~/lab/cs160 $ objdump -t example-program 
example-program:	file format Mach-O 64-bit x86-64

SYMBOL TABLE:
0000000100000000 g     F __TEXT,__text __mh_execute_header
0000000100000f90 g     F __TEXT,__text _main
0000000100000f80 g     F __TEXT,__text _square
0000000000000000         *UND* dyld_stub_binder

4.1.1 Static vs dynamic libraries

Even after using the linker it's possible we still do not have all the code. It depends on whether we're using static linking or dynamic linking for libraries.

Static libraries have a '*.a' extension (for "archive") on most Unixes. They are linked in with the other code the same way as object files.

Dynamic libraries have a '*.so' extension (for "shared object"). They are not linked in at compilation time. Instead, they are linked in at load time when the code is being put into memory. The part of the operating system responsible for finding and loading the dynamic libraries is, appropriately, called the loader.

There are pros and cons to each approach, and both are widely used.

Static
Larger executable size, can be multiple copies in memory at once for different executables, but guaranteed to have the right version.
Dynamic
Smaller executable size, can be shared by multiple executables, but the library can get out of sync with the executable. For example, most C programs link dynamically against the standard library and other libraries that are shipped with the operating system. The standard library and other libraries are updated when the operating system is updated.

You do not have to be aware of this for writing the compiler, but you should know the difference in general. We will link statically against a bootstrap code that calls our main function, and dynamically against the C++ standard library because the bootstrap code will use it to print the output.

4.1.2 Process layout in memory

It will be useful to know how the process is laid out in memory while executing, so we can generate code that reads data from, and writes to, the appropriate places in memory. As a side note, we are working with virtual memory, not physical memory; this is the memory as seen by our program, the operating system is responsible for mapping physical memory to virtual memory, and the mapping is invisible to our program. So, the memory looks like this from our perspective:

--------- addresses go low to high in this direction ---->
(address 0) [code segment | static segment | data segment]

The code segment (also called the text segment on some architectures) contains the machine code. The static segment contains globals and constant strings, and the data segment is laid out as:

(low address) [heap | stack]

Note that this is where the terms "segmentation fault" (accessing the wrong segment; virtual memory is vast these days and the segments are laid out less rigidly, but the term has carried on and this simple mental image is useful) and "stack overflow" (using up too much stack space and starting to overwrite the heap) come from.

The heap is where memory is allocated from while the program is running (e.g., when you call new or malloc); it is generally handled by the language runtime.

The stack is for procedure calls and for allocating the space for local variables. Each call pushes a "stack frame" onto the stack, so each instance of a procedure has its own stack frame. Stack manipulation is generally handled by the compiler. The stack frame is where a procedure's local variables and argument values are stored, among other things; we will get more into procedures and stack frames later.

Note that the stack grows downwards; this is typical for most architectures, including x86, but not always true (there are some where it grows up, and even some where it goes in a circle).

Two important values with respect to the stack are:

The "stack pointer"
Points to the current top of the stack.
The "frame pointer"
(aka base pointer) Points to the bottom of the current stack frame.

The part of the stack between where the stack pointer and the frame pointer point to is the current stack frame.

4.1.3 Alignment and padding

Architectures generally have restrictions on alignment, i.e., how values can match up to word boundaries. When assigning memory locations to variables, we need to respect the alignment restrictions. For example, 32-bit floating point and integer values should begin on a full-word (32-bit) boundary.

A potential scheme for assigning memory locations that respects these boundaries is: place values with identical alignment restrictions next to each other; assign offsets from most restrictive to least restrictive; if needed, insert padding to satisfy the restrictions.

We won't have to worry about this for our language because we only have one type of thing: int32. But this is something that compiler writers have to worry about in general. For example, consider the following struct in C:

struct foo { char a; int b; short c; long long int d; };

Assume an architecture where char:8, short:16, int:32, and long long int:64; chars are byte-aligned, shorts are aligned to 2 bytes (half-word), ints are word-aligned, and long long ints are double-word-aligned. So we would have to lay this out in memory as (where each cell represents a byte):

|a| | | |b|b|b|b|c|c| | | | | | |d|d|d|d|d|d|d|d|

This is why it's bad practice when C/C++ programmers try to manually index into a struct, like:

foo x{8,16,32,64};
char y = *((char*)&x + 1);

This code tries to read the first byte of x.b, but it actually reads the padding between x.a and x.b that was inserted by the compiler, which is garbage. Note that laying out the fields of a struct in declaration order is a requirement in C's definition (and in most places in C++'s definition). Other languages may not give this guarantee and may rearrange this struct so that it takes only 15 bytes of data rather than 24 bytes while abiding by the architecture's alignment rules.
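If you want to see the padding for yourself, a small program like the following prints the field offsets and the struct size. The exact numbers depend on the platform and ABI; on a typical 64-bit target with the alignments assumed above they match the layout drawn earlier.

#include <cstddef>
#include <cstdio>

struct foo { char a; int b; short c; long long int d; };

int main() {
  std::printf("sizeof(foo) = %zu\n", sizeof(foo));        // 24 on the assumed target, not 15
  std::printf("offsets: a=%zu b=%zu c=%zu d=%zu\n",
              offsetof(foo, a), offsetof(foo, b),
              offsetof(foo, c), offsetof(foo, d));        // 0, 4, 8, 16 on the assumed target
  return 0;
}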

4.2 Symbol table

During compilation it's generally useful to have a single repository of information about symbols in the program (e.g., functions and variables) so that we can generate the code that accesses such objects. Here are some examples of information that we would want to keep track of for code generation:

  • functions: number of parameters and their types, …
  • variables: type, memory location, array dimensions, …
  • structs and records: their fields, layout, and size, …
  • etc

We can find this information out by traversing the code whenever we need it, but it's more efficient to figure it out once and store the information in a place where we can look it up whenever we need it.

Note that this information is used by the front-end, middle-end, and back-end, and that it isn't all available at the same time; the information gets filled in as it becomes available (e.g., variable types may be filled in before memory locations).

Naively we might think to implement the symbol table as a global map from symbol to info, but that doesn't quite work due to the issue of scope: Some variables are defined only for a delimited portion of the program.

4.2.1 Scope

Most programming languages have the concept of scope, i.e., an area of code for which a particular variable is defined. For example, every function has its own scope, and its parameters and locals are entirely independent of any other function's parameters and locals even if they are named the same:

int foo(int a, int b) { int x = a + 1; int y = b + 2; return x * y; }
double bar(double a, double b) { double x = a - 1; double y = b - 2; return x / y; }

We need to keep the info about foo's variables separate from the info about bar's variables. The first solution we might try is just to keep a separate symbol table for each scope (i.e., each function):

foo --> [a -> ..., b -> ..., x -> ..., y -> ...]
bar --> [a -> ..., b -> ..., x -> ..., y -> ...]

This is a good start (earlier versions of C and other early programming languages used this) but it also has a problem, namely, nested scope.

4.2.2 Nested scope

Scopes can be nested inside of each other, and variables in an inner scope can shadow variables in an outer scope. Consider the following example:

// SCOPE LEVEL 0
static int w;
int x;

void example(int a, int b) { // SCOPE LEVEL 1
  int c;
  { // SCOPE LEVEL 2a
    int b, z;
  }
  { // SCOPE LEVEL 2b
    int a, x;
    { // SCOPE LEVEL 3
      int c, x;
      b = a + b + c + x + w;
    }
  }
}

All modern programming languages use lexical scoping, which means that a variable always refers to its syntactically nearest definition. In other words, start from the scope the variable is being used in and see if that scope defines that variable; if so then that's the definition it refers to. Otherwise, look in the scope containing the current scope and check there. Keep going up the chain of enclosing scopes until you find the nearest one that defines the variable; that's the definition that variable use refers to.

level_0 --> level_1 --> level_2a
                    \-> level_2b --> level_3

For the above example, if we subscript each variable with the scope it is defined in, we see that the assignment in the example becomes:

b_1 = a_2b + b_1 + c_3 + x_3 + w_0;

Note that there's no way that any variable used in scope 2b can refer to scope 2a because 2a does not enclose 2b.

So how do we arrange the symbol table to make this resolution possible? The basic idea is to create a new table for each scope and chain them together in a way that mirrors the scoping tree we just saw. We will use the following data structure and API for symbol table creation (a C++ sketch follows the list):

  • SymbolTable: Struct for the symbol table. Fields:
    • parent: pointer to enclosing symbol table
    • table: map from symbol to info
  • currScope: pointer to current symbol table, initialized to nullptr /empty unique_ptr
  • CreateScope():
    • creates a new symbol table
    • sets the table's parent to currScope
    • sets currScope to the new table
  • Insert(symbol, info):
    • inserts symbol and its info in currScope's table
  • Lookup(symbol): looks up symbol to get its info
    • starts in currScope and keeps looking for symbol until it's found, walking up the parents.
  • EndScope():
    • sets currScope to currScope's parent
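Here is a minimal C++ sketch of that API. The Info struct is a placeholder for whatever per-symbol data we track (e.g., a stack offset), and memory management is ignored for brevity; the assignment's code should use smart pointers instead.

#include <map>
#include <string>

struct Info { int offset = 0; };            // placeholder for per-symbol information

struct SymbolTable {
  SymbolTable* parent = nullptr;            // enclosing scope (nullptr at the top level)
  std::map<std::string, Info> table;        // symbols declared in this scope
};

SymbolTable* currScope = nullptr;           // current (innermost) scope

void CreateScope() {
  auto* t = new SymbolTable();              // leaked in this sketch; real code would own it properly
  t->parent = currScope;
  currScope = t;
}

void Insert(const std::string& symbol, const Info& info) {
  currScope->table[symbol] = info;
}

const Info* Lookup(const std::string& symbol) {
  for (SymbolTable* s = currScope; s != nullptr; s = s->parent) {
    auto it = s->table.find(symbol);
    if (it != s->table.end()) return &it->second;   // nearest enclosing definition wins
  }
  return nullptr;                           // symbol not declared in any enclosing scope
}

void EndScope() {
  currScope = currScope->parent;            // sketch: does not free the old scope
}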

Check out the lecture videos to see how this structure is built for the example above.

4.2.3 Implementation in our compiler

C♭ is very simple, and so is our compiler, so a full-fledged symbol table with type information is kind of overkill. All we really need to keep track of is the memory location of each variable. Note that we still do need to worry about scope and nested scopes.

We will be doing code generation using a recursive traversal of the AST as described below. As part of the recursive traversal we will pass a map from variable to memory location (we will describe how to compute the memory locations in a bit). We use the currScope field of the code generator to keep the symbol table for the current scope.

Notice that this is essentially the scheme we talked about in an earlier lecture on validating the AST by checking that all variables were declared before being used.

We can get away with this scheme because we're not keeping around any other information, and once we're done with code generation for a given scope we do not need the memory location information, so we can just get rid of the maps as we leave each scope, and restore currScope to its original value.

4.2.4 Specifics of scope in C♭

Recall that the programs in C♭ are described as follows:

\begin{align*} program \in Program \; ::= \; functionDef^* \, stmtBlock \, \texttt{output } aexp \texttt{;} \end{align*}

In C♭, each statement block creates a new scope, so each block under a conditional or loop has its own scope. There are a few more details about scopes relating to the syntax above:

  • Each function's name is included in all function bodies including its own (so we can call functions recursively, in the global scope, or in other functions).
  • Each function's body includes the parameters in its own scope. The parameters are placed in a different part of the stack than the local variables. We are going to cover that later in this class.
  • The output expression can refer to the variables in the global scope, so you should treat it as part of the global scope.
  • Similarly, the return expressions in functions can refer to the parameters and locals inside the function.

4.3 Code generation strategies

4.3.1 Recursive AST traversal

A modern optimizing compiler will almost certainly translate the AST to a simpler intermediate representation (IR) before doing code generation. However, for our first compiler we will translate directly from the AST to assembly. The implementation of code generation will be based on a recursive traversal of the AST; as we traverse it we will emit the appropriate assembly so that by the time we have traversed the entire tree we have finished compilation.

Our compiler is written in C++, and the idiomatic way to do something like this in an object-oriented language is the "Visitor" design pattern. We covered it in the first assignment and the first discussion section; please revisit those materials if you do not remember the Visitor pattern.
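As a refresher, here is a bare-bones sketch of the pattern as it is typically used for AST traversals. The class names (Exp, Num, AddExp, PrintVisitor) are invented for this sketch and do not match the assignment's ast.h; a code generator would be one concrete visitor that emits assembly as it walks the tree.

#include <cstdio>

struct Num;
struct AddExp;

struct Visitor {
  virtual ~Visitor() = default;
  virtual void visit(const Num& n) = 0;
  virtual void visit(const AddExp& a) = 0;
};

struct Exp {
  virtual ~Exp() = default;
  virtual void accept(Visitor& v) const = 0;   // double-dispatch entry point
};

struct Num : Exp {
  int value;
  explicit Num(int v) : value(v) {}
  void accept(Visitor& v) const override { v.visit(*this); }
};

struct AddExp : Exp {
  const Exp* left;
  const Exp* right;
  AddExp(const Exp* l, const Exp* r) : left(l), right(r) {}
  void accept(Visitor& v) const override { v.visit(*this); }
};

// Example concrete visitor: prints the expression instead of emitting code.
struct PrintVisitor : Visitor {
  void visit(const Num& n) override { std::printf("%d", n.value); }
  void visit(const AddExp& a) override {
    std::printf("(");
    a.left->accept(*this);
    std::printf(" + ");
    a.right->accept(*this);
    std::printf(")");
  }
};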

Also note that because we are directly generating code from the AST rather than translating to an IR and then optimizing, the code we generate will be very inefficient with lots of redundant loads and stores; this is the price we pay for a simple, uniform code generation strategy. Later in class, we are going to cover how to implement optimizations over an IR to generate more efficient code.

There are two popular categories of machines that we can generate code for: stack-based, and register-based.

4.3.2 Stack-based machines

Often we can (or must) treat the target architecture as a stack machine: there is a system stack (logically distinct from the function stack/call stack) and when generating code, the operands and result are always on the stack and do not need to be explicitly stated. For those familiar with it, the code looks like reverse polish notation. For example, the following statement

x = (1 + 2) * 3

would generate the following instruction listing:

push 1
push 2
add
push 3
mul
store [address of x]

The JVM is an example of a stack machine, as is the CLR (the Common Language Runtime, the virtual machine that C# and other languages on the .NET platform compile to). The x86 architecture is not really a stack machine, but we could repurpose the function stack and treat it as a stack machine if we wanted to.

Advantages of stack machines:

  • easy to generate code
  • no temporary variables
  • no register allocation, smaller code. This last part is seen as important when the target architecture has a small memory, or when the code needs to be sent across a slow or expensive channel.

Disadvantages of stack machines:

  • can be much slower than register-based because of the stack load & store operations

4.3.3 Register-based machines

A more common target architecture is a register-based machine: It uses high-speed memory integrated directly into the CPU (i.e., registers) to hold operands and results and requires instructions to explicitly state where the operands are and where the result should be put. For example, the same source code

x = (1 + 2) * 3

would generate:

mov 1 R1
mov 2 R2
add R1 R2
mov 3 R3
mul R2 R3
store R3 [memory location of x]

Here, we are using a somewhat generic assembly language modeled off of x86 but not exactly the same; when there are two operands, the operation is applied and the result is put in the second one. R1, R2, and R3 are registers.

x86, ARM, MIPS, etc are all register-based machines (though most can, as mentioned above, pretend to be stack-based machines by repurposing the function stack).

The advantages and disadvantages of register-based machines are basically the reverse of those for stack-based machines.

We will be targeting x86 as a register-based machine for our compiler. However, we are going to generate code that uses the register machine naively, rather than generating code that utilizes more registers effectively.

4.4 Naïve register-based code generation

We are going to approach code generation in a piecemeal fashion: We will start by assuming a program without any function definitions or calls. Thus, all we have is the single global block of statements. Then, we will add functions and function calls to our language. For our single-block version, we are going to work our way up from variables to different kinds of expressions, all the way to complex control flow statements.

4.5 Variable memory locations

We need to figure out where the variables declared in this block will live and put that information into our symbol table. We will put all the variables on the function stack:

  1. Note that the stack and frame pointers are initialized to the base of the stack.
  2. Compute how many variables are being declared; call this \(N\).
  3. Remember that each variable is a 4-byte integer, so we need \(4N\) bytes of space.
  4. Advance the stack pointer by \(4N\), giving us a stack frame (from the frame pointer to the stack pointer).
  5. All variables are initialized to 0, so write 0 into each word of the stack frame.
  6. In the symbol table, map each declared variable to an offset from the frame pointer.
    • Remember that the stack grows downward, so the offset will be non-positive.

We would have to worry about alignment and padding here except that everything is a 4-byte integer so it all works out.

Now whenever we need to know where a variable is stored in memory, we just look up the offset stored in the symbol table.

  1. Example

    Suppose we are generating the code to allocate the variables in the following program:

    int x;
    int y;
    x := 1 + 2;
    y := x * 3;
    output y;
    

    Here are the steps we take to generate the code and update the symbol table. For these lecture notes, we are going to use STACK_REG for the stack pointer register (%esp in x86), and FRAME_REG for the frame pointer register (%ebp in x86). Also, in our notation, [register+offset] means the memory location at the register's value plus the offset. For example, if RR contains the value 100, then [RR+4] would mean the memory location 100+4 = 104.

    1. There are 2 variables, increase stack by \( 2 \times 4 = 8 \) bytes.

      emit instruction: sub 8 STACK_REG

    2. Initialize them both to 0.

      emit instructions:

      store 0 [FRAME_REG-0]
      store 0 [FRAME_REG-4]
      

      Note that [FRAME_REG-n] means take the value of FRAME_REG, subtract n, and access that location in memory. The offsets are non-positive because the stack grows downward (toward lower memory addresses).

    3. Add info to the symbol table:

      x --> 0   [i.e., x is at a 0 offset from the frame pointer]
      y --> -4  [i.e., y is at a -4 offset from the frame pointer]
      

    We will do this each time we enter a new scope, and call this the "preamble code" for that scope. In this example, the preamble code we emit is:

    sub 8 STACK_REG
    store 0 [FRAME_REG-0]
    store 0 [FRAME_REG-4]
    

4.6 Expressions

We are targeting a register-based machine, but we do not want to do register allocation (at least for now), so we need to evaluate expressions assuming a very limited set of available registers. To generate code for an expression we will do a recursive traversal in post-order (that is, visit the children first, then generate code for the parent).

4.6.1 Arithmetic

Let's start with a simple example expression: (1 + 2) * (3 - 4)

As an AST, this is [* [+ 1 2] [- 3 4]]

Here is a scheme we might attempt (that will turn out to be broken): when processing the *, +, - nodes we want the left subtree's value to end up in one register (call it LEFT_REG) and the right subtree's value to end up in another (call it RIGHT_REG). To generate code for this example, we might try the following:

call generate_aexp(* node, left):
  call generate_aexp(+ node, left):
    call generate_aexp(1 node, left):
      emit "mov 1 LEFT_REG"
    call generate_aexp(2 node, right):
      emit "mov 2 RIGHT_REG"
    emit "add RIGHT_REG LEFT_REG"
  call generate_aexp(- node, right):
    call generate_aexp(3 node, left):
      emit "mov 3 LEFT_REG"
    call generate_aexp(4 node, right):
      emit "mov 4 RIGHT_REG"
    emit "sub RIGHT_REG LEFT_REG" "mov LEFT_REG RIGHT_REG"
  emit "mul RIGHT_REG LEFT_REG"

The second parameter to generate_aexp() tells it which register the result should end up in (we can pick arbitrarily for the root node of the expression). The final emitted code is:

mov 1 LEFT_REG
mov 2 RIGHT_REG
add RIGHT_REG LEFT_REG
mov 3 LEFT_REG
mov 4 RIGHT_REG
sub RIGHT_REG LEFT_REG
mov LEFT_REG RIGHT_REG
mul RIGHT_REG LEFT_REG

We see that there is a problem: the add instruction puts the result of the addition in LEFT_REG, but the very next instruction (mov 3 LEFT_REG) immediately overwrites it in order to set up computing the result of the sub. If we were using a stack machine this wouldn't matter, because we do not need registers and we push the new operands onto the top of the stack. If we were doing register allocation (as a separate stage) we could handle this by assuming that there is an arbitrary number of registers available and then fixing it up later. But we are not doing either of those things, so how do we handle it?

The only thing that can store an arbitrary number of values is memory, so we're going to have to create some memory locations to hold temporary values during expression evaluation. In other words, we need to add a set of temporary variables in addition to the other variables declared in this scope. How do we know how many to add when we enter a particular scope? We will dynamically add them to the symbol table as we generate code for an expression. To keep our strategy simple, we are going to keep track of the space needed for the current scope as we allocate temporaries; then we will go back and update the code that adjusts the stack pointer in the preamble code for this scope, so that the memory is allocated only once.

Note that we can reuse temporary variables between different expressions (e.g., for x := <aexp₁>; y := <aexp₂>; we only need max(#tmp(<aexp₁>), #tmp(<aexp₂>)) temporary variables).

Let's see how we can handle the above example now:

call generate_aexp(* node, tmp_num = 0):
  call generate_aexp(+ node, tmp_num = 1):
    call generate_aexp(1 node, tmp_num = 2):
      emit "mov 1 RESULT_REG"
    insert _tmp1 into symbol table
    emit "store RESULT_REG [_tmp1]"
    call generate_aexp(2 node, tmp_num = 2):
      emit "mov 2 RESULT_REG"
    emit "ld [_tmp1] OTHER_REG" "add OTHER_REG RESULT_REG"
    remove _tmp1 from symbol table
  insert _tmp0 into symbol table
  emit "store RESULT_REG [_tmp0]"
  call generate_aexp(- node, tmp_num = 1):
    insert _tmp1 into symbol table
    call generate_aexp(3 node, tmp_num = 2):
      emit "mov 3 RESULT_REG"
    emit "store RESULT_REG [_tmp1]"
    call generate_aexp(4 node, tmp_num = 2):
      emit "mov 4 RESULT_REG"
    emit "ld [_tmp1] OTHER_REG" "sub RESULT_REG OTHER_REG" "mov OTHER_REG RESULT_REG"
    remove _tmp1 from symbol table
  emit "ld [_tmp0] OTHER_REG" "mul OTHER_REG RESULT_REG"
  remove _tmp0 from symbol table

Notice that each call to generate_aexp puts the result of that call into RESULT_REG, and because we're storing the intermediate results in temporary variables we're never overwriting anything important. Also notice that we can reuse temporary variables once we are done with them (as we do above with _tmp1). When implementing code generation for the assignment, the visitors would not carry the extra parameter, so you'd keep track of the temporary variable number (tmp_num) in the code generator class.

The final emitted code is:

mov 1 RESULT_REG
store RESULT_REG [_tmp1]
mov 2 RESULT_REG
ld [_tmp1] OTHER_REG
add OTHER_REG RESULT_REG
store RESULT_REG [_tmp0]
mov 3 RESULT_REG
store RESULT_REG [_tmp1]
mov 4 RESULT_REG
ld [_tmp1] OTHER_REG
sub RESULT_REG OTHER_REG
mov OTHER_REG RESULT_REG
ld [_tmp0] OTHER_REG
mul OTHER_REG RESULT_REG

In total we needed two temporary variables (_tmp0 and _tmp1), so we needed to add two entries to the symbol table and grow the stack frame by 8 bytes (remember that the stack pointer adjustment happens at the beginning of the scope, so we need to go back to that code and update it). Note that we used the names _tmp<n>, which are not valid variable names in our syntax, thus avoiding name clashes and making it easy to tell which variables were created by the compiler vs the user.

So now we have it working on an example; let's generalize the algorithm for arbitrary arithmetic expressions:

generate_aexp(AST* node, int tmp_num = 0) {
  if (node is a constant number <n>) { emit "mov <n> RESULT_REG"; return; }
  if (node is a variable <x>) { emit "ld [x] RESULT_REG"; return; }

  // node must be one of +,-,*
  generate_aexp(node->left, tmp_num+1);
  insert _tmp<tmp_num> into symbol table;
  emit "store RESULT_REG [_tmp<tmp_num>]";
  generate_aexp(node->right, tmp_num+1);
  emit "ld [_tmp<tmp_num>] OTHER_REG";

  // left-hand value is in OTHER_REG, right-hand value is in RESULT_REG
  if (node is +) { emit "add OTHER_REG RESULT_REG"; }
  else if (node is -) { emit "sub RESULT_REG OTHER_REG"; emit "mov OTHER_REG RESULT_REG"; }
  else { emit "mul OTHER_REG RESULT_REG"; } // node must be *

  remove _tmp<tmp_num> from symbol table;
}

Remember, it's important that inserting a new temporary variable into the symbol table also adjusts the amount by which the preamble code for the enclosing scope moves the stack pointer to create the stack frame. However, we do not have to initialize the new memory locations to 0 because we are guaranteed to always store a value to them before reading them. Here, when we remove a temporary from the symbol table, we need to adjust the next offset so the space it used to occupy becomes available. The way the recursive calls work in this algorithm ensures that all the temporaries allocated between inserting and removing _tmp<n> are themselves removed before _tmp<n> is removed, so we can simply roll back the next-offset field when removing a temporary variable.
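As a concrete illustration, here is a small sketch of that bookkeeping. next_offset and max_frame_size are hypothetical fields of the code generator; the "sub <N> STACK_REG" in the scope's preamble would be backpatched with max_frame_size once the scope has been fully generated.

#include <algorithm>

struct TempAllocator {
  int next_offset = 0;      // offset of the next free slot below FRAME_REG (0, -4, -8, ...)
  int max_frame_size = 0;   // largest amount of stack space this scope ever needed

  int insert_temp() {       // returns the frame-pointer offset assigned to the new temporary
    int off = next_offset;
    next_offset -= 4;       // 4-byte integers, stack grows downward
    max_frame_size = std::max(max_frame_size, -next_offset);
    return off;
  }

  void remove_temp() {      // temporaries are freed in LIFO order, as discussed above
    next_offset += 4;
  }
};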

Also, note that we use only 2 registers: the result register, and the other register. We are going to use %eax and %edx for these in x86 code generation.

  1. Exercise

    Generate code for the following expression:

    ((1 - 2) * (3 * 4)) + (5 - (6 + 7))
    

    The AST is:

    [+ [* [- 1 2] [* 3 4]] [- 5 [+ 6 7]]]
    

    Solution:

    mov 1 RR
    store RR [_tmp2]
    mov 2 RR
    ld [_tmp2] OR
    sub RR OR
    mov OR RR
    store RR [_tmp1]
    mov 3 RR
    store RR [_tmp2]
    mov 4 RR
    ld [_tmp2] OR
    mul OR RR
    ld [_tmp1] OR
    mul OR RR
    store RR [_tmp0]
    mov 5 RR
    store RR [_tmp1]
    mov 6 RR
    store RR [_tmp2]
    mov 7 RR
    ld [_tmp2] OR
    add OR RR
    ld [_tmp1] OR
    sub RR OR
    mov OR RR
    ld [_tmp0] OR
    add OR RR
    

4.6.2 Relational / logical

Now that we have handled arithmetic expressions, the relational expressions are easy. The main thing we need to do is decide how to represent the result of a relational expression (i.e., how we represent a Boolean). We will use the same method as C/C++: 0 is interpreted as false, and non-0 is interpreted as true. However, we are going to guarantee that the result of every relational expression is only 0 or 1 (this makes implementing logical negation much easier on x86).

Often, assembly instructions that compare values store the result as a flag rather than in a register, so we need to compensate for that. Let's look at an example:

(1 < 2) && ((3 <= 4) || (5 = 6))

As a tree this is:

[&& [< 1 2] [|| [<= 3 4] [= 5 6]]]

So the process we want the code generator to go through is:

call generate_rexp(&& node, tmp_num = 0):
  call generate_rexp(< node, tmp_num = 1):
    insert _tmp1 into symbol table
    call generate_aexp(1 node, tmp_num = 2)
    emit "store RESULT_REG [_tmp1]"
    call generate_aexp(2 node, tmp_num = 2)
    emit "ld [_tmp1] OTHER_REG" "cmp RESULT_REG OTHER_REG" "setlt RESULT_REG"
    remove _tmp1 from symbol table
  insert _tmp0 into symbol table
  emit "store RESULT_REG [_tmp0]"
  call generate_rexp(|| node, tmp_num = 1):
    insert _tmp1 into symbol table // does nothing
    call generate_rexp(<= node, tmp_num = 2):
      insert _tmp2 into symbol table
      call generate_aexp(3, tmp_num = 3)
      emit "store RESULT_REG [_tmp2]"
      call generate_aexp(4, tmp_num = 3)
      emit "ld [_tmp2] OTHER_REG" "cmp RESULT_REG OTHER_REG" "setle RESULT_REG"
      remove _tmp2 from symbol table
    emit "store RESULT_REG [_tmp1]"
    call generate_rexp(= node, tmp_num = 2):
      insert _tmp2 into symbol table // does nothing
      call generate_aexp(5, tmp_num = 3)
      emit "store RESULT_REG [_tmp2]"
      call generate_aexp(6, tmp_num = 3)
      emit "ld [_tmp2] OTHER_REG" "cmp RESULT_REG OTHER_REG" "sete RESULT_REG"
      remove _tmp2 from symbol table
    emit "ld [_tmp1] OTHER_REG" "or OTHER_REG RESULT_REG"
    remove _tmp1 from symbol table
  emit "ld [_tmp0] OTHER_REG" "and OTHER_REG RESULT_REG"
  remove _tmp0 from symbol table

and the final generated code is:

[code from generate_aexp]
store RESULT_REG [_tmp1]
[code from generate_aexp]
ld [_tmp1] OTHER_REG
cmp RESULT_REG OTHER_REG
setlt RESULT_REG
store RESULT_REG [_tmp0]
[code from generate_aexp]
store RESULT_REG [_tmp2]
[code from generate_aexp]
ld [_tmp2] OTHER_REG
cmp RESULT_REG OTHER_REG
setle RESULT_REG
store RESULT_REG [_tmp1]
[code from generate_aexp]
store RESULT_REG [_tmp2]
[code from generate_aexp]
ld [_tmp2] OTHER_REG
cmp RESULT_REG OTHER_REG
sete RESULT_REG
ld [_tmp1] OTHER_REG
or OTHER_REG RESULT_REG
ld [_tmp0] OTHER_REG
and OTHER_REG RESULT_REG

Note that you need to generate slightly different code for x86 (because of the peculiarities of the architecture). It's basically like the arithmetic expressions except that the comparison instructions set a condition flag, which we then need to read in order to set a register. So the generate_rexp code is:

generate_rexp(AST* node, int tmp_num = 0) {
  if (node is a comparison) {
    generate_aexp(node->left, tmp_num+1);
    insert _tmp<tmp_num> into symbol table;
    emit "store RESULT_REG [_tmp<tmp_num>]";
    generate_aexp(node->right, tmp_num+1);
    emit "ld [_tmp<tmp_num>] OTHER_REG";
    emit "cmp RESULT_REG OTHER_REG";
    emit "set<op> RESULT_REG" // <op> is {e,lt,le} depending on the comparison being made
    remove _tmp<tmp_num> from symbol table;
  } else {
    generate_rexp(node->left, tmp_num+1);
    insert _tmp<tmp_num> into symbol table;
    emit "store RESULT_REG [_tmp<tmp_num>]";
    generate_rexp(node->right, tmp_num+1);
    emit "ld [_tmp<tmp_num>] OTHER_REG";
    emit "<op> OTHER_REG RESULT_REG" // <op> is {and,or} depending on the node
    remove _tmp<tmp_num> from symbol table;
  }
}
  1. Exercise

    Generate code for the following expression:

    ((1 = 2) || ((3 < 4) && (5 <= 6))) && ((7 < 8) || (9 = 9))
    

    The AST is:

    [&& [|| [= 1 2] [&& [< 3 4] [<= 5 6]]] [|| [< 7 8] [= 9 9]]]
    

    Solution:

    mov 1 RR
    store RR [_tmp2]
    mov 2 RR
    ld [_tmp2] OR
    cmp RR OR
    sete RR
    store RR [_tmp1]
    mov 3 RR
    store RR [_tmp3]
    mov 4 RR
    ld [_tmp3] OR
    cmp RR OR
    setlt RR
    store RR [_tmp2]
    mov 5 RR
    store RR [_tmp3]
    mov 6 RR
    ld [_tmp3] OR
    cmp RR OR
    setle RR
    ld [_tmp2] OR
    and OR RR
    ld [_tmp1] OR
    or OR RR
    store RR [_tmp0]
    mov 7 RR
    store RR [_tmp2]
    mov 8 RR
    ld [_tmp2] OR
    cmp RR OR
    setlt RR
    store RR [_tmp1]
    mov 9 RR
    store RR [_tmp2]
    mov 9 RR
    ld [_tmp2] OR
    cmp RR OR
    sete RR
    ld [_tmp1] OR
    or OR RR
    ld [_tmp0] OR
    and OR RR
    

4.6.3 Short-circuiting evaluation

It can be wasteful to evaluate an entire logical expression if we already know the answer part-way through (and even harmful if the remainder of the expression has side effects, as it can in C). Consider the following C expression:

if (ptr != NULL && *ptr == 42) { … }

Clearly, if ptr is NULL we do not want to dereference it. This is called short-circuited evaluation:

  • given <lhs> && <rhs>, if <lhs> is false then we do not need to evaluate <rhs>.
  • given <lhs> || <rhs>, if <lhs> is true then we do not need to evaluate <rhs>.

Our language C♭ doesn't currently allow these kinds of expressions, but we should understand how to implement them. The basic idea is simple: we have to insert a conditional into the middle of the expression evaluation. When evaluating a && (or ||) node, after evaluating the left-hand side we emit instructions to check whether the result is false (or true), and if so, jump past the evaluation of the right-hand side. We will save the details for when we discuss generating code for conditionals.

4.7 Assignments

Assignment is trivial: we evaluate the right-hand side using generate_aexp, which puts the result in RESULT_REG, then store the result to the memory location for the left-hand side variable.

generate_assign(lhs, rhs) {
  generate_aexp(rhs);
  emit "store RR [lhs]";
}

4.8 Conditionals

We are going to cover conditionals first without a nested scope, then extend the code generation scheme for them with nested scopes.

4.8.1 No nested scope

Let's look at an example:

if (x < 2) { x := 1; } else { x := 2; }

Depending on the outcome of the comparison, we want to execute either the true branch or the false branch. We do not necessarily know the outcome of the comparison at compile time, so we need to emit code for both branches and choose between them at runtime.

For the above example we would get (here, IF_FALSE_0 etc. are labels):

    ld [x] RR
    store RR [_tmp0]
    mov 2 RR
    ld [_tmp0] OR
    cmp RR OR
    setlt RR
    cmp 0 RR
    jmpe IF_FALSE_0
    mov 1 RR
    store RR [x]
    jmp IF_END_0
IF_FALSE_0:
    mov 2 RR
    store RR [x]
IF_END_0:

Note again the inefficiency of the code generator. We could be a bit more clever and do things like check whether our left and/or right operands are leaves and if so avoid some stores and loads, which would be better code. But we are keeping things simple so that the code generator only has to look at the current AST node and nothing else.

Generalizing it to arbitrary conditionals, we get:

generate_if(node) {
  <n> = fresh index;
  generate_rexp(node->guard);
  emit "cmp 0 RESULT_REG";
  emit "jmpe IF_FALSE_<n>";
  generate_block(node->true_branch);
  emit "jmp IF_END_<n>";
  emit "IF_FALSE_<n>:";
  generate_block(node->false_branch);
  emit "IF_END_<n>:";
}

4.8.2 Adding nested scope

Remember that each branch of a conditional introduces a new scope which can declare its own variables. That means that we need to emit the preamble code again when the code generator enters that new scope:

  1. see how many declared variables there are
  2. adjust the stack pointer accordingly
  3. initialize the new memory locations to 0
  4. update symbol table to map the newly declared variables to their offsets

and when we leave the new scope we need to adjust things back the way they were:

  1. reset the stack pointer to its old position
  2. restore the symbol table to its old value

So, the code generator for the blocks of code inside the true/false branches would look something like:

generate_block(node) {
  old_symbol_table = symbol_table;
  stack_size = node->num_declared_variables * 4; // because 4-byte integers
  emit "sub <stack_size> STACK_REG";
  insert_in_symbol_table(symbol_table, node->declared_variables);
  for each var in node->declared_variables { emit "store 0 [var]"; }
  .
  .
  .
  emit "add <stack_size> STACK_REG";
  symbol_table = old_symbol_table;
}

Remember that we cannot just take the existing symbol table and add the declared variables to it and then remove them when the block is over, because some of them could shadow existing variables in the enclosing scope. You can get around that problem in various ways (e.g., making the symbol table a stack instead of a map, where lookup scans the stack top-down looking for the nearest declared variable with the given name). For the purposes of this pseudocode, we just always pass a copy of the symbol table to each branch.

Another consideration is that an expression inside a nested scope may require more temporary variables than any expression in an enclosing block. If so, then we need to backpatch the preamble code for the nested scope to allocate more space for those temporary variables.
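
To make the backpatching idea concrete, here is a minimal C++ sketch (not the actual compiler code; the sizes and names are made up for illustration): we emit a placeholder allocation instruction, remember its position in the generated code, and overwrite it once we know how much space the block really needs.

// A minimal sketch of the backpatching idea (simplified, illustrative values).
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> code;

  code.push_back("sub 0 SR");              // placeholder: size not known yet
  size_t alloc_index = code.size() - 1;    // remember where the placeholder is

  // ... generate the body of the block; while doing so, track how many
  // declared variables and additional temporaries the block needs ...
  int declared_vars = 1;                   // e.g. 'x'
  int extra_temporaries = 1;               // e.g. '_tmp1' (beyond the enclosing scope)
  code.push_back("// ...body of the block...");

  // Backpatch the placeholder now that the real size is known (4-byte slots).
  int stack_size = (declared_vars + extra_temporaries) * 4;
  code[alloc_index] = "sub " + std::to_string(stack_size) + " SR";
  code.push_back("add " + std::to_string(stack_size) + " SR");

  for (const auto& line : code) std::cout << line << "\n";
}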

Example:

if (y < 2) { int x; x := x + (1 + 2); y := x; }

assume this conditional is the only statement. Then the generated code would be:

    ld [y] RR
    store RR [_tmp0] // note that this _tmp0 is outside the nested scope
    mov 2 RR
    ld [_tmp0] OR
    cmp RR OR
    setlt RR
    cmp 0 RR
    jmpe IF_FALSE_0
    sub 8 SR // space for 'x' and '_tmp1'
    store 0 [x] // initialize x
    ld [x] RR
    store RR [_tmp0] // reuse _tmp0 from enclosing scope
    mov 1 RR
    store RR [_tmp1]
    mov 2 RR
    ld [_tmp1] OR
    add OR RR
    ld [_tmp0] OR
    add OR RR
    store RR [x]
    ld [x] RR
    store RR [y]
    add 8 SR // remove space for nested declarations
    jmp IF_END_0
IF_FALSE_0:
IF_END_0:

Notice that we have allocated the space for both the local variables in the scope and the temporaries at once.

4.8.3 Exercise 1

Generate code for the following program (fully, including preamble code and all expressions and statements):

if (1 < 2) { int y; x := y; } else { x := 2; }

Solution:

    mov 1 RR
    store RR [_tmp0]
    mov 2 RR
    ld [_tmp0] OR
    cmp RR OR
    setlt RR
    cmp 0 RR
    jmpe IF_FALSE_0
    sub 4 SR
    store 0 [y]
    ld [y] RR
    store RR [x]
    add 4 SR
    jmp IF_END_0
IF_FALSE_0:
    mov 2 RR
    store RR [x]
IF_END_0:

4.8.4 Exercise 2 (short-circuited evaluation)

Generate code for the following relational expression using short-circuiting:

(1 < 1) && (2 < 3)

AST:

[&& [< 1 1] [< 2 3]]

Solution:

    mov 1 RR
    store RR [_tmp1]
    mov 1 RR
    ld [_tmp1] OR
    cmp RR OR
    setlt RR
    cmp 0 RR
    jmpe REXP_END_0
    store RR [_tmp0]
    mov 2 RR
    store RR [_tmp1]
    mov 3 RR
    ld [_tmp1] OR
    cmp RR OR
    setlt RR
    ld [_tmp0] OR
    and OR RR
REXP_END_0:
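
Generalizing the pattern from this exercise, here is a small C++ sketch of what a code generator for a short-circuited && node could look like. The emit and generate_rexp helpers are simplified stand-ins for the real code generator, not its actual API; || would be symmetric, jumping past the right-hand side when the left-hand side is true.

// A minimal sketch of short-circuited && codegen (simplified stand-in helpers).
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> code;   // the emitted assembly, one line per entry
int label_counter = 0;           // for fresh label indices

void emit(const std::string& line) { code.push_back(line); }

// Placeholder: the real code generator would recurse into the AST here.
void generate_rexp(const std::string& node) { emit("// code for " + node + ", result in RR"); }

// Short-circuited &&: skip the right-hand side if the left-hand side is false.
void generate_and(const std::string& lhs, const std::string& rhs, const std::string& tmp) {
  int n = label_counter++;
  generate_rexp(lhs);                            // result of lhs in RR
  emit("cmp 0 RR");
  emit("jmpe REXP_END_" + std::to_string(n));    // lhs false: RR is already 0, skip rhs
  emit("store RR [" + tmp + "]");                // save the lhs result
  generate_rexp(rhs);                            // result of rhs in RR
  emit("ld [" + tmp + "] OR");
  emit("and OR RR");
  emit("REXP_END_" + std::to_string(n) + ":");
}

int main() {
  generate_and("(1 < 1)", "(2 < 3)", "_tmp0");
  for (const auto& line : code) std::cout << line << "\n";
}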

4.9 Loops

We are going to handle while loops first, then modify our solution for them to get the solution for for loops.

4.9.1 While loops

Loops are a lot like conditionals except the true branch is the body of the loop and the false branch is the end of the loop. Let's see an example:

while (x < 3) { x := x + 1; }

So, for the above example, we would get:

WHILE_START_0:
    ld [x] RR
    store RR [_tmp0]
    mov 3 RR
    ld [_tmp0] OR
    cmp RR OR
    setlt RR
    cmp 0 RR
    jmpe WHILE_END_0
    ld [x] RR
    store RR [_tmp0]
    mov 1 RR
    ld [_tmp0] OR
    add OR RR
    store RR [x]
    jmp WHILE_START_0
WHILE_END_0:

Generalizing for arbitrary while loops:

generate_while(node) {
  <n> = fresh index;
  emit "WHILE_START_<n>:";
  generate_rexp(node->guard);
  emit "cmp 0 RESULT_REG";
  emit "jmpe WHILE_END_<n>";
  generate_block(node->body);
  emit "jmp WHILE_START_<n>";
  emit "WHILE_END_<n>:";
}
  1. Nested scope

    We have the same issue here with nested scope as for conditionals, though there's only one new scope: the body of the loop. We handle it exactly the same way. Note that the loop body can be entered many times; the preamble and cleanup code for the new scope will be executed on each iteration. The preamble and cleanup code themselves are the same as for conditionals.

  2. Exercise

    Generate code for the following program (fully, including all preamble code, expressions, and statements):

    while (x <= 10) { int y; y := 2; x := x + y; }
    

    Solution:

    WHILE_START_0:
        // code for the loop guard
        ld [x] RR
        store RR [_tmp0]
        mov 10 RR
        ld [_tmp0] OR
        cmp RR OR
        setle RR
        cmp 0 RR
        jmpe WHILE_END_0
        // code for the loop body, notice the preamble and cleanup
        sub 4 SR
        store 0 [y]
        mov 2 RR
        store RR [y]
        ld [x] RR
        store RR [_tmp0]
        ld [y] RR
        ld [_tmp0] OR
        add OR RR
        store RR [x]
        add 4 SR
        jmp WHILE_START_0
    WHILE_END_0:
    

4.9.2 For loops

Firstly, let's look at how to rewrite a for loop in terms of while loops and extra scopes (our language does not have free blocks so we are writing in a mix of our language and C below). Here is our for loop:

for x from 1 to y * 2 {
  y := x + y;
}

In our language, the loop above means that x is going to take the values 1 through y * 2 inclusive (it will take values \( 1,2,\ldots,2y-1,2y \)). Also, notice that we are updating y inside the loop. In other languages this could change the loop range, but we want the loop range to be computed only once, so manipulating y inside the loop does not cause any unexpected behavior (such as turning the loop above into an infinite loop). However, we are going to allow x to be manipulated inside the for loop to keep code generation simple. Here is the code above written in terms of a while loop:

{ // create a new scope
  int x; // declare x

  x := 1; // initialize x to loop start value

  _tmpLoopEnd := y * 2; // create a temporary and store the loop end value there

  // create a loop for incrementing x
  while x <= _tmpLoopEnd {
    // insert the body of the for loop
    y := x + y;

    // insert the code to increment x
    x := x + 1;
  }
  // close the scope
}

We are going to generate assembly code similar to the code above; however, we are going to apply a few optimizations:

  • We do not need to initialize x to 0 like other declarations, because it is immediately overwritten by the loop start value.
  • We are going to increment x without a temporary variable, using a load, an add, and a store (the generic code for x := x + 1 would be more verbose).
  • We are also going to generate the code for the comparison more efficiently.

Also, we are now creating a scope outside the loop for x and _tmpLoopEnd, so we need to access the symbol table. Here is the assembly code we are going to generate for the loop above:

  // allocate space for x, _tmpLoopEnd, and the temporaries needed for
  // the loop start and end expressions
  sub 12 SR
  mov 1 RR // code for loop start "1"
  store RR [x] // store the loop start in [x]
  // code for loop end "y * 2"
  ld [y] RR
  store RR [_tmp1]
  mov 2 RR
  ld [_tmp1] OR
  mul OR RR
  // store the result in _tmpLoopEnd
  store RR [_tmpLoopEnd]
FOR_GUARD_0: // label to jump back to the loop guard
  // code for computing x <= _tmpLoopEnd
  ld [x] RR
  ld [_tmpLoopEnd] OR
  cmp OR RR
  jmpg FOR_END_0 // jump to the end if x > _tmpLoopEnd
  // the code for the body of the loop
  [the code for y := x + y]
  // increment x then jump to the guard
  ld [x] RR
  add 1 RR
  store RR [x]
  jmp FOR_GUARD_0
FOR_END_0:
  // clean-up code for the scope of the for loop
  add 12 SR

We can now generalize the code above to an arbitrary for loop. One important detail that we need to pay attention to is that we need to account for temporaries when allocating the outer stack space. We also generate the temporary variable for the loop end value just like any other variable. Here is the pseudo-code for generating for loops:

generateFor(loop) {
  symbolTable.createScope();
  emit "sub 4 SR"; // a place-holder instruction for allocating space
  alloc_insn <- &last_instruction; // remember this instruction so we can
                                   // modify it later with the correct
                                   // amount of space
  // Add the loop variable and the loop end temporary to the symbol table
  symbolTable.insert(loop->loopVar);
  tmpLoopEnd <- symbolTable.freshTmp();
  // Initialize the loop variable and the loop end
  genAexp(loop->from);
  emit "store RR [loopVar]";
  genAexp(loop->to);
  emit "store RR [tmpLoopEnd]";
  // generate the loop guard code
  N <- fresh index;
  emit "FOR_GUARD_<N>:";
  emit "  ld [loopVar] RR";
  emit "  ld [tmpLoopEnd] OR";
  emit "  cmp OR RR";
  emit "  jmpg FOR_END_<N>";

  generateBlock(loop->body);
  // increment the loop variable then jump to the guard
  emit "ld [loopVar] RR";
  emit "add 1 RR";
  emit "store RR [loopVar]";
  emit "jmp FOR_GUARD_<N>";
  // mark the end of the loop
  emit "FOR_END_<N>:";

  // re-calculate the space needed for this scope (this can be done
  // before the body as well)
  size_needed <- symbolTable.maxSizeForCurrScope * 4;
  update alloc_insn to "sub <size_needed> SR";
  // reclaim the stack space
  emit "add <size_needed> SR";
}

4.10 Implementing functions and calls

Functions (also called procedures in some programming languages) are an illusion: they exist as a programming language abstraction, but the actual machine has no notion of a function. On the machine, we have only jumps, conditionals, registers, and the memory. However, functions are an extremely valuable illusion for programmers:

  1. they allow code to be abstracted over the data the code is run on. For example:

    // imagine that these are complicated expressions
    (1 + 2) * 2
    (3 + 4) * 4
    
    // procedural abstraction:
    int proc(int a, int b) { return (a + b) * b; }
    
  2. If the language has higher-order functions (functions that can take, create, and return functions like ordinary values), they also allow code to be abstracted over control-flow. For example, if our language had that feature, we could have a sort function that was independent of the comparison metric:

    list = [4, 2, 6, 1, 5, 3];
    sort(list, (a, b) => { return a < b; }) : [1, 2, 3, 4, 5, 6]
    sort(list, (a, b) => { return a > b; }) : [6, 5, 4, 3, 2, 1]
    
  3. They provide modularity: code can be written and compiled independently but still linked together and run correctly. Function bodies are isolated from each other by different scopes.

The compiler's job is to take this abstraction and turn it into assembly code that preserves the intended meaning of the procedure abstraction. What is that intended meaning? In other words, what happens when a function is called?

  1. The callee gets fresh instances of its parameters and local variables (local scope, even in the presence of recursion).
  2. The argument values are copied to the parameters.
  3. Control jumps to the beginning of the callee.
  4. The callee is executed.
  5. Control jumps back to the caller site.
  6. If the function returns something, the return value is provided as the result of the function call expression.

The basic tools we use to achieve this execution are:

  1. The stack frame (or activation record)
  2. Calling conventions

4.10.1 A note about function scope in C♭

There are no global variables in the usual sense. So, when processing a function definition, we are going to have a fresh symbol table containing only defined functions (to support function calls) and parameters.

4.10.2 Stack frame (activation record)

We have talked about the function call stack before. As a reminder:

  1. part of the process memory is set aside for the stack.
  2. stack register: holds a pointer to the current top of the stack.

This means that SR holds the location of the top valid word on the stack, rather than the next available word—if we read the memory location contained in SR, we get the value of the word that's currently on top of the stack. To make space on the stack, decrement the stack pointer; to reclaim that space increment the stack pointer.

low    high
  [  xx]
     ^
     sp

We will use the stack to implement a function's functionality. The key data structure is called a stack frame (or sometimes, activation record). A stack frame will hold all the information a function needs to operate correctly:

  • memory for the local variables
  • where to return to when it's done executing
  • saved values for registers the function may overwrite

How will the function know where to access this information, since it can appear anywhere on the stack? We will use the frame pointer (which lives in a register just like the stack pointer). The frame pointer points to the bottom of the current stack frame; the currently executing function can access all of its info in terms of offsets from the frame pointer.

The basic idea is that when a function is called:

  1. the old frame pointer is saved
  2. the frame pointer is set to the current stack pointer
  3. the stack pointer is decremented to allocate space for the activation record
  4. the activation record is filled in appropriately
  5. control jumps to the function
  6. when the function is done, the stack pointer is set to the frame pointer (which was the old value of the stack pointer) and the frame pointer is restored to its old value

See the lectures for an example of how the stack and frame pointers change for a function call.

Note that this is kind of like nested scope, except that we adjust the frame pointer so that the callee gets a completely fresh scope (in contrast to nested scope, which still has access to variables from the enclosing scope).

4.10.3 Calling conventions

In order to make all of this work even when functions have been compiled separately (perhaps even by different compilers) there must be a convention that says exactly how a call will work and who is responsible for what between the caller and the callee. We need a convention to make sure that the functions are cooperating on which registers to restore, and how to pass the arguments and the return value.

As part of the calling convention, every function has a prologue and epilogue, and every function call has a pre-call sequence and a post-return sequence.

[TODO: show quick diagram to make this clear]

There is an agreed-upon protocol for the language to enforce who (caller, callee) does what to make function calls work; this is the calling convention.

Anatomy of a function call. This is a "typical" example similar to that used by C on the x86 (and specifically Linux); the details can vary from language to language and machine to machine:

Pre-call:

  • save any caller-save registers by pushing them onto the stack
  • push the call arguments onto the stack (in reverse)
  • push the return address onto the stack
  • jump to address of callee's prologue code

Prologue code:

  • push the frame pointer onto the stack
  • copy the stack pointer into the frame pointer register (the frame pointer now points to the saved old frame pointer)
  • make space on the stack for local variables (decrement stack pointer)
  • save any callee-save registers by pushing them onto the stack

Epilogue code:

  • put return value in known register
  • restore callee-save registers by popping from stack
  • deallocate local variables by moving frame pointer to stack pointer
  • restore caller's frame pointer by popping it from the stack
  • pop call site's return address from stack and jump to it

Post-return code:

  • discard call arguments from stack
  • restore caller-save registers by popping from stack
  • write return value to left-hand side of call
  • continue execution

Note that the caller-save registers only need to be pushed if they are holding values needed after the function call, and callee-save registers only need to be pushed if they may be overwritten by the callee function.

4.10.4 Example

The program:

def foo(int a, int b) : int { int c; c := a+b; return c+1; }
int x;
x := foo(2, 3);
output x;

offsets, scope 0 ("global" scope): x : -4

offsets, scope foo: a : 8, b : 12, c : -4, _tmp0 : -8

We will assume the existence of a few handy instructions:

  • push decrements the stack pointer and stores its argument to the newly allocated space on the stack.
  • pop increments the stack pointer and loads the contents of the newly deallocated space into its argument
  • call pushes the current instruction address onto the stack and jumps to the given label
  • ret pops an instruction address off the stack and jumps to it

We have not targeted a specific architecture yet, so we will also assume the existence of CALLER_SAVE_REG1, CALLER_SAVE_REG2, CALLEE_SAVE_REG1, and CALLEE_SAVE_REG2, with the obvious meanings.

Entry: // entry point for entire program
  // ENTRY PROLOGUE
  push FR               // save current frame pointer
  mov SR FR             // replace old frame pointer with current stack pointer
  sub 4 SR              // allocate space for 'x'
  store 0 [FR-4]          // initialize 'x' to 0
  push CALLEE_SAVE_REG1 // save callee-save registers
  push CALLEE_SAVE_REG2

  // PRE-CALL for x := foo(2, 3)
  push CALLER_SAVE_REG1 // save caller-save registers
  push CALLER_SAVE_REG2
  push 3                // push arguments onto stack in reverse
  push 2
  call FOO

  // POST-RETURN for x := foo(2, 3)
  add 8 SR             // discard arguments from stack
  pop CALLER_SAVE_REG2 // restore caller-save registers
  pop CALLER_SAVE_REG1
  store RR [FR-4]        // store return value to memory location of 'x'

  // output x
  ld [FR-4] RR

  // ENTRY EPILOGUE
  pop CALLEE_SAVE_REG2 // restore callee-save registers
  pop CALLEE_SAVE_REG1
  mov FR SR            // restore old stack pointer, deallocating stack frame
  pop FR               // restore old frame pointer
  ret                  // pop return address and jump to it, exiting program

FOO:
  // FOO PROLOGUE
  push FR               // save current frame pointer
  mov SR FR             // replace old frame pointer with current stack pointer
  sub 8 SR              // allocate space for local variables (c and _tmp0)
  store 0 [FR-4]          // initialize 'c' to 0
  push CALLEE_SAVE_REG1 // save callee-save registers
  push CALLEE_SAVE_REG2

  // c := a+b
  ld [FR+8] RR
  store RR [FR-8]
  ld [FR+12] RR
  ld [FR-8] OR
  add OR RR
  store RR [FR-4]

  // return c+1
  ld [FR-4] RR
  store RR [FR-8]
  mov 1 RR
  ld [FR-8] OR
  add OR RR

  // FOO EPILOGUE
  pop CALLEE_SAVE_REG2 // restore callee-save registers
  pop CALLEE_SAVE_REG1
  mov FR SR            // restore old stack pointer, deallocating stack frame
  pop FR               // restore old frame pointer
  ret                  // pop return address and jump to it

[show the stack as we walk through the code]

How did the code generator know the offsets of the arguments to the callee function when they were pushed onto the stack by the caller function? Because of the calling convention, the callee always knows that the first argument is at a positive 8 offset from the current frame pointer (to pass over the saved frame pointer and return address) and subsequent arguments are at 4-byte offsets from that.
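
As a picture (a sketch of the layout implied by the code above, not a new convention), here is foo's stack frame at the point where its body starts executing; higher addresses are toward the top, and all offsets are relative to the frame pointer FR:

  ...caller's frame (Entry's locals, saved caller-save registers)...
  [FR+12]  b  (second argument, pushed by the caller)
  [FR+8]   a  (first argument, pushed by the caller)
  [FR+4]   return address (pushed by call)
  [FR]     saved frame pointer (pushed in FOO's prologue)   <-- FR
  [FR-4]   c  (local variable)
  [FR-8]   _tmp0  (temporary)
  ...saved callee-save registers...                          <-- SR after the prologue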

4.10.5 Exercise

Translate the following program into assembly under the same assumptions as the example above:

def bar(int a) : int { return a+42; }
def foo(int a) : int { int x; int y; x := bar(a); y := bar(a+1); return x*y; }
int x;
x := 2;
x := foo(x);
output x+3;

Solution

offsets, scope 0: x : -4, _tmp0 : -8

offsets, scope foo: a : 8, x : -4, y : -8, _tmp0 : -12

offsets, scope bar: a : 8, _tmp0 : -4


Entry:
  // ENTRY PROLOGUE
  push FR
  mov SR FR
  sub 8 SR
  store 0 [FR-4]
  push CALLEE_SAVE_REG1
  push CALLEE_SAVE_REG2

  // x := 2
  mov 2 RR
  store RR [FR-4]

  // PRE-CALL for x := foo(x)
  push CALLER_SAVE_REG1
  push CALLER_SAVE_REG2
  ld [FR-4] RR
  push RR
  call foo

  // POST-RETURN for x := foo(x)
  add 4 SR
  pop CALLER_SAVE_REG2
  pop CALLER_SAVE_REG1
  store RR [FR-4]

  // output x+3
  ld  [FR-4] RR
  store RR [FR-8]
  mov 3 RR
  ld  [FR-8] OR
  add OR RR

  // ENTRY EPILOGUE
  pop CALLEE_SAVE_REG2
  pop CALLEE_SAVE_REG1
  mov FR SR
  pop FR
  ret

foo:
  // FOO PROLOGUE
  push FR
  mov SR FR
  sub 12 SR
  store 0 [FR-4]
  store 0 [FR-8]
  push CALLEE_SAVE_REG1
  push CALLEE_SAVE_REG2

  // PRE-CALL for x := bar(a)
  push CALLER_SAVE_REG1
  push CALLER_SAVE_REG2
  ld [FR+8] RR
  push RR
  call bar

  // POST-RETURN for x := bar(a)
  add 4 SR
  pop CALLER_SAVE_REG2
  pop CALLER_SAVE_REG1
  store RR [FR-4]

  // PRE-CALL for y := bar(a+1)
  push CALLER_SAVE_REG1
  push CALLER_SAVE_REG2
  ld [FR+8] RR
  store RR [FR-12]
  mov 1 RR
  ld [FR-12] OR
  add OR RR
  push RR
  call bar

  // POST-RETURN for y := bar(a+1)
  add 4 SR
  pop CALLER_SAVE_REG2
  pop CALLER_SAVE_REG1
  store RR [FR-8]

  // return x*y
  ld  [FR-4] RR
  store RR [FR-12]
  ld  [FR-8] RR
  ld  [FR-12] OR
  mul OR RR

  // FOO EPILOGUE
  pop CALLEE_SAVE_REG2
  pop CALLEE_SAVE_REG1
  mov FR SR
  pop FR
  ret

bar:
  // BAR PROLOGUE
  push FR
  mov SR FR
  sub 4 SR
  push CALLEE_SAVE_REG1
  push CALLEE_SAVE_REG2

  // return a+42
  ld  [FR+8] RR
  store RR [FR-4]
  mov 42 RR
  ld  [FR-4] OR
  add OR RR

  // BAR EPILOGUE
  pop CALLEE_SAVE_REG2
  pop CALLEE_SAVE_REG1
  mov FR SR
  pop FR
  ret

4.11 Notes on x86 assembly

4.11.1 Some basic x86

We're not going to teach x86, but we will give some pointers. An excellent resource for you to look at is https://en.wikibooks.org/wiki/X86_Assembly.

When looking here, keep in mind:

  • we are using 32-bit x86
  • we are using the GNU assembler, which uses AT&T syntax

There are lots of other places to look for x86, but remember we're using GAS/AT&T syntax, not Intel syntax.

Wiki page 'x86 Register and Architecture Description':

  • we're using the 32-bit registers, so %eax, %ebx, %ecx, %edx, %esp, %ebp, etc.
  • %esp is the stack pointer, %ebp is the frame pointer, %eax is the return value register

Wiki page 'GAS Syntax':

  • use the 'l' suffix to indicate we're operating on 32-bit integers
  • we're using the GNU assembler, which uses AT&T syntax. When looking at x86 resources, be sure you're looking at ones that use AT&T syntax.

Operands:

  • $k for constant k
  • %reg for register value
  • (%reg) for memory location whose address is in reg (what we write as [reg] in lecture notes)
  • k(%reg) for memory location at address (%reg + k) ([reg + k] in the lecture notes).

At most one operand of a given instruction can use indirection (i.e., be a memory access).

Examples:

  • mov 2 RR == movl $2, %eax
  • sub 8 SR == subl $8, %esp
  • ld [FR-4] RR == movl -4(%ebp), %eax
  • store RR [FR-12] == movl %eax, -12(%ebp)

When looking at disassembled code, do not worry about the header information contained in sections like ".file", ".text", etc. We will add the necessary scaffolding in the infrastructure we give you for your assignments, and describe what code to emit for the entry point of the main program.

Some other useful instructions that we will use for function calls are:

  • push <reg>
  • pop <reg>

x86 has some instructions that abbreviate the instructions for pre-calls and epilogues:

  • call <label>: push the address of the next instruction (the return address) onto the stack, then jump to the label (the last part of the pre-call instruction sequence).
  • leave: copies %ebp to %esp then restores %ebp by popping from the stack (the middle part of the epilogue instruction sequence).
  • ret: pop a code location from the stack and jump to it (the last part of the epilogue instruction sequence)

x86 has a specific set of caller-save and callee-save registers. However, we only need to save them if we're actually using them, and because of our naive codegen we do not actually need to worry about them.

  • use %eax for RR and %edx for OR
  • use %esp and %ebp for the stack and frame pointers
  • leave all other registers alone
  • do not need to do any caller or callee saves (except for %esp and %ebp which we have already covered)

This works because no callee function will write to the callee-save registers (and so it does not need to save them) and no caller function needs a register to be preserved across a function call (and so it does not need to save them either).

  1. Getting 32-bit x86 assembly for C code

    commands:

    gcc -O0 -m32 -S <file>.c -o <file>.asm
    objdump -d <file>.o
    

    -O0 says to turn off optimizations, -m32 forces 32-bit mode, and -S says to compile but not assemble or link (emit only assembly code). The objdump command disassembles an object file, so it assumes you have already produced <file>.o (e.g., by compiling with -c instead of -S).

    Alternatively, you can use Compiler Explorer: https://gcc.godbolt.org

    • in the left panel, put your C/C++ code
    • in the right panel:
      • choose compiler and version (gcc or clang for x86-64)
      • set compiler options to "-O0 -m32"
      • under the Output menu, make sure "Intel asm syntax" is deselected (because we use AT&T syntax)
    1. Example

      code.c:

      int foo(int x, int y) {
        int a = x * y;
        int b = a + 2;
        return b;
      }
      

      The output below was produced with objdump; you could instead use Compiler Explorer or gcc -S to get similar assembly.

      output:

      00000000 <foo>:
         0:   55                      push   %ebp
         1:   89 e5                   mov    %esp,%ebp
         3:   83 ec 10                sub    $0x10,%esp
         6:   8b 45 08                mov    0x8(%ebp),%eax
         9:   0f af 45 0c             imul   0xc(%ebp),%eax
         d:   89 45 fc                mov    %eax,-0x4(%ebp)
        10:   8b 45 fc                mov    -0x4(%ebp),%eax
        13:   83 c0 02                add    $0x2,%eax
        16:   89 45 f8                mov    %eax,-0x8(%ebp)
        19:   8b 45 f8                mov    -0x8(%ebp),%eax
        1c:   c9                      leave
        1d:   c3                      ret
      

      What is going on in the code above:

      0:  put old frame pointer in stack frame
      1:  set current frame pointer to current stack pointer
      3:  add space to stack for locals
      6:  load x from memory
      9:  multiply by y
      d:  store to a
      10: load a
      13: add 2
      16: store to b
      19: load b
      1c: tear down current stack frame
      1d: pop return address from stack and jump to it

      Note that the memory offsets for x and y (the arguments) are positive offsets from the frame pointer (they live in the caller's frame, at higher addresses), while those for a and b (the locals) are negative (they live in the current frame, between the frame pointer and the stack pointer).

5 Optimization: Making programs faster

"No matter how hard you try, you cannot make a racehorse out of a pig. You can, however, make a faster pig."

  • Jamie Zawinski

5.1 Introduction to analysis and optimization

So far we have examined the frontend (lexing, parsing), and the backend (codegen). Let's spend some time looking at the middle-end and optimizations. What do we mean by "optimization"?

  • usually we mean performance: making the generated code execute faster. Sometimes we want to optimize for other things, like code size, memory consumption, or power consumption (all of which are important for embedded systems). For our purposes we will focus on performance optimization.
  • "optimization" is a somewhat misleading term, because it implies we are going to find some optimal (i.e., the best) solution. In reality finding the best solution is NP-Hard or undecidable and we are just looking for a better solution (e.g., that the optimized code runs faster than the unoptimized code). There is a such thing as a "super-optimizer" which does look for the best possible solution, but this is an active area of research and current solutions cannot scale beyond tens of lines of code.

How do we optimize for performance? The general idea is to reduce the amount of computation the generated executable needs to do: to somehow eliminate instructions or to replace slower instructions with faster instructions, while still guaranteeing that the resulting program "does the same thing".

Examples

// arithmetic identities
x := y * 0;    --> x := 0;

// constant folding
x := 1 + 2;    --> x := 3;

// constant propagation and folding
x := 1 + 2;    --> x := 3;
y := x + 3;    --> y := 6;

// redundancy elimination
x := a + b;    --> x := a + b;
y := a + b;        y := x;

// strength reduction
x := y / 2;    --> x := y >> 1;

The key formula behind compiler optimizations is:

optimization = analysis + transformation

We need to use program analysis (aka static analysis) to determine what is true about the program's behavior (e.g., what things are constant values, what expressions are guaranteed to evaluate to the same value, etc), then we select code transformations that, based on the information provided by the analysis, are safe (preserve behavior) and effective (improve performance).

We could perform optimizations in the backend on the generated assembly code, but then we would have to reimplement them for each backend we add to the compiler; also, the generated assembly may have peculiarities that make the optimizations harder to reason about or computationally harder to implement. There are some optimizations that are inherently target-specific, and we reserve those for the backend, but mostly we try to optimize in a target-agnostic way so that we can reuse the same optimizations for all targets. Thus, the "middle-end".

The middle-end optimizations could operate directly on the AST, but it is nicer if the code is simpler and more regular; it makes the optimizations easier to specify and implement. Usually, the middle-end translates the AST into a simpler "intermediate representation" (IR). It performs the optimizations on the IR, then the backend code generator translates the IR into assembly.

The optimizer can also operate at a number of different scopes, ranging from small fragments of code (local optimization) to entire functions (global optimization) to sets of functions (inter-procedural optimization) to entire programs (whole-program optimization). The larger the scope the optimizer operates at, the more effective it can be, but also the more complex and expensive, because it now needs to reason about the interactions among a larger set of components.

We will focus on local optimizations and global optimizations. The term "global" may seem odd since we're focusing on a single function; the term comes from a time when local optimization was the norm and operating at the scope of an entire function was seen as extremely aggressive (operating on even larger scopes was out of the question).

5.2 Safety and profitability

We stated two important properties of optimization:

  • it should be safe (preserve program behavior)
  • it should be profitable (improve program performance)

Let's dig a little into each, because they are not as straightforward as they seem.

5.2.1 Safety

We said that optimizations should preserve behavior, but what does that mean? If you think about it, the whole point is to change program behavior by making it run faster using different instructions. What are we preserving? What does "behavior" even mean?

Roughly, we can think of the behavior we're preserving as user-observable events independent of time. In other words, if we run the original program P and record the user-observable events (e.g., outputs) without timestamps, then do the same with the optimized program P', we should get the same thing. For C♭, this would be just the final output result.

This definition will work fine for C♭, but it isn't adequate for languages like C and C++. These languages have the concept of "undefined behavior". Literally, this means that for certain code expressions and statements the language standard doesn't say what should happen, thus the compiler implementor is free to do whatever they want. In fact, it goes even further than that: a program that executes a statement with undefined behavior isn't defined at all; the entire program execution is undefined, not just that statement. This is important to understand: the effects of undefined behavior are not confined to the statement that is undefined, or even the execution after that statement, but the entire execution, even before that statement is executed.

Let's look at some examples (taken from the blog of John Regehr, a CS professor whose research deals a lot with undefined behavior in C).

#include <limits.h>
#include <stdio.h>

int main (void)
{
  printf ("%d\n", (INT_MAX+1) < 0);
  return 0;
}

This program asks the question: if we take the maximum integer value and add 1, do we get a negative number? In other words, do integers wrap around? So what happens when we run this program?

$ gcc test.c -o test
$ ./test
1

or:

$ gcc test.c -o test
$ ./test
0

or:

$ gcc test.c -o test
$ ./test
42

or:

$ gcc test.c -o test
$ ./test
Formatting root partition, then doing your laundry.

Any of these could happen, or anything else. The program has undefined behavior because it exhibits signed integer overflow, that is, we took a signed integer and added a value to it that caused the result to be out of the range that a signed integer can express. The C standard says this is undefined (note that unsigned integer overflow is defined to wrap around).

Now, you might think "on an x86 the C integer add instruction is implemented using the x86 ADD assembly instruction, which operates using two's complement arithmetic, which does wrap around. Thus, if I'm running the program on an x86 I can expect to get 1." This is bad reasoning, for the following reasons:

  1. you do not know what architectures your program will be compiled on, even if you are compiling for x86 right now.
  2. not all compilers will work this way, even on an x86, even if the one you're using right now does.
  3. at certain optimization levels, a compiler that previously worked this way may stop working this way, because it takes advantage of the undefined behavior to optimize the program.

Let's take a closer look at that third reason with an example:

int foo(int a) { return (a+1) > a; }

Note that if we pass in INT_MAX for a, then we get undefined behavior. The compiler optimizer can reason this way:

  • case 1: 'a' is not INT_MAX. Then the answer is guaranteed to be 1.
  • case 2: 'a' is INT_MAX. Then the behavior is undefined and we can do whatever we want. Let's make the answer 1.

It then outputs the following optimized code:

int foo(int a) { return 1; }

This is completely valid, even if we're running the program on an x86 and using the x86 ADD instruction which exhibits two's complement behavior, so that INT_MAX + 1 < INT_MAX.

We have only looked at signed integer overflow, but the C99 language standard lists 191 kinds of undefined behavior. This is why it's always a good idea to use -Wall for C/C++ compilation and pay attention to the warnings.

5.2.2 Profitability

Another consideration is that the transformation we apply to the code to optimize it should actually result in faster code. This can be trickier than you might think.

Example: Function Inlining

Consider the following example:

def foo(int a) { return a + 1; }
int x;
x := foo(2);
output x;

At runtime, the execution will look something like:

precall sequence
function prologue
compute a+1
function epilogue
post-return sequence
output x

Notice that the precall, prologue, epilogue, and post-return are all overhead that comes from making a function call. We can optimize this code by inlining the function call, that is, by copying the body of the function directly into the caller:

int x;
x := 2 + 1;
output x;

This has exposed another optimization opportunity for constant folding and propagation, yielding the following optimized program:

output 3;

Function inlining both eliminates the overhead of function calls and also exposes many opportunities for optimization that wouldn't be obvious if we looked at functions in isolation. In fact, it is a central optimization in languages where functions are the main abstraction (such as functional languages like Scala, Haskell, and Rust). We can perform function inlining whenever we know exactly which function is being called (always in C♭; usually in C except for function pointers; sometimes in C++ because of virtual methods).

You might think: inlining one function call was good, so obviously we should inline as many as possible! But this actually is not a good idea. For one, we cannot fully inline recursive calls (we would get into a cycle of infinitely repeated inlinings); but let's ignore that issue. The real problem is that inlining comes with a cost: the code becomes larger.

def foo(int a) { <instruction 1>; ... <instruction N>; return <aexp>; }
int x;
int y;
x := foo(2);
y := foo(3);
output x + y;

Becomes:

int x;
int y;
<instruction 1>;
.
.
.
<instruction N>;
<instruction 1>;
.
.
.
<instruction N>;
output x + y;

There were two calls to foo, so we doubled the amount of code inside foo's body. Consider the following:

def foo(int a) { <instruction 1>; ...; bar(); ...; bar(); <instruction N>; return <aexp>; }
def bar() { <instruction 1'>; ...; <instruction M'>; return <aexp>; }
int x;
int y;
x := foo(2);
y := foo(3);
output x + y;

becomes:

int x;
int y;
<instruction 1>;
.
.
.
<instruction 1'>;
.
.
.
<instruction M'>;
.
.
.
<instruction N>;
<instruction 1>;
.
.
.
<instruction 1'>;
.
.
.
<instruction M'>;
.
.
.
<instruction N>;
output x + y;

We doubled the amount of code in foo, and because each copy of foo contains two inlined copies of bar, the code of bar now appears four times. In general, inlining can increase code size exponentially (with two call sites per function and a chain of k such functions, roughly \( 2^k \) copies of the innermost body). This impacts not just memory consumption, but also performance, because the large code size can blow the instruction cache and cause large slowdowns.

In practice, compilers use a set of complicated heuristics to determine whether a function call should be inlined or not. A simple example heuristic is to inline functions that are called only once, or functions whose bodies are small.

Many optimizations have similar considerations: just because we can apply the optimization doesn't mean it's a good idea. We need to figure out in each case whether applying the optimization results in a net performance benefit or loss and act accordingly.

5.3 IR: 3-address code

Let's discuss the intermediate representation (IR) we will use for the middle-end. It will be a classic example of "three-address code", similar to what many real compilers use. The name "three-address code" refers to the fact that no instruction involves more than three "addresses" (think: variables). Here is the syntactic definition (the grammar) of the IR for expressions:

\begin{align*}
binop & ::= \; + \; | \; - \; | \; * \; | \; < \; | \; <= \; | \; = \; | \; \texttt{AND} \; | \; \texttt{OR} \\
operand & ::= \; variable \; | \; constant \\
instruction & ::= \; variable \leftarrow operand \; binop \; operand \\
& \; | \; variable \leftarrow \texttt{NOT} \; operand \\
& \; | \; variable \leftarrow \texttt{CALL} \; function \\
& \; | \; variable \leftarrow operand \\
& \; | \; \texttt{arg} \; operand \\
& \; | \; \texttt{return} \; operand \\
& \; | \; \texttt{output} \; operand \\
& \; | \; \texttt{jump} \; label \\
& \; | \; \texttt{jump\_if\_0} \; operand \; label
\end{align*}

The bodies of functions and the global block are replaced with vectors of IR instructions; the rest of the program's structure stays the same. So an IR program still consists of function definitions and a main part of the program. In general, having a simple IR allows us to reason about our optimizations in a simpler and more modular fashion.

5.3.1 Example

Let's look at an example. Here is the original program:

def bar(int a) : int { return a + (a - 12) * 42; }

def foo(int a) : int {
    int x;
    int y;
    x := bar(a);
    y := bar(a+1);
    return x*y;
}

int x;
while (x < 2) { x := x + 1; }
x := foo(x);
output x+3;

Here is the IR version:

    def bar(int a) {
        _tmp0 <- a - 12
        _tmp1 <- _tmp0 * 42
        _tmp2 <- a + _tmp1
        return _tmp2
    }

    def foo(int a) {
        arg a
        x <- CALL bar
        _tmp0 <- a + 1
        arg _tmp0
        y <- CALL bar
        _tmp1 <- x * y
        return _tmp1
    }

WHILE_START_0:
    _tmp0 <- x < 2
    jump_if_0 _tmp0 WHILE_END_0

    x <- x + 1
    jump WHILE_START_0

WHILE_END_0:
    arg x
    x <- CALL foo
    _tmp1 <- x + 3
    output _tmp1

Note that we are writing the labels on separate lines like instructions for convenience, but they are not actually instructions and do not take up a line of code.

We see that the IR is a lot like assembly, except that it's higher-level (target agnostic, keeps the abstraction of functions and function calls, allows an arbitrary number of variables). It does strip away the variable definitions, type annotations, etc., because we assume that these are already in the symbol table.

This IR is nicer to work with for optimizations because we do not have to deal with complex nested expressions and we can assume that every instruction is in one of a few kinds of formats and never has more than two operands.

How do we generate the IR from the AST? It works exactly like the codegen we have already examined, except that we emit IR instructions instead of assembly. Other than that, we can use the same strategy and implementation. For clarity, in this example I always created a new temporary, but we could also reuse temporaries between instructions using the same strategy as we did for assembly code generation.
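
As a concrete illustration, here is a small C++ sketch of lowering arithmetic expressions to three-address code. The Aexp type and helper functions are made-up stand-ins rather than the actual compiler's AST; each operator node gets a fresh temporary and emits exactly one IR instruction, reproducing the IR shown above for the body of bar.

// A minimal sketch of AST-to-IR lowering for arithmetic expressions.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Aexp {
  std::string op;                       // "" for a leaf, or "+", "-", "*"
  std::string leaf;                     // variable name or constant (for leaves)
  std::unique_ptr<Aexp> lhs, rhs;       // children (for operator nodes)
};

std::vector<std::string> ir;            // the emitted IR, one instruction per entry
int tmp_counter = 0;

// Returns the operand (variable, constant, or temporary) holding the result.
std::string genAexp(const Aexp& e) {
  if (e.op.empty()) return e.leaf;      // leaves are used directly
  std::string l = genAexp(*e.lhs);
  std::string r = genAexp(*e.rhs);
  std::string t = "_tmp" + std::to_string(tmp_counter++);
  ir.push_back(t + " <- " + l + " " + e.op + " " + r);
  return t;
}

std::unique_ptr<Aexp> leaf(const std::string& s) {
  auto e = std::make_unique<Aexp>();
  e->leaf = s;
  return e;
}

std::unique_ptr<Aexp> node(const std::string& op, std::unique_ptr<Aexp> l, std::unique_ptr<Aexp> r) {
  auto e = std::make_unique<Aexp>();
  e->op = op; e->lhs = std::move(l); e->rhs = std::move(r);
  return e;
}

int main() {
  // a + (a - 12) * 42, as in the body of bar() above
  auto e = node("+", leaf("a"), node("*", node("-", leaf("a"), leaf("12")), leaf("42")));
  std::string result = genAexp(*e);
  for (const auto& insn : ir) std::cout << insn << "\n";
  std::cout << "return " << result << "\n";
}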

5.3.2 Exercise

Translate the following program:

def foo(int a, int b) : int {
  int x;
  int y;

  x := a + b * (a - b);

  while (y <= x + 3) {
    y := y + 10;
  }

  return y * 2;
}

int t1;
int t2;

t1 := 42;
t2 := foo(t1 + 2, t1 * 2);
if (t1 < t2) { t1 := t2; }
else { t1 := t1 + 2; }

output t1 + 42;

Solution

    def foo(int a, int b) {
      _tmp0 <- a - b
      _tmp1 <- b * _tmp0
      x <- a + _tmp1

WHILE_START_0:
      _tmp2 <- x + 3
      _tmp3 <- y <= _tmp2
      jump_if_0 _tmp3 WHILE_END_0

      y <- y + 10
      jump WHILE_START_0

WHILE_END_0:
      _tmp4 <- y * 2
      return _tmp4
    }

    t1 <- 42
    _tmp0 <- t1 + 2
    _tmp1 <- t1 * 2
    arg _tmp0
    arg _tmp1
    t2 <- CALL foo

    _tmp2 <- t1 < t2
    jump_if_0 _tmp2 IF_FALSE_1
    t1 <- t2
    jump IF_END_1

IF_FALSE_1:
    t1 <- t1 + 2

IF_END_1:
    _tmp3 <- t1 + 42
    output _tmp3

5.4 Control-flow graphs (CFG)

Unlike codegen, we do not output the IR as a linear list of instructions. Instead, we use a data structure called the control-flow graph. Confusingly, we abbreviate this as CFG, just like context-free grammars, so in the frontend CFG will mean context-free grammar and in the middle-end it will mean control-flow graph; you just have to use the surrounding context to figure out which one is meant.

A CFG models the control flow of the program, i.e., which instructions can execute after which other instructions. It is a directed graph whose nodes are basic blocks (sequences of instructions, to be defined more precisely soon). An edge from basic block A to basic block B means that immediately after A executes, B may execute (not necessarily must execute, just may execute).

A basic block is a sequence of instructions with the following property: control must enter at the first instruction of the block and can only exit after the last instruction in the block, i.e., the block has a single entry point and a single exit point. In other words, if program execution enters the basic block then every instruction in the basic block must be executed. When we were talking about optimization scope and said that local optimizations operate on "small fragments of code", what we really meant was basic blocks.

5.4.1 Examples

The CFG is perhaps most easily understood by example.

Example 1

program:

x := foo();
if (x < 10) { x := 3; y := 2; } else { x := 2; y := 3; }
z := x + y * 2;
output z;

IR (here we label the basic blocks with letters [A], [B], …):

    [A]
    x <- CALL foo
    _tmp0 <- x < 10
    jump_if_0 _tmp0 IF_FALSE_0

    [B]
    x <- 3
    y <- 2
    jump IF_END_0

    [C]
IF_FALSE_0:
    x <- 2
    y <- 3

    [D]
IF_END_0:
    _tmp1 <- y * 2
    z <- x + _tmp1
    output z

Each separate block of instructions is a basic block. Note that:

  • a jump, return, or output always ends a basic block.
  • a basic block doesn't have to end in a jump, return, or output (see false branch).
    • the block has to end there because the next instruction is the target of a jump, and we cannot allow a basic block to be entered in the middle. Since, given the way we generate the IR, every jump target has to be a label, it's fair to say that any labeled instruction will automatically start a new basic block.
  • a call does not end a basic block. while it does transfer control somewhere else, eventually that control will always return to immediately after the call. That means we preserve the invariant that every instruction in the basic block must be executed.

CFG:

cfg1.png

Example 2:

program:

x := foo();
while (x < 10) {
  x := x + 1;
  y := y + 1;
  z := x * y;
}
output z;

IR:

    [A]
    x <- CALL foo

    [B]
WHILE_START_0:
    _tmp0 <- x < 10
    jump_if_0 _tmp0 WHILE_END_0

    [C]
    x <- x + 1
    y <- y + 1
    z <- x * y
    jump WHILE_START_0

    [D]
WHILE_END_0:
    output z

CFG:

cfg2.png

Example 3:

program:

x := foo();
while (x < 10) {
  x := x + 1;
  if (x <= y) {
    while (x < y) {
      y := y - 2;
    }
  }
}
output y;

IR:

    [A]
    x <- CALL foo

    [B]
WHILE_START_0:
    _tmp0 <- x < 10
    jump_if_0 _tmp0 WHILE_END_0

    [C]
    x <- x + 1
    _tmp1 <- x <= y
    jump_if_0 _tmp1 IF_FALSE_1

    [D]
WHILE_START_2:
    _tmp2 <- x < y
    jump_if_0 _tmp2 WHILE_END_2

    [E]
    y <- y - 2
    jump WHILE_START_2

    [F]
WHILE_END_2:
IF_FALSE_1:
    jump WHILE_START_0

    [G]
WHILE_END_0:
    output y

CFG:

cfg3.png

5.4.2 Building a CFG

The basic algorithm for building a CFG is simple, though various language features can complicate it a bit. Let's assume that we start with a linear sequence of IR instructions (something like what we would get from our codegen algorithm, a vector of instructions, but IR instructions rather than assembly instructions). Also, this is per function (including treating the "global" block of code ending in 'output' as a function itself).

We will use the following example to illustrate the algorithm:

00  x <- CALL foo
01  WHILE_START_0: _tmp0 <- x < 10
02  jump_if_0 _tmp0 WHILE_END_0
03  x <- x + 1
04  _tmp1 <- x <= y
05  jump_if_0 _tmp1 IF_FALSE_1
06  WHILE_START_2: _tmp2 <- x < y
07  jump_if_0 _tmp2 WHILE_END_2
08  y <- y - 2
09  jump WHILE_START_2
10  WHILE_END_2: IF_FALSE_1: jump WHILE_START_0
11  WHILE_END_0: output y
  1. Identify all "leader" instructions (the instructions that start a basic block). A leader is the first instruction in the function, any labeled instruction (because it can be the target of a jump), and any instruction immediately following a conditional jump (because it is the fall-through instruction if the conditional jump is not taken). Remember to treat multiple consecutive labels as a single label for these purposes. Save the indices of these leader instructions in a vector, in the order they appear in the instruction sequence.

    ==> the leaders for the example are indices { 00, 01, 03, 06, 08, 10, 11 }.

  2. Record the basic blocks as the sequences of instructions from each leader up to, but not including, the next leader in sequence.

    The basic blocks for the example are:

    [A]
    00 x <- CALL foo
    
    [B]
    01 WHILE_START_0: _tmp0 <- x < 10
    02 jump_if_0 _tmp0 WHILE_END_0
    
    [C]
    03 x <- x + 1
    04 _tmp1 <- x <= y
    05 jump_if_0 _tmp1 IF_FALSE_1
    
    [D]
    06 WHILE_START_2: _tmp2 <- x < y
    07 jump_if_0 _tmp2 WHILE_END_2
    
    [E]
    08 y <- y - 2
    09 jump WHILE_START_2
    
    [F]
    10 WHILE_END_2: IF_FALSE_1: jump WHILE_START_0
    
    [G]
    11 WHILE_END_0: output y
    
  3. For each basic block, look at the last instruction:
    1. if it is not a jump, add an edge from this basic block to the basic block whose leader is immediately next in sequence.

      ==> basic block A has an edge to B; basic block G ends the function, so it gets no outgoing edge.

    2. if it is an unconditional jump to label L, add an edge from this basic block to the basic block whose leader is labeled L.

      ==> for the example, basic block E has an edge to D; basic block F has an edge to B.

    3. if it is a conditional jump to label L:
      1. add an edge from this basic block to the basic block whose leader is immediately next in sequence.
      2. add an edge from this basic block to the basic block whose leader is labeled L.

==> for the example, basic block B has edges to C and G; basic block C has edges to D and F; basic block D has edges to E and F.
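
Here is a rough C++ sketch of the algorithm above, run on the same example. The Insn type is a simplified stand-in (it records only what the algorithm needs: attached labels, whether the instruction is a plain instruction, an unconditional jump, or a conditional jump, and the jump target); it prints the same edges we derived by hand.

// A sketch of CFG construction over a simplified instruction representation.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Insn {
  std::vector<std::string> labels;            // labels attached to this instruction
  enum Kind { PLAIN, JUMP, COND_JUMP } kind;  // how the instruction affects control flow
  std::string target;                         // jump target label, if any
};

int main() {
  std::vector<Insn> insns = {
      {{}, Insn::PLAIN, ""},                                         // 00 x <- CALL foo
      {{"WHILE_START_0"}, Insn::PLAIN, ""},                          // 01 _tmp0 <- x < 10
      {{}, Insn::COND_JUMP, "WHILE_END_0"},                          // 02
      {{}, Insn::PLAIN, ""},                                         // 03 x <- x + 1
      {{}, Insn::PLAIN, ""},                                         // 04 _tmp1 <- x <= y
      {{}, Insn::COND_JUMP, "IF_FALSE_1"},                           // 05
      {{"WHILE_START_2"}, Insn::PLAIN, ""},                          // 06 _tmp2 <- x < y
      {{}, Insn::COND_JUMP, "WHILE_END_2"},                          // 07
      {{}, Insn::PLAIN, ""},                                         // 08 y <- y - 2
      {{}, Insn::JUMP, "WHILE_START_2"},                             // 09
      {{"WHILE_END_2", "IF_FALSE_1"}, Insn::JUMP, "WHILE_START_0"},  // 10
      {{"WHILE_END_0"}, Insn::PLAIN, ""},                            // 11 output y
  };

  // step 1: identify the leaders
  std::set<size_t> leaders = {0};
  for (size_t i = 0; i < insns.size(); ++i) {
    if (!insns[i].labels.empty()) leaders.insert(i);
    if (insns[i].kind == Insn::COND_JUMP && i + 1 < insns.size()) leaders.insert(i + 1);
  }

  // step 2: each leader starts a block that runs up to (not including) the next
  // leader; also remember which block each label belongs to
  std::vector<size_t> order(leaders.begin(), leaders.end());
  std::map<std::string, size_t> block_of_label;
  for (size_t b = 0; b < order.size(); ++b)
    for (const std::string& lab : insns[order[b]].labels) block_of_label[lab] = b;

  // step 3: add edges based on each block's last instruction
  for (size_t b = 0; b < order.size(); ++b) {
    size_t last = (b + 1 < order.size() ? order[b + 1] : insns.size()) - 1;
    const Insn& insn = insns[last];
    if (insn.kind != Insn::JUMP && b + 1 < order.size())               // fall-through edge
      std::cout << char('A' + b) << " -> " << char('A' + b + 1) << "\n";
    if (insn.kind != Insn::PLAIN)                                      // jump-target edge
      std::cout << char('A' + b) << " -> " << char('A' + block_of_label[insn.target]) << "\n";
  }
}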

5.4.3 Exercise

Create the CFG for the following program:

def foo(int a, int b) {
  _tmp0 <- a - b
  _tmp1 <- b * _tmp0
  x <- a + _tmp1
  WHILE_START_0:
  _tmp2 <- x + 3
  _tmp3 <- y <= _tmp2
  jump_if_0 _tmp3 WHILE_END_0
  y <- y + 10
  jump WHILE_START_0
  WHILE_END_0:
  _tmp4 <- y * 2
  return _tmp4
}

t1 <- 42
_tmp0 <- t1 + 2
_tmp1 <- t1 * 2
arg _tmp0
arg _tmp1
t2 <- CALL foo
_tmp2 <- t1 < t2
jump_if_0 _tmp2 IF_FALSE_1
t1 <- t2
jump IF_END_1
IF_FALSE_1:
t1 <- t1 + 2
IF_END_1:
_tmp3 <- t1 + 42
output _tmp3

Solution

basic blocks:

def foo(int a, int b) {
  [A]
  _tmp0 <- a - b
  _tmp1 <- b * _tmp0
  x <- a + _tmp1

  [B]
  WHILE_START_0:
  _tmp2 <- x + 3
  _tmp3 <- y <= _tmp2
  jump_if_0 _tmp3 WHILE_END_0

  [C]
  y <- y + 10
  jump WHILE_START_0

  [D]
  WHILE_END_0:
  _tmp4 <- y * 2
  return _tmp4
}

[E]
t1 <- 42
_tmp0 <- t1 + 2
_tmp1 <- t1 * 2
arg _tmp0
arg _tmp1
t2 <- CALL foo
_tmp2 <- t1 < t2
jump_if_0 _tmp2 IF_FALSE_1

[F]
t1 <- t2
jump IF_END_1

[G]
IF_FALSE_1:
t1 <- t1 + 2

[H]
IF_END_1:
_tmp3 <- t1 + 42
output _tmp3

cfg:

exercise-cfg.png

5.5 Local optimizations

Local optimizations operate on basic blocks: we iterate through the blocks (in arbitrary order) and optimize each block in isolation from the others. We already saw some of these optimizations as examples at the beginning of the optimization section.

5.5.1 Constant Folding and Arithmetic Identities

We have already mentioned these; they are simple optimizations that operate on single instructions. The idea is that sometimes we do not need to wait for the program to execute in order to evaluate expressions; we can do it right in the compiler.

For each instruction:

  • if all the operands are constants, evaluate the operation
  • if some of the operands are constant, attempt to apply an arithmetic identity:

    x * 1 = x
    1 * x = x
    x * 0 = 0
    0 * x = 0
    x + 0 = x
    0 + x = x
    x - 0 = x
    0 - x = -x
    x - x = 0
    

We can also apply logical identities under our assumption that 0 is false and non-0 is true:

0 ∧ x = 0
x ∧ 0 = 0
non-0 ∧ x = x
x ∧ non-0 = x
0 ∨ x = x
x ∨ 0 = x
non-0 ∨ x = 1
x ∨ non-0 = 1

We can also handle some identities for the relational operators. These probably would not appear in the original program, but they are easy enough to check for that we might as well; they may also show up as the result of other optimizations:

(x < x) = 0
(x ≤ x) = 1
(x = x) = 1
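
As a small illustration, here is a C++ sketch of folding a single binary IR instruction. Operands are plain strings here, and only a few operators and identities are handled; the real implementation would work on the actual IR data structures.

// A sketch of constant folding and a few arithmetic identities.
#include <cctype>
#include <iostream>
#include <string>

// Treat an operand as a constant if it starts with a digit (a simplification).
bool is_const(const std::string& s) { return !s.empty() && std::isdigit((unsigned char)s[0]); }

// Rewrite "dst <- a op b" into a simpler instruction when possible.
std::string fold(const std::string& dst, const std::string& a,
                 const std::string& op, const std::string& b) {
  if (is_const(a) && is_const(b)) {                 // both operands constant: evaluate now
    long x = std::stol(a), y = std::stol(b);
    long r = (op == "+") ? x + y : (op == "-") ? x - y : x * y;
    return dst + " <- " + std::to_string(r);
  }
  // a few of the arithmetic identities listed above
  if (op == "*" && (a == "0" || b == "0")) return dst + " <- 0";
  if (op == "*" && a == "1") return dst + " <- " + b;
  if (op == "*" && b == "1") return dst + " <- " + a;
  if (op == "+" && a == "0") return dst + " <- " + b;
  if (op == "+" && b == "0") return dst + " <- " + a;
  if (op == "-" && b == "0") return dst + " <- " + a;
  return dst + " <- " + a + " " + op + " " + b;     // nothing to simplify
}

int main() {
  std::cout << fold("x", "1", "+", "2") << "\n";    // x <- 3
  std::cout << fold("x", "y", "*", "0") << "\n";    // x <- 0
  std::cout << fold("x", "y", "*", "1") << "\n";    // x <- y
  std::cout << fold("x", "a", "+", "b") << "\n";    // x <- a + b
}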

5.5.2 Local value numbering

This is a slightly more complicated optimization that operates on the entire basic block. Our goal is to identify redundant expressions computed by the basic block so that we do not compute the same expression more than once.

Example 1:

a <- 4
b <- 5
c <- a + b
d <- 5
e <- a + b

The second 'a + b' is redundant and we could optimize this as:

a <- 4
b <- 5
c <- a + b
d <- 5
e <- c

A naive approach would be to simply scan down the list of instructions in the basic block and record each right-hand-side expression we see; if we see the same expression again, replace it with the left-hand side of the first instruction that computed it. This approach would work for the example above, but suppose we have instead:

Example 2:

a <- 4
b <- 5
c <- a + b
a <- 5
e <- a + b

Now, the optimization would be incorrect, because the value of the second a + b is different than the value of the first a + b. Here is another problem:

Example 3:

a <- 4
b <- 5
c <- a + b
d <- 5
e <- b + a

The second expression b + a is syntactically different than the first a + b but is semantically identical because addition is commutative. We need to be able to recognize that fact (while still distinguishing non-commutative expressions like a - b and b - a).

We need a scheme that can solve both problems. Our strategy will be to tag expressions with numbers such that semantically identical expressions (like a + b and b + a in example 3) are given the same number and hence are recognized as equivalent, while semantically different expressions (like the first a + b and the second a + b in example 2) are given different numbers.

The idea is that we want to distinguish between the variables (which are just arbitrary names for unknown values) and the values of those variables. We are going to label variables to capture different valuations they may have during their lifetime.

Note that our goal is not to guarantee that we find all equivalent expressions; this is uncomputable in general. We want to find as many as we can while still being fast and while guaranteeing that we never incorrectly find non-equivalent expressions to be equivalent. The trade-off between precision (or efficacy) and tractability is a common one that shows up in program analysis a lot, since determining semantic properties of programs precisely is undecidable.

The algorithm:

  1. initialize a table mapping expressions to value numbers (starting empty). We often use a hash table, and thus this optimization is sometimes called "hash-based value numbering".
  2. for each copy, unary, and binary operator instruction in order:
    1. look up each variable operand to get its value number; assign a fresh number if it doesn't have one already and record it in the table.
    2. if the operator is commutative, order the operand value numbers from least to greatest.
    3. look up the entire expression (using the operand numbers instead of variables) in the table to get its number; assign a fresh number if it doesn't have one already and record it in the table.
    4. update the table to map the left-hand side variable to the number for the right-hand side expression.
  3. for any two instructions '<var1> <- <exp1>' and '<var2> <- <exp2>' in order such that <var1> and <var2> have the same value number at the '<var2> <- <exp2>' instruction, we can replace <exp2> with <var1>.
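
Here is a compact C++ sketch of this algorithm; the instruction format and data structures are simplified stand-ins for the real IR, and the worked examples below trace the same steps by hand.

// A sketch of hash-based local value numbering over one basic block.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// "dst <- a op b" for binary operations, or "dst <- a" (op and b empty) for copies.
struct Insn { std::string dst, a, op, b; };

int main() {
  std::vector<Insn> block = {
      {"a", "4", "", ""}, {"b", "5", "", ""}, {"c", "a", "+", "b"},
      {"d", "5", "", ""}, {"e", "b", "+", "a"},
  };

  std::map<std::string, int> number;   // constant or expression key -> value number
  std::map<std::string, int> var_num;  // variable -> its current value number
  int next = 1;

  // A variable uses its current number; constants get a fresh number when first seen.
  auto value_of = [&](const std::string& operand) {
    if (var_num.count(operand)) return var_num[operand];
    if (!number.count(operand)) number[operand] = next++;
    return number[operand];
  };

  for (Insn& insn : block) {
    int result_num;
    if (insn.op.empty()) {
      result_num = value_of(insn.a);                    // a copy has the value of its source
    } else {
      int va = value_of(insn.a), vb = value_of(insn.b);
      if ((insn.op == "+" || insn.op == "*") && vb < va) std::swap(va, vb);  // commutative
      std::string key = "#" + std::to_string(va) + insn.op + "#" + std::to_string(vb);
      if (!number.count(key)) {
        number[key] = next++;                           // first time we compute this value
      } else {
        // redundant: replace with any variable that currently holds this value number
        for (const auto& entry : var_num)
          if (entry.second == number[key]) {
            insn.a = entry.first; insn.op.clear(); insn.b.clear();
            break;
          }
      }
      result_num = number[key];
    }
    var_num[insn.dst] = result_num;                     // the lhs now holds this value
  }

  for (const Insn& insn : block)
    std::cout << insn.dst << " <- " << insn.a
              << (insn.op.empty() ? "" : " " + insn.op + " " + insn.b) << "\n";
}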

Example 1 redux

a <- 4      // 4 = #1, a = #1
b <- 5      // 5 = #2, b = #2
c <- a + b  // (#1 + #2) = #3, c = #3
d <- 5      // 5 = #2, d = #2
e <- a + b  // (#1 + #2) = #3, e = #3

We see that c = #3 and e = #3; thus we can replace the last instruction with 'e <- c'.

Example 2 redux

a <- 4      // 4 = #1, a = #1
b <- 5      // 5 = #2, b = #2
c <- a + b  // (#1 + #2) = #3, c = #3
a <- 5      // 5 = #2, a = #2
e <- a + b  // (#2 + #2) = #4, e = #4

Here we see that c and e have different value numbers; thus they are not guaranteed to be equivalent and we cannot apply the transformation.

Example 3 redux

a <- 4      // 4 = #1, a = #1
b <- 5      // 5 = #2, b = #2
c <- a + b  // (#1 + #2) = #3, c = #3
d <- 5      // 5 = #2, d = #2
e <- b + a  // (#1 + #2) = #3, e = #3

Because we ordered the operand value numbers for the commutative + operator, we still recognize that c and e are equivalent in this case.

Example 4

x <- a + d  // a = #1, d = #2, (#1 + #2) = #3, x = #3
y <- a      // y = #1
z <- y + d  // (#1 + #2) = #3, z = #3

We can see from this example that local value numbering can discover equivalent expressions even if they use different variables altogether.

Let's look at another example where the same expression appears in multiple places:

Example 5

a <- x + y  // x = #1, y = #2, (#1 + #2) = #3, a = #3
b <- x + y  // (#1 + #2) = #3, b = #3
a <- 1      // a = #4
c <- x + y  // (#1 + #2) = #3, c = #3
b <- 2      // b = #5
d <- x + y  // (#1 + #2) = #3, d = #3

How can we transform this code to optimize it based on the value numbering? We know that x + y in all four expressions will have the same value, but a and b are redefined at various points. So, it would be incorrect to replace c <- x + y with c <- a (However, rewriting it as c <- b is fine). Similarly, it would be incorrect to replace d <- x + y with either d <- a or d <- b.

This is why we have the caveat in the algorithm that we can only replace <exp2> with <var1> if <var1> and <var2> have the same value number at that instruction. For the instruction c <- x + y we see that while a originally also had value number #3, it now has value number #4 and so cannot be used. Similarly, for the instruction d <- x + y we see that neither a nor b still has value number #3, and so neither can be used.

It is annoying that we have to recompute x + y in the last instruction just because we have overwritten a and b already. We could get around this problem by renaming variables whenever they are assigned to keep them unique:

a0 <- x0 + y0  // x0 = #1, y0 = #2, (#1 + #2) = #3, a0 = #3
b0 <- x0 + y0  // (#1 + #2) = #3, b0 = #3
a1 <- 1        // a1 = #4
c0 <- x0 + y0  // (#1 + #2) = #3, c0 = #3
b1 <- 2        // b1 = #5
d0 <- x0 + y0  // (#1 + #2) = #3, d0 = #3

Now we can replace d0 <- x0 + y0 with either d0 <- a0 or d0 <- b0 with no problem. There is an issue though: renaming in this way becomes much more complicated when we look at the entire function, particularly when we have multiple branches that each redefine the same variables. We would need a way to merge the values from different branches. This leads to a form of IR called "static single assignment form" (SSA), which is a popular variant of IR in modern compilers (both gcc and llvm use it). We may talk about SSA later, but for now we will stick to the original IR we have already defined.

5.5.3 Exercise

Optimize the following code using local value numbering:

a <- b - c
d <- c - b
e <- c
f <- a AND d
g <- b - e
h <- d AND a
d <- 42
i <- e - b

Solution:

value-numbered code:

a <- b - c    // b = #1, c = #2, (#1 - #2) = #3, a = #3
d <- c - b    // (#2 - #1) = #4, d = #4
e <- c        // e = #2
f <- a AND d  // (#3 AND #4) = #5, f = #5
g <- b - e    // (#1 - #2) = #3, g = #3
h <- d AND a  // (#3 AND #4) = #5, h = #5
d <- 42       // 42 = #6, d = #6
i <- e - b    // (#2 - #1) = #4, i = #4

transformed code:

a <- b - c
d <- c - b
e <- c
f <- a AND d
g <- a
h <- f
d <- 42
i <- e - b

5.6 Data Flow Analysis

We have looked at analysis and optimization at the basic block level (local optimization), but the limited scope means that we can miss many optimization opportunities. Let's widen our scope to entire functions, i.e., global optimization.

The analyses required for global optimization tend to be more complicated than for local optimization, and programming language researchers have developed a specific theoretical framework for it called "data flow analysis" (DFA for short). Compiler developers tried a lot of ad-hoc global analyses at first, but found that they were very difficult to get correct; data flow analysis provides a handy set of tools such that, if you use them correctly, you will get a correct analysis where you can prove correctness mathematically.

We will first introduce some fundamental ideas that will be important for understanding how DFA works at a high level. Then we will look at some specific examples of analyses and the optimizations they enable. Finally, we will generalize to the DFA framework and discuss its properties.

5.6.1 Motivation

When we are analyzing a function, we are trying to compute invariants about its behavior when it executes: What facts are always true no matter what arguments we give the function?

Here is a simple example:

Example 1

def foo(int a, int b) {
    int c;
    if (a < b) { c := b - a; }
    else { c := a - b; }
    return c;
}

Given this function, we don't know what the values of the arguments will be, nor whether we will take the true or false branch of the conditional. But we can infer the invariant that the value returned by this function will always be non-negative. Here is our reasoning:

  • either a < b or b < a or a = b.
  • if a < b then we take the true branch, c is positive
  • if b < a then we take the false branch, c is positive
  • if a = b then we take the false branch, c is zero
  • therefore c cannot be negative at the return statement

In this example we are reasoning about the sign of c (positive, negative, or zero), so the facts that we care about pertain to that. In other analyses we care about other properties (as we will see examples of soon). Data flow analysis provides a general framework that is independent of the particular facts we want to reason about; basically, it is a template into which we plug in different kinds of properties we keep track of depending on the specific analysis we are trying to perform.

How do we perform our reasoning over the facts that we care about? In the example we reasoned about each possible path that execution could take through the function (i.e., it could take the true branch or the false branch, then either way it will reach the return statement). This is easy if there's only one or a few possible paths, but in general this won't work.

Example 2

A
if (...) { B } { C }
if (...) { D } { E }

possible paths in the program:

  • A → B → D
  • A → B → E
  • A → C → D
  • A → C → E

There are \( 2^2 = 4 \) paths.

Example 3

A
if (...) { B } { C }
if (...) { D } { E }
if (...) { F } { G }

possible paths:

  • A → B → D → F
  • A → B → D → G
  • A → B → E → F
  • A → B → E → G
  • A → C → D → F
  • A → C → D → G
  • A → C → E → F
  • A → C → E → G

There are \( 2^3 = 8 \) paths.

In general, for \( n \) conditionals there are \( 2^n \) paths: an exponential blowup. Things are worse for loops:

Example 4

A
while (...) { B }
C

possible paths:

  • A → C
  • A → B → C
  • A → B → B → C
  • A → B → B → B → C

There are arbitrarily many paths (in principle, infinitely many).

Somehow, we need to reason about the function's behavior without reasoning about each path it could take. However, remember that in order to use the analysis results to optimize the function, we need to be safe: if we say that something is true about the function, it must really be true or the optimization could break the program. So the analysis must never return a false invariant (i.e., a fact that isn't true for some path in the function), but it can't reason about each path independently.

DFA handles this conundrum by computing a conservative over-approximation of program behavior. What this means is that we don't promise to compute all true facts, but we do promise that any facts we compute are true. So, the result of the analysis covers all possible executions of the program.

Example 5

def foo(int a, int b) {
    int c;
    if (a < b) { c := b - a; }
    if (b < a) { c := a - b; }
    if (a = b) { c := 1; }
    return c;
}

For this example we can reason about signedness again and determine that c must always be positive at the return statement. However, our analysis may only return that c will be non-negative. This is a true invariant, but it is not as precise as it could be (being positive implies being non-negative). We will allow our analyses to be less precise in order to make them tractable. Being less precise means that the information may not be as useful for optimization (i.e., we may miss optimization opportunities), but because the information is always true we are guaranteed that the optimizations will not break the program.

5.6.2 DFA fundamentals

Let's look at the general DFA framework. We will examine it at a high level here with a simple example, then look at some more examples of common compiler analyses and optimizations, then look at the details of the DFA framework after you have that context. When we go over the high level below it will be very abstract and probably hard to understand. I recommend going through it, then going through the examples, then returning to the high level explanation again once you have the examples as context.

We will let L be a set of dataflow facts, whatever they are (for example, variable signedness in the motivation section above). These will be the possible answers that we compute for the analysis.

For each basic block k in the CFG, we will define INk and OUTk to be the dataflow facts that hold immediately before the basic block is executed and immediately afterwards, respectively. So INk, OUTk ∈ L.

For each basic block, we will compute OUT (the output set of dataflow facts) from IN (the input set of dataflow facts) based on the IR instructions in the basic block. However, the IR instructions operate on values (e.g., integers) while the analysis operates on dataflow facts. So we need to redefine the IR instructions in terms of how they operate on the dataflow facts. Since this depends on what dataflow facts we are tracking for a given analysis, these "abstract transfer functions" are another thing we plug into the DFA framework along with L.

  • let Fk be the abstract transfer function for basic block k
  • then OUTk = Fk(INk)

Of course, we need to know INk for each basic block. If the basic block only has one predecessor, then that's easy: if the predecessor is block p (for predecessor), then INk = OUTp.

But what if block k has two predecessors, say p1 and p2? Then INk has to be a safe approximation of both of their outputs. In other words, INk can only contain facts that are true of both OUTp1 and OUTp2. How do we compute this approximation? We use an operator called the "meet" operator ⊓.

IN_k = OUT_p if p is the only predecessor of k
     = OUT_p1 ⊓ OUT_p2 if p1 and p2 are the predecessors of k

For simplicity, we will say this as: INk = ⊓ OUTp s.t. p is a predecessor of k

The definition of ⊓ is also specific to L, and so it is another thing that we plug into the DFA framework along with L and the abstract transfer functions.

Now we have a set of simultaneous equations, i.e., for each basic block k:

IN_k = ⊓ OUT_p s.t. p is a predecessor of k
OUT_k = F_k(IN_k)

Note that because of loops, these equations may be recursive, that is, INk may transitively depend on OUTk (we will see an example shortly).

The actual analysis consists of solving these equations to compute values for INk and OUTk for every basic block k; these will be the results of the analysis. How do we solve them? The idea is that we want the least fixpoint of these equations, and that will be our analysis answer.

A fixpoint of a function is an input x s.t. \( f(x) = x \); i.e., an input s.t. applying the function to the input yields itself. A function in general may have any number of fixpoints, from zero to infinitely many. Let's see some examples:

f(x) = x + 2  [0 fixpoints]
f(x) = x² - x [2 fixpoints]
f(x) = x² ÷ x [∞ fixpoints]

If there are multiple fixpoints then we want the most precise one (the least fixpoint). For example, in the case above where there are two fixpoints (0 and 2), the least fixpoint is 0. Even if there are multiple fixpoints, there may not be a least fixpoint. For example, in the case above where there are infinitely many fixpoints there is no least fixpoint (assuming we are operating over integers rather than natural numbers).

Fortunately the details of the DFA framework (to be discussed later) will guarantee that a least fixpoint always exists and is computable. So how do we compute it?

The traditional algorithm is a worklist algorithm. We initialize INk and OUTk for every basic block to an initial value, then initialize a worklist to contain all of the basic blocks. Then:

while the worklist is not empty:
  remove a basic block from the worklist, call it k
  compute a new IN_k and OUT_k
  if OUT_k changed:
    put any basic block m s.t. IN_m directly depends on OUT_k into the worklist

The details of the DFA framework (again, to be discussed later) guarantee that this algorithm will eventually terminate and that INk and OUTk will be the least fixpoint solution for every basic block.
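Here is a minimal Python sketch of this worklist loop, with the CFG, the meet operator, the abstract transfer functions, and the initial values passed in as parameters. All of the names below are assumptions for illustration (the specific analyses later in this section show what actually gets plugged in); this is a sketch, not the course's code.

from collections import deque
from functools import reduce

def solve_dataflow(blocks, preds, succs, transfer, meet, init, entry_in):
    # blocks   : list of basic block ids, entry block assumed to have no preds
    # preds    : block id -> list of predecessor ids
    # succs    : block id -> list of successor ids
    # transfer : block id -> (fact -> fact), the abstract transfer functions F_k
    # meet     : (fact, fact) -> fact, the meet operator
    # init     : optimistic initial fact for every IN/OUT
    # entry_in : the fact holding at the entry of the function
    IN  = {k: init for k in blocks}
    OUT = {k: init for k in blocks}
    worklist = deque(blocks)
    while worklist:
        k = worklist.popleft()
        if preds[k]:
            IN[k] = reduce(meet, (OUT[p] for p in preds[k]))
        else:
            IN[k] = entry_in
        new_out = transfer[k](IN[k])
        if new_out != OUT[k]:                 # output changed:
            OUT[k] = new_out                  # re-process the blocks whose IN
            worklist.extend(succs[k])         # depends on OUT_k
    return IN, OUT

Termination of this loop is exactly what the requirements in the DFA framework section below guarantee.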

  1. signedness DFA example

    Here is a signedness DFA (we could make this more precise in various ways, e.g., by being able to distinguish non-negative and non-positive, but we will keep it very simple for the purposes of this example):

    Sign = { +, 0, -, Uninit, Any }
    L = Var -> Sign
    
    ⊓ = (x ∈ Sign) . Any ⊓ x = x ⊓ Any = Any
        (x ∈ Sign) . Uninit ⊓ x = x ⊓ Uninit = x
        (x, y ∈ {+, 0, -}, x ≠ y) . x ⊓ y = Any
        (x ∈ {+, 0, -}) . x ⊓ x = x
    

    Abstract transfer functions:

    var1 <- constant       : update var1 to the signedness of the constant (i.e., +, 0, or -)
    var1 <- var2           : update var1 to the signedness of var2
    var1 <- var2 + var3    : update var1 to var2 #ADD var3, where #ADD is defined below
    var1 <- var2 - var3    : update var1 to var2 #SUB var3, where #SUB is defined below
    var1 <- var2 < var3    : update var1 to var2 #LT var3, where #LT is defined below
    var1 <- CALL function  : update var1 to Any
    

    Here are the tables for the abstract binary operations: (first column is lhs, first row is rhs, the cells are the results. For example, '0' #SUB '+' = '-')

    For #ADD:

    #ADD   |   +      0      -     Uninit  Any
    -------+-------------------------------------
    +      |   +      +     Any    Uninit  Any
    0      |   +      0      -     Uninit  Any
    -      |  Any     -      -     Uninit  Any
    Uninit | Uninit Uninit Uninit  Uninit  Uninit
    Any    |  Any    Any    Any    Uninit  Any

    For #SUB:

    #SUB   |   +      0      -     Uninit  Any
    -------+-------------------------------------
    +      |  Any     +      +     Uninit  Any
    0      |   -      0      +     Uninit  Any
    -      |   -      -     Any    Uninit  Any
    Uninit | Uninit Uninit Uninit  Uninit  Uninit
    Any    |  Any    Any    Any    Uninit  Any

    For #LT:

    #LT + "0" - Uninit Any
    + Any 0 0 Uninit Any
    0 + 0 0 Uninit Any
    - + + Any Uninit Any
    Uninit Uninit Uninit Uninit Uninit Uninit
    Any Any Any Any Uninit Any
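
    As a concrete illustration, here is a small Python sketch encoding the Sign meet and the #ADD table above as dictionaries; representing signs as strings and environments as plain dicts is an assumption for illustration.

    # Sign elements encoded as strings; an element of L maps each variable to one.
    SIGNS = {"+", "0", "-", "Uninit", "Any"}

    def meet_sign(x, y):
        # The meet on Sign, following the cases above.
        if x == "Any" or y == "Any":
            return "Any"
        if x == "Uninit":
            return y
        if y == "Uninit":
            return x
        return x if x == y else "Any"

    def meet_env(e1, e2):
        # Pointwise meet on L = Var -> Sign.
        return {v: meet_sign(e1[v], e2[v]) for v in e1}

    # The #ADD table: rows are the lhs sign, columns the rhs sign.
    ADD = {
        "+":      {"+": "+",   "0": "+",   "-": "Any", "Uninit": "Uninit", "Any": "Any"},
        "0":      {"+": "+",   "0": "0",   "-": "-",   "Uninit": "Uninit", "Any": "Any"},
        "-":      {"+": "Any", "0": "-",   "-": "-",   "Uninit": "Uninit", "Any": "Any"},
        "Uninit": {s: "Uninit" for s in SIGNS},
        "Any":    {"+": "Any", "0": "Any", "-": "Any", "Uninit": "Uninit", "Any": "Any"},
    }

    # e.g., the transfer function for 'x <- y + z' would do: env[x] = ADD[env[y]][env[z]]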

    Source code:

    def foo(int a) : int {
      int x;
      int y;
    
      x := 1;
      y := -1;
    
      if (y < a) {
        while (a < x) {
          x := x - 1;
          y := y - 1;
        }
      }
      else {
        x := x + 1;
        y := -1;
      }
    
      return y;
    }
    

    IR:

    def foo(int a) {
      [B0]
      x <- 1
      y <- -1
      _tmp0 <- y < a
      jump_if_0 _tmp0 IF_FALSE_0
    
      [B1] WHILE_START_1:
      _tmp1 <- a < x
      jump_if_0 _tmp1 WHILE_END_1
    
      [B2]
      x <- x - 1
      y <- y - 1
      jump WHILE_START_1
    
      [B3] WHILE_END_1:
      jump IF_END_0
    
      [B4] IF_FALSE_0:
      x <- x + 1
      y <- -1
    
      [B5] IF_END_0:
      return y
    }
    

    Analysis:

    dfa-example-cfg.png

    Initialization:

    • IN0: [a = Any, x = 0, y = 0, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT0: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • IN1: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT1: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • IN2: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT2: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • IN3: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT3: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • IN4: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT4: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • IN5: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT5: [a = Uninit, x = Uninit, y = Uninit, _tmp0 = Uninit, _tmp1 = Uninit]

    Worklist order: 0, 4, 5, 1, 3, 2, 1, 2, 3, 5

    In lecture, I think I just initialized the worklist with the first basic block. This works for this analysis and language, but in general you need to initialize the worklist with all basic blocks.

    Final answer:

    • IN0: [a = Any, x = 0, y = 0, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT0: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Uninit]
    • IN1: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT1: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN2: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT2: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN3: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT3: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN4: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Uninit]
    • OUT4: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Uninit]
    • IN5: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT5: [a = Any, x = Any, y = -, _tmp0 = Any, _tmp1 = Any]
  2. Exercise

    Perform signedness DFA on the following program:

    def foo(int a) : int {
      int x;
      int y;
    
      x := 1;
      y := -1;
    
      while (x < a) {
        if (y < a) {
          x := x + 1;
          y := y - 1;
        }
        else {
          x := 1;
          y := -1;
        }
      }
    
      return y;
    }
    

    IR:

    def foo(int a) {
      [B0]
      x <- 1
      y <- -1
    
      [B1] WHILE_START_0:
      _tmp0 <- x < a
      jump_if_0 _tmp0 WHILE_END_0
    
      [B2]
      _tmp1 <- y < a
      jump_if_0 _tmp1 IF_FALSE_1
    
      [B3]
      x <- x + 1
      y <- y - 1
      jump IF_END_1
    
      [B4] IF_FALSE_1:
      x <- 1
      y <- -1
    
      [B5] IF_END_1:
      jump WHILE_START_0
    
      [B6] WHILE_END_0:
      return y
    }
    

    CFG:

    dfa-exercise-cfg.png

    Analysis:

    • IN0: [a = Any, x = 0, y = 0, _tmp0 = Uninit, _tmp1 = Uninit]
    • OUT0: [a = Any, x = +, y = -, _tmp0 = Uninit, _tmp1 = Uninit]
    • IN1: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT1: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN2: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT2: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN3: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT3: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN4: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT4: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN5: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT5: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • IN6: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]
    • OUT6: [a = Any, x = +, y = -, _tmp0 = Any, _tmp1 = Any]

5.6.3 Example: Available Expressions -> GCSE

A common compiler optimization is common subexpression elimination, i.e., trying to avoid recomputing expressions multiple times. This is the global version of the optimization we covered in the local optimization section.

Example

def foo(int a) {
    int x;
    int y;
    int z;

    x := 4 * a;

    if (0 < a) {
        a := a + 1;
        y := 4 * a;
    }

    z := 4 * a;
    return z;
}

Note that the expression 4 * a has been computed before we get to the assignment of z no matter whether we enter the true branch of the conditional or not. Also note that it has a different value depending on whether we entered the true branch or not.

Optimized program:

def foo(int a) {
    int x;
    int y;
    int z;
    int _tmp0;

    x := 4 * a;
    _tmp0 := x;

    if (0 < a) {
        a := a + 1;
        y := 4 * a;
        _tmp0 := y;
    }

    z := _tmp0;
    return z;
}

We have replaced the expression computation with a set of copies. We can break this process into two components:

  • DFA: available expressions analysis
  • optimization: GCSE
  1. Available expressions

    Definition: an expression exp is available at some program point pp if:

    • exp is computed along every path from the entry of the function to pp
    • the variables used in exp are not redefined after the last evaluation of exp along any path

    In order to define a DFA to compute available expressions, we need to define:

    • L, the dataflow facts we are computing
    • ⊓, the meet operator for L
    • the abstract transfer functions for L

    The facts we are computing are which expressions are available, so the elements of L will be sets of expressions. We interpret an element of L as "these are the expressions that are currently available".

    The meet operator tells us how to conservatively approximate two elements of L using a single element of L. Given two sets of expressions E1 and E2, we know that along one path the E1 expressions are available and along another path the E2 expressions are available. The only safe approximation of those two sets is E1 ∩ E2, i.e., those expressions that are available along both paths.

    What is the weakest thing we can say about which expressions are available, i.e., the safest claim that is guaranteed not to give incorrect information? The empty set: no expressions are available.

    What is the strongest thing we can say about which expressions are available? The set of all expressions contained in the function.

    We can define a single abstract transfer function and specialize it per basic block:

    \[ OUT_k = GEN_k \cup (IN_k - KILL_k) \]

    GENk is defined as the set of expressions computed in basic block k whose operand variables are not redefined later in that block.

    KILLk is defined as the set of expressions for which at least one variable is defined in basic block k.

    Let's look at an example:

    [B0]
    a <- b + c
    b <- a + c
    d <- b + d
    
    • GEN0 = { a+c }
    • KILL0 = any expression that uses 'a', 'b', or 'd'

    Note that B0 does not generate b+c or b+d because it defines b and d after those expressions are evaluated.

    So the formula \( OUT_k = GEN_k \cup (IN_k - KILL_k) \) says:

    • take all the expressions that are available coming into this block (i.e., INk)
    • remove those expressions that are killed by this block (i.e., subtract KILLk)
    • add back those expressions that are generated by this block (i.e., union GENk)

    Note that the order is important: if we union GENk and then subtract KILLk we may lose precision because we will miss some expressions that are available. See the 'a+c' example above.
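
    To see the difference concretely, here is a tiny Python sketch using block B0 above and one hypothetical incoming expression (the set contents are assumptions for illustration):

    # Block B0 above: GEN_0 = { a+c }, and KILL_0 contains every expression
    # that uses a, b, or d. Suppose b+c happens to be available on entry.
    IN_0   = {"b+c"}
    GEN_0  = {"a+c"}
    KILL_0 = {"a+c", "b+c", "b+d"}   # the killed expressions we care about here

    out_correct = GEN_0 | (IN_0 - KILL_0)   # {'a+c'}: kill first, then add GEN
    out_wrong   = (GEN_0 | IN_0) - KILL_0   # set():   subtracting last loses a+c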

    Quick exercise

    a <- b + c
    x <- y - z
    b <- x + y
    x <- a * b
    

    What are the GEN and KILL sets?

    Solution

    GEN = {y-z, a*b}

    KILL = { b+c, x+y } plus any other expressions in the function that use a, b, or x

    Now we have a complete available expressions DFA and we can use the DFA framework:

    1. Define the set of all expressions computed in the function; call this set E.
    2. For each basic block k, define GENk and KILLk by looking at the instructions in the basic block.
    3. Initialize IN0 to {}, i.e., there are no available expressions at the beginning of the function.
    4. Optimistically initialize all other IN and OUT sets to E (this is safe, because the analysis will remove expressions as necessary).
    5. Put all basic blocks onto the worklist.
    6. Apply the worklist algorithm until it is empty.
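
    Putting these pieces together, here is a sketch of available expressions as a worklist loop over sets; it is the same idea as the generic worklist sketch earlier, specialized to set intersection as the meet. The CFG, GEN, KILL, and E inputs are assumed to be given, and all names are illustrative.

    from collections import deque

    def available_expressions(blocks, preds, succs, GEN, KILL, E):
        # IN/OUT are sets of expressions; the entry's IN is {} and everything
        # else starts optimistically at E, the set of all expressions.
        IN  = {k: set(E) for k in blocks}
        OUT = {k: set(E) for k in blocks}
        entry = blocks[0]                 # assumed: first block is the entry
        IN[entry] = set()
        worklist = deque(blocks)
        while worklist:
            k = worklist.popleft()
            if preds[k]:                  # meet = set intersection
                IN[k] = set.intersection(*(OUT[p] for p in preds[k]))
            new_out = GEN[k] | (IN[k] - KILL[k])     # transfer function
            if new_out != OUT[k]:
                OUT[k] = new_out
                worklist.extend(succs[k])
        return IN, OUT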
    1. Example

      Source code

      def foo(int a, int b) : int { // returns a^b
        int x;
        int y;
        int z;
      
        x := 1;
        y := a;
      
        while (x != b) {
          z := x * 2;
          if (z <= b) {
            y := y * y;
            x := x * 2;
          }
          else {
            y := y * a;
            x := x + 1;
          }
        }
      
        return y;
      }
      

      IR:

      def foo(int a, int b) {
        [B0]
        x <- 1
        y <- a
      
        [B1] WHILE_START_0:
        _tmp0 <- x != b
        jump_if_0 _tmp0 WHILE_END_0
      
        [B2]
        z <- x * 2
        _tmp1 <- z <= b
        jump_if_0 _tmp1 IF_FALSE_1
      
        [B3]
        y <- y * y
        x <- x * 2
        jump IF_END_1
      
        [B4] IF_FALSE_1:
        y <- y * a
        x <- x + 1
      
        [B5] IF_END_1:
        jump WHILE_START_0
      
        [B6] WHILE_END_0:
        return y
      }
      
      avail-expr-example-cfg.png
      E = { x!=b, x*2, z<=b, y*y, y*a, x+1 }
      
       GEN_0 = {}
      KILL_0 = { x!=b, x*2, y*y, y*a, x+1 }
      
       GEN_1 = { x!=b }
      KILL_1 = {}
      
       GEN_2 = { x*2, z<=b }
      KILL_2 = { z<=b }
      
       GEN_3 = {}
      KILL_3 = { x!=b, x*2, y*y, y*a, x+1 }
      
       GEN_4 = {}
      KILL_4 = { x!=b, x*2, y*y, y*a, x+1 }
      
       GEN_5 = {}
      KILL_5 = {}
      
       GEN_6 = {}
      KILL_6 = {}
      

      Now we can initialize the IN and OUT sets for each basic block, insert all basic blocks into the worklist, and apply the worklist algorithm. The final solution is:

       IN_0 = {}
      OUT_0 = {}
      
       IN_1 = {}
      OUT_1 = { x!=b }
      
       IN_2 = { x!=b }
      OUT_2 = { x!=b, x*2, z<=b }
      
       IN_3 = { x!=b, x*2, z<=b }
      OUT_3 = { z<=b }
      
       IN_4 = { x!=b, x*2, z<=b }
      OUT_4 = { z<=b }
      
       IN_5 = { z<=b }
      OUT_5 = { z<=b }
      
       IN_6 = { x!=b }
      OUT_6 = { x!=b }
      
    2. Exercise

      Perform the available expressions analysis on the following program:

      def foo(int a, int b) : int {
        int x;
        int y;
      
        x := a + b;
        y := a * b;
      
        while (a + b < y) {
          a := a + 1;
          x := a + b;
        }
      
        return x;
      }
      

      SOLUTION

      IR:

      def foo(int a, int b) {
        [B0]
        x <- a + b
        y <- a * b
      
        [B1] WHILE_START_0:
        _tmp0 <- a + b
        _tmp1 <- _tmp0 < y
        jump_if_0 _tmp1 WHILE_END_0
      
        [B2]
        a <- a + 1
        x <- a + b
        jump WHILE_START_0
      
        [B3] WHILE_END_0:
        return x
      }
      

      Analysis results:

      E = { a+b, a*b, _tmp0<y, a+1 }
      
       GEN_0 = { a+b, a*b }
      KILL_0 = { _tmp0<y }
      
       GEN_1 = { a+b, _tmp0<y }
      KILL_1 = { _tmp0<y }
      
       GEN_2 = { a+b }
      KILL_2 = { a+b, a*b, a+1 }
      
       GEN_3 = {}
      KILL_3 = {}
      
      ---
      
       IN_0 = {}
      OUT_0 = { a+b, a*b }
      
       IN_1 = { a+b }
      OUT_1 = { a+b, _tmp0<y }
      
       IN_2 = { a+b, _tmp0<y }
      OUT_2 = { a+b, _tmp0<y }
      
       IN_3 = { a+b, _tmp0<y }
      OUT_3 = { a+b, _tmp0<y }
      
  2. GCSE

    Once we have the available expressions info, how do we use it to eliminate common subexpressions? There are a variety of schemes with various tradeoffs. We will look at a simple scheme that works, but requires other optimization passes to clean up after it.

    1. for each expression in E, create a unique variable name (_opt0, _opt1, etc.).
    2. for each definition var1 <- op1 OP op2, look up the unique name for expression op1 OP op2 (say _optX) and insert an assignment immediately afterwards _optX <- var1.
    3. for each use of expression op1 OP op2, see if it is available at that point (i.e., it is available at the beginning of the basic block and is not killed before the use). If so, replace it with _optX.

    We end up with a bunch of extraneous copies, but they can be cleaned up by another pass like copy propagation and dead code elimination.
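
    Here is a simplified Python sketch of this rewrite scheme. It assumes blocks is a dict from block id to a list of binary instructions represented as (dest, op, a, b) tuples, and that IN comes from the available expressions analysis; the representation and names are illustrative, not the course IR.

    def gcse(blocks, IN):
        # step 1: a unique _optX name per expression appearing in the function
        E = {(op, a, b) for insts in blocks.values() for (_, op, a, b) in insts}
        name = {e: "_opt%d" % i for i, e in enumerate(sorted(E))}

        rewritten = {}
        for k, insts in blocks.items():
            avail = set(IN[k])    # available on entry, shrinking as vars are killed
            new = []
            for dest, op, a, b in insts:
                e = (op, a, b)
                # step 3: if the expression is still available here, copy from _optX
                if e in avail:
                    new.append((dest, "copy", name[e], None))
                else:
                    new.append((dest, op, a, b))
                # step 2: record the just-computed value in _optX
                new.append((name[e], "copy", dest, None))
                # dest was just redefined, so expressions using it are no longer available
                avail = {x for x in avail if dest not in (x[1], x[2])}
            rewritten[k] = new
        return rewritten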

    1. Example
      def foo(int a, int b) {
        [B0]
        x <- 1
        y <- a
      
        [B1] WHILE_START_0:
        _tmp0 <- x != b
        jump_if_0 _tmp0 WHILE_END_0
      
        [B2]
        z <- x * 2
        _tmp1 <- z <= b
        jump_if_0 _tmp1 IF_FALSE_1
      
        [B3]
        y <- y * y
        x <- x * 2
        jump IF_END_1
      
        [B4] IF_FALSE_1:
        y <- y * a
        x <- x + 1
      
        [B5] IF_END_1:
        jump WHILE_START_0
      
        [B6] WHILE_END_0:
        return y
      }
      

      Analysis results:

      E = { x!=b, x*2, z<=b, y*y, y*a, x+1 }
      
       IN_0 = {}
      OUT_0 = {}
      
       IN_1 = {}
      OUT_1 = { x!=b }
      
       IN_2 = { x!=b }
      OUT_2 = { x!=b, x*2, z<=b }
      
       IN_3 = { x!=b, x*2, z<=b }
      OUT_3 = { z<=b }
      
       IN_4 = { x!=b, x*2, z<=b }
      OUT_4 = { z<=b }
      
       IN_5 = { z<=b }
      OUT_5 = { z<=b }
      
       IN_6 = { x!=b }
      OUT_6 = { x!=b }
      

      Transformation steps:

      STEP 1

      _opt0 = x != b
      _opt1 = x * 2
      _opt2 = z <= b
      _opt3 = y * y
      _opt4 = y * a
      _opt5 = x + 1
      

      STEP 2

      def foo(int a, int b) {
        [B0]
        x <- 1
        y <- a
      
        [B1] WHILE_START_0:
        _tmp0 <- x != b
        _opt0 <- _tmp0
        jump_if_0 _tmp0 WHILE_END_0
      
        [B2]
        z <- x * 2
        _opt1 <- z
        _tmp1 <- z <= b
        _opt2 <- _tmp1
        jump_if_0 _tmp1 IF_FALSE_1
      
        [B3]
        y <- y * y
        _opt3 <- y
        x <- x * 2
        _opt1 <- x
        jump IF_END_1
      
        [B4] IF_FALSE_1:
        y <- y * a
        _opt4 <- y
        x <- x + 1
        _opt5 <- x
      
        [B5] IF_END_1:
        jump WHILE_START_0
      
        [B6] WHILE_END_0:
        return y
      }
      

      STEP 3

      def foo(int a, int b) {
        [B0]
        x <- 1
        y <- a
      
        [B1] WHILE_START_0:
        _tmp0 <- x != b
        _opt0 <- _tmp0
        jump_if_0 _tmp0 WHILE_END_0
      
        [B2]
        z <- x * 2
        _opt1 <- z
        _tmp1 <- z <= b
        _opt2 <- _tmp1
        jump_if_0 _tmp1 IF_FALSE_1
      
        [B3]
        y <- y * y
        _opt3 <- y
        x <- _opt1
        _opt1 <- x
        jump IF_END_1
      
        [B4] IF_FALSE_1:
        y <- y * a
        _opt4 <- y
        x <- x + 1
        _opt5 <- x
      
        [B5] IF_END_1:
        jump WHILE_START_0
      
        [B6] WHILE_END_0:
        return y
      }
      
    2. Exercise

      IR:

      def foo(int a, int b) {
        [B0]
        x <- a + b
        y <- a * b
      
        [B1] WHILE_START_0:
        _tmp0 <- a + b
        _tmp1 <- _tmp0 < y
        jump_if_0 _tmp1 WHILE_END_0
      
        [B2]
        a <- a + 1
        x <- a + b
        jump WHILE_START_0
      
        [B3] WHILE_END_0:
        return x
      }
      

      Analysis results:

      E = { a+b, a*b, _tmp0<y, a+1 }
      
       IN_0 = {}
      OUT_0 = { a+b, a*b }
      
       IN_1 = { a+b }
      OUT_1 = { a+b, _tmp0<y }
      
       IN_2 = { a+b, _tmp0<y }
      OUT_2 = { a+b, _tmp0<y }
      
       IN_3 = { a+b, _tmp0<y }
      OUT_3 = { a+b, _tmp0<y }
      

      Given the above information, optimize the given program.

      SOLUTION

      STEP 1

      _opt0 = a + b
      _opt1 = a * b
      _opt2 = _tmp0 < y
      _opt3 = a + 1
      

      STEP 2

      def foo(int a, int b) {
        [B0]
        x <- a + b
        _opt0 <- x
        y <- a * b
        _opt1 <- y
      
        [B1] WHILE_START_0:
        _tmp0 <- a + b
        _opt0 <- _tmp0
        _tmp1 <- _tmp0 < y
        _opt2 <- _tmp1
        jump_if_0 _tmp1 WHILE_END_0
      
        [B2]
        a <- a + 1
        _opt3 <- a
        x <- a + b
        _opt0 <- x
        jump WHILE_START_0
      
        [B3] WHILE_END_0:
        return x
      }
      

      STEP 3

      def foo(int a, int b) {
        [B0]
        x <- a + b
        _opt0 <- x
        y <- a * b
        _opt1 <- y
      
        [B1] WHILE_START_0:
        _tmp0 <- _opt0
        _opt0 <- _tmp0
        _tmp1 <- _tmp0 < y
        _opt2 <- _tmp1
        jump_if_0 _tmp1 WHILE_END_0
      
        [B2]
        a <- a + 1
        _opt3 <- a
        x <- a + b
        _opt0 <- x
        jump WHILE_START_0
      
        [B3] WHILE_END_0:
        return x
      }
      

5.6.4 The DFA framework

Now that we've seen some examples, let's look a little into the DFA framework and why it works. The theory behind DFA uses a branch of discrete math called Order Theory. Prof. Hardekopf teaches a graduate-level course on Program Analysis that goes into this topic in detail if you are interested, but we will just skim the surface.

There are two main questions:

  1. How do we know that the analysis we have defined is giving us a safe answer?
  2. How do we know that it is decidable, i.e., that it is guaranteed to terminate?

The first question is outside the scope of this class. The second one we can at least get a conceptual idea of why it works. Recall that we use a worklist algorithm to compute the DFA: We pull out a basic block, process it, and if its output changes we put its successors back onto the worklist. This algorithm proceeds until the worklist is empty. So how do we know that it will, eventually, become empty? The DFA framework gives us some requirements that, if they are satisfied, will guarantee termination:

  1. L contains an element that over-approximates all other elements; this is the safest (and least precise) analysis solution.
    • for signedness and constant propagation, that element is Any
    • for available expressions, that element is {}
  2. the ⊓ operator is always defined for any two elements of L a and b, and will give us the element of L that most closely approximates both a and b.
  3. from any given element of L, there are only a finite number of elements that over-approximate it until we get to the "maximal", safest element.
    • This is where constant propagation would break if we used elements of L like ranges or sets of numbers. Since there are infinite integers, there are infinite ranges/sets that over-approximate any given solution. There are ways to guarantee termination in this case using other operations that approximate ⊓.
  4. The abstract transfer functions are monotone. A function is monotone if, whenever x ≤ y, f(x) ≤ f(y). In other words, the function preserves the order of the elements.
    1. f(x) = x + 1 is monotone
    2. f(x) = -x is not monotone

These, taken together, guarantee termination. Why? Recall that each time we process a basic block in the worklist algorithm, we (1) take the meet of the input solutions; then (2) apply the abstract transfer function; then (3) if the output solution changed, add the successor basic blocks to the worklist.

The first three requirements mean that, for a given basic block, the meet of its input solutions can only get less precise (never more precise) and we can only take the meet of its input solutions a finite number of times before we get a stable input that never changes (the safest, least precise solution in the worst case, but something more precise than the safest solution if we are lucky).

The last requirement means that, if we are getting an input that is less precise, then we cannot get an output solution that is more precise. That is, since the input solution can only get less precise, the output solution can only get less precise.

But we know that there is a limit to how imprecise we can get (the "maximal" solution) and we will reach it in a finite number of steps. Therefore, eventually we are guaranteed that at some point we will process a basic block and its output won't change. But that means that we won't put any successors on the worklist. This reasoning is true for all basic blocks, and therefore eventually we are guaranteed that no basic block will put any successors on the worklist. Therefore, the worklist will eventually become empty.

The reasoning I've given you here is rather vague and hand-wavy; the technical description and proof require a lot more math that we won't get into. But hopefully this conveys the basic idea of why we want to define a DFA the way that we do.

5.7 Order of optimizations

The order of optimizations can make a big difference in how effective they are (and also how many times we apply them).

We saw for GCSE that our optimization left a lot of copy assignments and dead variables, which we said could be cleaned up by other optimizations (like copy propagation and dead code elimination). It turns out that many optimizations modify the code in such a way that they enable other optimizations, and this can even be cyclic (A enables B which enables A again).

So what is the best order? This is not an easy question to answer, and it heavily depends on the code being optimized. The optimization levels in a compiler such as gcc or clang (-O1, -O2, -O3, etc.) are just convenient flags that specify a series of optimizations determined by the compiler developers to "usually" do a pretty good job. You (the compiler user) can actually specify the exact set of optimizations that you want to apply and what order you want to apply them in, and if you do it right you can often get even faster code than by using the default compiler flags. However, figuring out what optimizations to use in what order can take a lot of trial and error, and you will never know whether you found the best possible ordering. In fact, there has been some research on using several methods (such as mathematical optimization algorithms) to find a good sequence of compiler optimizations for different kinds of programs.

6 Runtime support: Garbage collection

We often want to have data structures in our programs rather than mere integers, so we can represent structured data such as records, trees, stacks, and graphs. Also, we have been relying on the stack to handle memory management, but that limits the lifetime of the objects in our language: an object is deallocated once the scope it was created in ends. Sometimes we want our data structures to outlive the scope they are created in (for example, if we add a new node to a graph by calling a function createNode(Graph g), we'd like that node to survive after createNode returns). Dynamic memory allocation allows us to solve this problem: objects allocated on the heap are not bound to the scope they are created in. However, we now need to decide when to destroy these objects. There are several options, such as manual memory management (à la C or C++), having a type system that reasons about memory (à la Rust, ATS), and having a garbage collector that destroys objects the program can no longer access (à la Python, Swift, Java). We will implement the last strategy.

So, let's extend \(C\flat\) with these two concepts to create our next language: L2. We are going to limit L2 in certain ways to make our garbage collection scheme simpler:

  • We will allow variables to be declared only at the beginning of a function; this will make analyzing stack frames easier.
  • All fields and variables are either integers or pointers; this ensures that each variable and field has the same size (a machine word, 4 bytes).

6.2 New frontend

There is nothing novel for lexing and parsing; they are a small extension of what we've already done.

AST validation is more interesting, specifically typechecking now that we have more than one type. We won't talk about it in detail for now (we will just assume correct programs), but essentially we need to verify:

  • each struct field access is to a struct type that actually contains that field
  • no pointer arithmetic in arithmetic expressions
  • the loop expressions produce integers

Similar to C♭, we will do these checks in the code generator, and we will defer some of them to the type checker which will be covered later in class.

6.3 New codegen

For codegen we need to handle the following:

  • struct allocation on the heap
  • struct field access in an expression
  • struct field access on the left-hand side of an assignment

We also need to ensure that codegen makes runtime memory management possible.

6.3.1 Managing memory: heap vs stack

For C♭ we didn't worry about memory management because the stack took care of it for us. When we enter a new scope the declared variables are allocated space by pushing them on the stack; when we leave that scope the space is deallocated by popping them off the stack. However, this means that the lifetimes of the variables are dictated by the scope they are declared in; once we leave that scope the variables disappear.

If we want objects that will live beyond the scope they are created in, we need the heap. That's the main difference between the stack and the heap: heap lifetimes are not bound by scope. But if we don't deallocate heap objects when we leave the scope, when do we do it? That is the subject of memory management, and there are a number of different possible answers with their own tradeoffs. We will discuss the subject thoroughly in a later lecture.

For L2 specifically, we will rely on garbage collection to automatically deallocate structs when it's safe, but to do that we will need to change the codegen in our compiler to make it feasible to perform GC at runtime. We will discuss the requirements as they become relevant during codegen.

6.3.2 Symbol table

We need to extend the symbol table in two ways:

  • map typename to struct info (field names, types, and offsets)
  • variable info needs to include type info

Other than adding the struct info first thing, we can handle the symbol table the same way we did before (just remembering to add the type info from the variable declarations alongside the memory locations).

6.3.3 Struct allocation

Let's examine a concrete example from the handout:

x := new %tree;

where %tree is defined as:

struct %tree {
  int value;
  %tree left;
  %tree right;
};

The runtime memory manager is going to handle the heap, so we will delegate memory allocation to it by making a call to a predefined function allocate, using the calling convention we discussed for L1. The allocate function needs to know how many words to allocate, which we can determine by looking up <typename> in the symbol table to get the corresponding struct info. Our contract is that when we call allocate with a number of words (one per field), it will allocate sufficient space on the heap to hold those words and return the starting address of the newly allocated space, which we then store into the left-hand side of the assignment.

For our example, we would call allocate(3) and get back an address (say 0x100) at the beginning of 3 words of heap memory:

Lower ---------------> Higher

|   |   |xxx|xxx|xxx|   |   |
   0x100^

and we would write the value 0x100 into 'x'. We then need to generate code to initialize the values of the fields, which as in C♭ are always initialized to 0 (which is also the value of nil for references):

|   |   | 0 | 0 | 0 |   |   |
   0x100^

However, the runtime memory manager will need more information about the struct than its size in order to handle it correctly, namely it needs to know which of its fields are pointers to other structs. This is information that we have at compile-time but we need it at runtime (the reason we need it will become clear when we cover memory management schemes). To communicate this information, allocate will actually create space for <num_words>+1 words of memory, with the extra word being before the returned address:

|   |xxx|xxx|xxx|xxx|   |   |
   0x100^

This header word is going to contain the following information:

  1. the first byte, bits 0–7, holds the number of fields in the struct
  2. bits 8–30 are a bit vector s.t. a bit is set iff the corresponding field in the struct is a reference.
  3. the last bit, bit 31, is set to 1 (for reasons that, again, will become apparent when we discuss memory management).

The compiler must generate code to fill in that information. For our example, the result of the generated code should be:

|   |00000011011000000000000000000001| 0 | 0 | 0 |   |   |
                                0x100^

That is, there are three fields and the last two are pointers. This scheme means that, given the address of a struct, the runtime can always read the word at address-4 to get the size of the struct and which fields are pointers.

Notice that this scheme limits the number of fields a struct can have: at most 23, the size of the bitvector. We are using this scheme because it is relatively simple, but constraining the number of fields is a trade-off. We could use a more elaborate scheme to remove this restriction. For example, we could map each typename to an index, copy the symbol table to the static memory segment indexed by typename, and have the header word contain the appropriate typename index instead of the struct size and a bitvector. Then the runtime can read the header word and use that index to look up the information in the static memory's symbol table. That scheme is more flexible, but more complex and expensive. Which one to use is a matter of language design.

To recap, when we generate code for <access> := new <typename>, we:

  1. generate a call to function allocate passing an argument that is the number of fields in <typename>.
  2. write the return value to the left-hand side of the assignment.
  3. for each field of the struct being allocated, store a 0 to the corresponding offset from the returned address.
  4. compute the header word and write it to a -4 offset from the returned address.
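
As an illustration, here is a small Python sketch that computes the header word from a list of flags saying which fields are pointers. It follows the three-part layout described above; mapping field i to bit 8+i (lowest bit for the first field) is an assumption for illustration.

def header_word(field_is_pointer):
    # Bits 0-7: number of fields; bits 8-30: one bit per field, set iff that
    # field is a reference; bit 31: always set to 1.
    assert len(field_is_pointer) <= 23, "at most 23 fields fit in the bit vector"
    word = len(field_is_pointer) & 0xFF
    for i, is_ptr in enumerate(field_is_pointer):
        if is_ptr:
            word |= 1 << (8 + i)
    word |= 1 << 31
    return word

# %tree { int value; %tree left; %tree right; }: 3 fields, last two are pointers.
print(hex(header_word([False, True, True])))   # 0x80000603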

6.3.4 Field access in expression

Example 1:

struct %foo {
    int a;
    int b;
};

%foo x;
x := new %foo;
output x.b;

Let's focus on the output expression. x contains the address of a foo struct object in memory, which has two integer fields.

In order to retrieve the correct value, we need to dereference x to get the address of the struct object in the heap, compute the offset into that struct object of field b, then read the value at that address from memory.

AST: [access x b]

Note: for convenience, I'm going to be more efficient about the generated code rather than strictly following our naive codegen algorithm.

ld [FR-4] RR  ; get the address stored in 'x'
ld [RR+4] RR  ; get the value at that address + 4, because b is at a 4-byte offset from the beginning of the struct

How did we know to use offset 4? Because we look up x in the symbol table to get its type foo, then look up foo to get the information about the field offsets. Remember that the extra header word is immediately before the returned address, so offset 0 is the first field of the struct.

Example 2:

struct %foo {
    int a;
    %bar b;
};

struct %bar {
    int c;
    int d;
    %baz e;
};

struct %baz {
    int f;
    int g;
};

%foo x;
x := new %foo;
x.b := new %bar;
x.b.e := new %baz;
output x.b.e.f;

Let's focus again on the output expression. x, x.b, and x.b.e will all contain pointers into the heap; we are going to do the same thing as before except chained in a row.

AST: [access [access [access x b] e] f]

When we recurse through the AST for codegen, we will end up generating code for [access x b] first, then [access result e], then [access result f]:

ld [FR-4] RR  ; get the address stored in 'x'
ld [RR+4] RR  ; get the address stored in 'x.b' at offset 4
ld [RR+8] RR  ; get the address stored in 'x.b.e' at offset 8
ld [RR] RR    ; get the int stored in 'x.b.e.f' at offset 0

Again, we look in the symbol table to compute the offsets. Note that everything starts with looking up the type of x to see that it's a foo, then looking at foo to see that the b field is at offset 4 and type bar, then looking at bar to see the e field is at offset 8 and type baz, then looking at baz to see the f field is at offset 0.
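
To make the offset computation concrete, here is a small Python sketch that walks an access path through a toy version of the struct information in the symbol table; the table layout and names are assumptions for illustration.

# Toy struct table: typename -> ordered list of (field name, field type),
# where a field type is either "int" or another struct's typename.
# Each field is one 4-byte word, so field i lives at byte offset 4*i.
structs = {
    "%foo": [("a", "int"), ("b", "%bar")],
    "%bar": [("c", "int"), ("d", "int"), ("e", "%baz")],
    "%baz": [("f", "int"), ("g", "int")],
}

def access_offsets(base_type, fields):
    # Return the list of byte offsets to dereference for an access path.
    offsets, ty = [], base_type
    for f in fields:
        names = [name for name, _ in structs[ty]]
        i = names.index(f)
        offsets.append(4 * i)
        ty = structs[ty][i][1]      # follow the field's type for the next step
    return offsets

# x.b.e.f with x of type %foo: offsets 4, 8, 0, as in the generated code above.
print(access_offsets("%foo", ["b", "e", "f"]))   # [4, 8, 0]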

6.3.5 Field access on lhs of assignment

When the access path is on the left-hand side of an assignment we need to treat it differently. This is actually the same as when it's just a variable as in C♭, but we didn't highlight the difference. Let's look at it now, then generalize to arbitrary access paths.

Example:

x := x + 1;

When we evaluate the x in x + 1 we want to find the address of x in memory and retrieve the value stored there:

ld [FR-4] RR
add 1 RR

But when we look at x on the left-hand side we want just the address of x, which is where we will put the final value:

store RR [FR-4]

We say that x in the right-hand side expression is an rvalue and that x on the left-hand side is an lvalue. Basically, for an lvalue we want to stop at the address so we can store a value into it, while for an rvalue we go ahead and dereference that address to get the value currently stored there.

Now let's look at some non-trivial access paths:

struct %foo {
    int a;
    int b;
    int c;
};

%foo x;
x := new %foo;
x.b := x.c;

As usual, we want to evaluate the right-hand side to a value, then store it into the location specified by the left-hand side:

ld [FR-4] RR  ; get the address stored in 'x'
ld [RR+8] RR  ; get the int stored at offset 8
ld [FR-4] OR  ; get the address stored in 'x'
store RR [OR+4] ; store the right-hand side value into 'x.b' at offset 4

What if we have an lvalue access path with multiple fields?

struct %foo {
    int a;
    %bar b;
    int c;
};

struct %bar {
    int d;
    int e;
};

%foo x;
x := new %foo;
x.b := new %bar;
x.b.d := x.c;

Then we follow the access path as normal, except that we stop right before dereferencing the last field:

ld [FR-4] RR  ; get the address stored in 'x'
ld [RR+8] RR  ; get the int stored at offset 8
ld [FR-4] OR  ; get the address stored in 'x'
ld [OR+4] OR  ; get the address stored at offset 4
store RR [OR]   ; store the right-hand value into 'x.b.d' at offset 0

6.3.6 Enable collecting the root set for GC

The last piece we need to worry about is how codegen will interact with our runtime memory management. We already handled one aspect: when we allocate a struct, the codegen writes the necessary information about the struct fields into the header word. But there's one more thing to do.

For our L2 memory management we are going to use a scheme called a tracing garbage collector. We will look into what that actually means in a different lecture, but for now just accept the following: for that scheme to work, the GC needs to be able to look at the function stack at any point during execution and identify which memory locations in the stack represent pointers to structs in the heap.

This is more difficult than it might sound. Remember that the stack is constantly growing and shrinking, and the same memory location may represent a pointer or not depending on when exactly during program execution we look at it. How can we implement codegen in order to enable the runtime GC to collect this information?

There are many possible strategies with different tradeoffs between complexity, flexibility, and efficiency. I'm going to describe one strategy that, going along with our usual mantra of "make it as simple as possible as long as it works", focuses on being easy to implement at the expense of some flexibility.

Key to this strategy is that the only heap pointers on the stack will be either function parameters or function local variables. Here's the idea:

  1. move all nested variable declarations up to the top of the function they're in, renaming them as necessary to avoid name clashes. This transformation expands the amount of stack memory the function will consume, but otherwise doesn't change the program behavior (as long as we remember to initialize the values to 0 at the same points as before).
  2. modify the prologue instruction sequence for function codegen, specifically immediately after we push the old frame pointer value onto the stack: push two additional words onto the stack after that but before allocating stack space for local variables. Remember that we need to adjust the offsets of the local variables accordingly in the symbol table. Call these two words the "argument info word" and the "local info word".
  3. set the argument info word to a bit vector s.t. a bit is set iff the corresponding function parameter is a struct pointer. Remember that the function arguments are pushed onto the stack by the pre-call instruction sequence, so by looking at the argument info word we can tell which positive offsets from the frame pointer correspond to a pointer argument.
  4. set the local info word to a bit vector s.t. a bit is set iff the corresponding local variable is a struct pointer. Remember that the function locals are allocated on the stack by the prologue instruction sequence immediately after the local info word, so by looking at the local info word we can tell which negative offsets from the frame pointer correspond to a pointer local.

So, how would this work at runtime to identify all heap pointers on the stack? Recall that the current frame pointer always points to the saved old frame pointer of the caller function (this is enforced by our calling convention). At the point when we need to compute this information:

  1. get the current frame pointer.
  2. read the argument info word (at frame pointer - 4); for each set bit representing offset X, return the memory address 'frame pointer + X'.
  3. read the local info word (at frame pointer - 8); for each set bit representing offset X, return the memory address 'frame pointer - X'.
  4. get the value pointed to by the current frame pointer, which is the value of the old frame pointer.
  5. set the current frame pointer value to that old value, then go to 2.

Repeat these steps until we have walked the entire stack and looked at all of the stack frames. Then we have collected the locations of all possible pointers into the heap.
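
Here is a Python sketch of this stack walk, simulating memory as a dictionary of word addresses. The exact mapping from bit i to a stack slot (argument i at frame pointer + 8 + 4*i, local i at frame pointer - 12 - 4*i) is an assumption for illustration, consistent with the layout used in the example below.

def collect_roots(mem, fp, stack_base):
    # mem: word address -> value; fp: current frame pointer;
    # stack_base: the frame pointer value marking the bottom of the stack.
    roots = []
    while fp != stack_base:
        arg_info   = mem[fp - 4]    # bit i set => argument i is a pointer
        local_info = mem[fp - 8]    # bit i set => local i is a pointer
        for i in range(32):         # up to 32 parameters/locals in this sketch
            if arg_info & (1 << i):
                roots.append(fp + 8 + 4 * i)    # arguments start at fp+8
            if local_info & (1 << i):
                roots.append(fp - 12 - 4 * i)   # locals start at fp-12
        fp = mem[fp]                # follow the saved (old) frame pointer
    return roots

For a memory dictionary populated like the stack in the example below, collect_roots(mem, 0x700, 0x1000) would return the five locations 0x708, 0x6F4, 0x808, 0x7F4, and 0x8F4, matching the addresses computed in the walk steps of the example.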

  1. example

    PROGRAM:

    struct %foo { int a; };
    
    def bar(%foo c) : int {
        %foo d;
        int e;
    
        d := new %foo;
        if (0 < c.a) { d.a := c.a - 1; e := bar(d); }
    
        return e;
    }
    
    %foo x;
    int y;
    
    x := new %foo;
    x.a := 10;
    y := bar(x);
    
    output y;
    
    

    Suppose we execute this program, and after a few function calls decide to find all pointers on the stack. Assume there are no caller- or callee-save registers for convenience.

    STACK:

    ADDRESS | CONTENTS
    --------+------------------------------------------------------------
    0x900:  | 0x1000 ; old frame pointer
            | 0x0    ; argument info word: there are no pointer arguments
            | 0x1    ; local info word: 'x' is a pointer
            | 0xab0  ; the value of 'x', a pointer into the heap
            | 0x0    ; the value of 'y', an int
            | 0xab0  ; the call argument, i.e., 'x'
            | 0xd0f  ; the return address
    0x800:  | 0x900  ; old frame pointer
            | 0x1    ; argument info word: there is 1 pointer argument
            | 0x01   ; local info word: there is 1 pointer local
            | 0xac0  ; the value of 'd'
            | 0x0    ; the value of 'e'
            | 0xac0  ; the call argument, i.e., 'd'
            | 0xdef  ; the return address
    0x700:  | 0x800  ; old frame pointer
            | 0x1    ; argument info word: there is 1 pointer argument
            | 0x01   ; local info word: there is 1 pointer local
            | 0xad0  ; the value of 'd'
            | 0x0    ; the value of 'e'
    
    

    This is the stack after the first recursive call to 'bar'. The current frame pointer is 0x700, which points to the most recent 'old frame pointer' on the stack.

    1. go to 0x700-4 to get the argument info word
    2. there is 1 set bit: return 0x700+8 as the memory location of a pointer
    3. go to 0x700-8 to get the local info word
    4. there is 1 set bit: return 0x700-12 as the memory location of a pointer
    5. set the current frame pointer as 0x800 (the address of the next most recent frame pointer)
    6. go to 0x800-4 to get the argument info word
    7. there is 1 set bit: return 0x800+8 as the memory location of a pointer
    8. go to 0x800-8 to get the local info word
    9. there is 1 set bit: return 0x800-12 as the memory location of a pointer
    10. set the current frame pointer as 0x900 (the address of the next next most recent frame pointer)
    11. go to 0x900-4 to get the argument info word
    12. there are no pointer arguments
    13. go to 0x900-8 to get the local info word
    14. there is 1 set bit: return 0x900-12 as the memory location of a pointer
    15. set the current frame pointer as 0x1000 (the address of the next next next most recent frame pointer)
    16. say that 0x1000 is the base of the stack, so we're done.

6.3.7 Exercise

Generate code for the following program (again assuming no caller- or callee-save registers):

struct %foo {
  int a;
  %bar b;
};

struct %bar {
  int c;
  int d;
};

def fun(%foo p, int q, %bar r) : int {
  %foo s;
  s := p;
  return s.a;
}

%foo x;
int y;

x := new %foo;
x.b := new %bar;
y := fun(x, 2, x.b);
output y;

SOLUTION

; entry prologue
push FR
mov SR FR
push 0b0      ; argument info: no pointer arguments
push 0b1      ; local info: x is a pointer
sub 8 SR      ; allocate stack space for x and y
store 0 [FR-12] ; initialize x to nil (locals start below the two info words)
store 0 [FR-16] ; initialize y to 0

; x := new foo
push 2        ; argument to 'allocate': 2 words
call allocate
add 4 SR      ; deallocate stack space for argument
store RR [FR-12] ; write return value to 'x'
store 0 [RR]    ; initialize x.a to 0
store 0 [RR+4]  ; initialize x.b to nil
store 0b00000010010...01 [RR-4] ; initialize header word

; x.b := new bar
push 2        ; argument to 'allocate': 2 words
call allocate
add 4 SR      ; deallocate stack space for argument
ld [FR-12] OR ; get address of struct in heap from 'x'
store RR [OR+4] ; write return value to 'x.b'
store 0 [RR]    ; initialize x.b.c to 0
store 0 [RR+4]  ; initialize x.b.d to 0
store 0b00000010000...01 [RR-4] ; initialize header word

; y := fun(x, 2, x.b)
ld [FR-12] RR ; address stored in 'x'
ld [RR+4] RR  ; address stored in 'x.b'
push RR       ; push 'x.b' argument to 'fun'
push 2        ; push '2' argument to 'fun'
ld [FR-12] RR ; address stored in 'x'
push RR       ; push 'x' argument to 'fun'
call FUN
add 12 SR     ; deallocate stack space for arguments
store RR [FR-16] ; write return value to 'y'

; output y
ld [FR-16] RR

; entry epilogue
mov FR SR
pop FR
ret

; FUN prologue
FUN:
push FR
mov SR FR
push 0b101    ; argument info: first and third arguments are pointers
push 0b1      ; local info: one local pointer
sub 4 SR      ; allocate stack space for 's'
store 0 [FR-12] ; initialize 's' to nil (below the two info words)

; s := p
ld [FR+8] RR  ; get value of 'p' param (the first argument, pushed last)
store RR [FR-12] ; put it in 's'

; return s.a
ld [FR-12] RR ; get value of 's'
ld [RR] RR    ; get value of 's.a'

; FUN epilogue
mov FR SR
pop FR
ret

6.4 Memory management schemes

See memory management slides on Slack

6.5 Further Reading

  • The e-book Crafting Interpreters has a great chapter on garbage collection that showcases how to implement mark-and-sweep algorithm. Beware that the language used in the book has plenty of features we do not deal with (such as closures), and it uses a virtual machine/interpreter which introduces interesting trade-offs for garbage collection.
  • If you are familiar with some techniques not covered in this class such as continuation-passing style, Henry Baker's paper Cheney on the MTA presents an interesting implementation for garbage collection where everything is allocated on the stack and the stack is garbage-collected.

Footnotes:

1
The former allows private repositories at least when you join the UCSB GitHub organization.
2
We are going to go over what these stages are in later lectures
3
\(\mathit{whileLoop} ::= \texttt{while } \mathit{rexp} \texttt{ \{ } \mathit{block} \texttt{ \}} \) is much more readable than \( \mathit{whileLoop} ::= \texttt{< WHILE > } \mathit{rexp} \texttt{ < LBRACE > } \mathit{block} \texttt{ < RBRACE >} \)
4
we will use these numbers in our derivations to show which rules are applied when
5
For example, the grammar for the "dangling else" problem above is ambiguous but there are unambiguous grammars for that language.
6
such as the concrete grammar given in the handout, which is definitely ambiguous as it contains the part of the arithmetic expressions above which we showed is ambiguous
7
The technique in this grammar is what C uses to resolve this ambiguity.
8
LR stands for Left-to-right, Rightmost derivation in reverse
9
LL stands for Left-to-right, Leftmost derivation \( k \) is called lookahead and determines how much peeking ahead the parser needs
10
There is only one part of the grammar that is not LL(1) and it turns out to be LL(2) so our parser is still an LL(k) parser
11
Every time we recursively call a function we get a new instance of all the parameters and local variables. This happens because the parameters and locals are stored on the call stack; whenever a function is called a new "stack frame" is pushed onto the stack; whenever a function returns its stack frame is popped off of the stack.
12
Like \( E ::= E + E | x \) for example.
13
For example, by specifying it along with the grammar and handling precedence separately in the parser. One example of this is Shunting Yard Algorithm.
14
otherwise it would be non-productive as it cannot terminate
15
in our case, 1 token–except for when disambiguating assignments and function calls in C♭.
16
See the top-down parsing chapter in Cooper & Torczon or slide 3 of this slide deck.
17
See the same resources as the FIRST sets for the algorithm, if you are curious about it.
18
The additional information in the parse tree may be useful for other applications such as code formatting tools that need to keep track of the parentheses and other parsing-related information.
19
accessing the wrong segment; although virtual memory is vast these days and segments are more interleaved, the term has carried on and this simple mental image is useful
20
Common Language Runtime, the virtual machine that C# and other languages on .Net platform compile to.
21
When we were talking about optimization scope and said that local optimizations operate on "small fragments of code", what we really meant was basic blocks.

Author: Mehmet Emre

Created: 2020-11-25 Wed 15:20
