CS170 -- Autograding and Testing

this page last updated: Tue 14 Sep 2021 09:02:33 AM PDT

Gradescope and Autograding

One of the "features" of operating system development that distinguishes it from other forms of software development is that it is impossible to determine definitively that your operating system will work correctly once it has been released. Unlike with curated web services, for example, once the OS is released, it is difficult and expensive to remediate errors or bugs. Further, when it fails, you (as the developer responsible) will often hear that it is failing, but will not know why. Users don't always feel that they have an inclination or an obligation to send you their codes when their codes are crashing your OS. Instead, they give your work a bad reputation and try to encourage their co-workers (and their management) to use a different technology.

Thus, you must test your OS extensively before it ships. While the goal is impossible to achieve completely, you should exercise your OS so much more rigorously than any "normal" user would that any remaining bugs are confined to "corner cases" or "anomalous operational circumstances."

Please read the previous paragraph again before we move on to discuss autograding and Gradescope.

In this class, to make the lives of the TAs barely tolerable, we will use a web-based technology called Gradescope to collect and grade your programming assignments. Gradescope includes an autograding feature that will build your code and apply a fixed set of tests to it. It then compares the textual output of these tests to the textual output of a known-correct solution and displays the difference. Unfortunately, when they differ, Gradescope colors the test red and considers the test "failed," and when they are the same, it colors the test green and considers the test "passed."

While this circumstance sounds reasonable enough, it turns out to be the source of tremendous frustration, particularly if you do not understand the previous paragraphs.

Gradescope is not a Q/A (quality assurance) system.

When you submit your assignments, it runs a basic set of tests to sanity check your solution. The sanity checks are designed to ensure that the code build is structured properly and that the basics are, more or less, present. As a consequence, passing all of the submission tests BEFORE the submission deadline does NOT mean that your solution is correct -- it means that it is not obviously broken.

After the deadline for each assignment, Gradescope will automatically stop accepting submissions (it will retain the last solution you submitted before the deadline). It will then switch to use a different set of tests and your grade will be determined by these final acceptance tests.

All submission tests are available in /cs/faculty/rich/cs170/test_execs.

The final acceptance tests will not be made public.

Thus the first source of frustration is illustrated by the following hypothetical scenario. CS170 Team A submits (usually repeatedly) potential solutions to a lab assignment until Gradescope reports all sanity checks as green. Team A then considers its solution both "done" and "correct" and it ceases to test and enhance its solution. The due date passes, Gradescope runs the acceptance tests, and the solution fails one or more of these tests. Team A knows that the next lab assignment depends on the correct function of this lab assignment and so wants access to the final acceptance tests to debug its solution, but the acceptance tests are not public.

Adversarial Testing

This method of grading is designed to mimic a software engineering practice called adversarial testing which is used in contexts (like operating systems) where software quality is at a premium and it is difficult to develop an exhaustive set of acceptance tests. It is certain that Team A can "make their solution work" for any test that they can use as a debugging tool.

For example, once you see the correct output, you could simply write a program that prints the correct output and submit that in place of your OS. Gradescope will run that program, compare its output to the correct output, and if your print statements are correct, record that your solution passes the test.

Less degenerately, most students can "code to the test," meaning that once the test is completely understood, they can write a code that passes the test but does not necessarily implement a fully correct solution.

In this class, you will need to write your own tests and use them on your submissions, and those tests must be extensive enough to "cover" the things that the unseen final acceptance tests cover. We will not test "tricky" corner cases or features that are undocumented in the Linux documentation. However, you will NOT be able to rely on the sanity checks performed by Gradescope to determine whether your solution is correct and complete.

How do I know when I'm finished?

In some sense, you don't, since it is impossible to test everything. However, as mentioned, we will only test the "core" functionality of your OS so once you have convinced yourself that you have tested that core functionality thoroughly, you can be reasonably assured that your testing covers the acceptance tests.

How do I know if my solution is correct?

Here, the answer is to use Linux as a digital clone. If your solution works the same way that Linux works, then it is correct.

That sounds deceptively simple, and it is, but perhaps not as simple as one might expect.

The first thing to realize is that every test you write for your OS (and every test we supply), you can compile and run on Linux itself. The lab preparation instructions and the KOS lecture notes describe how to build tests for your labs using an (elaborate) cross-compiler. If you simply build the same tests using the version of gcc that is installed on csil.cs.ucsb.edu you can run the test on csil.cs.ucsb.edu and compare the output to the output from your lab.

However.

There are a few caveats. The first is that Linux behaves differently when you type things in from the keyboard compared to when it reads input from a file. This page describes the issues in detail and also a way (if you are really serious about implementing Linux functionality) to make your solution perfectly congruent.

In this class, however, the limitations of Gradescope prevent us from testing Linux keyboard input. Thus you should test your codes using a shell redirect to give them input. Put another way, the TAs will not be launching your OS and typing inputs to it via the keyboard. Because Linux behaves differently (sometimes) when you use the keyboard for input, it is important that you test using file input as well.

Operating Systems are Asynchronous

Another way in which autograding can cause consternation is due to the asynchronous nature of an operating system. Hopefully the lectures will make this issue clear, but in case they do not, consider the following example.

Imagine that you have written a test code that looks like


#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>	/* fork(), getpid() */

int main()
{
	int pid;
	int my_pid;
	int i;

	pid = fork();
	if(pid < 0) {
		exit(1);
	}
	my_pid = getpid();
	for(i=0; i < 10; i++) {
		printf("pid: %d, i: %d\n",my_pid,i);
	}
	exit(0);
}

Don't worry if, before Lab2, you don't understand this code exactly. What it does is to create two processes (via the fork() system call), and each one prints out its process id and a counter.

Here is the Linux output when I ran it early one morning before the quarter began

pid: 9533, i: 0
pid: 9533, i: 1
pid: 9533, i: 2
pid: 9533, i: 3
pid: 9533, i: 4
pid: 9533, i: 5
pid: 9533, i: 6
pid: 9533, i: 7
pid: 9533, i: 8
pid: 9533, i: 9
pid: 9534, i: 0
pid: 9534, i: 1
pid: 9534, i: 2
pid: 9534, i: 3
pid: 9534, i: 4
pid: 9534, i: 5
pid: 9534, i: 6
pid: 9534, i: 7
pid: 9534, i: 8
pid: 9534, i: 9

And here is the output from a working lab solution

pid: 1, i: 0
pid: 2, i: 0
pid: 1, i: 1
pid: 2, i: 1
pid: 1, i: 2
pid: 2, i: 2
pid: 1, i: 3
pid: 2, i: 3
pid: 1, i: 4
pid: 2, i: 4
pid: 1, i: 5
pid: 2, i: 5
pid: 1, i: 6
pid: 2, i: 6
pid: 1, i: 7
pid: 2, i: 7
pid: 1, i: 8
pid: 2, i: 8
pid: 1, i: 9
pid: 2, i: 9

If the lab solution is working, why are the outputs different?

There are two answers. The first has to do with asynchrony. Our lab solutions use a different scheduler from the one that Linux uses to decide what to do "next" when the code does not define an explicit order. The C program does not define the order in which the two processes execute, so the OS is free to choose any legal order. In this example, Linux chose to run one process (pid: 9533) until it finished and then to switch to the other process, but the lab solution interleaved them. Both orderings are correct. The second answer is that Linux and KOS assign process IDs from different ranges, so the pid values themselves differ.

How can you tell, then, if your answer is correct? To do so, you will need to understand exactly what the test is doing and what the OS system call implements. In this case, you need to understand the following:

The fork() system call leaves two processes running (the caller and a new copy of it) that are unsynchronized (meaning any interleaving of the execution of these processes is legal).
The printf() library call makes a write() system call.
The write() system call will print out everything in the printf() statement without interruption.
Linux and KOS use different ranges of numbers for process IDs.

Armed with this information you can determine that the lab solution is consistent with the Linux solution.

When you have generated enough tests whose results are consistent with Linux, you can be reasonably sure that your solution is correct.

Operating System Conventions

While the KOS software infrastructure may seem contrived, it isn't. In particular, the cross-compiler that generates test programs for the MIPS R3000 is the actual version of GNU gcc that shipped with the Ultrix (a forerunner of Linux) OS. Legitimately, the C compiler chooses variable layouts in process memory according to its own internal mapping algorithms, which vary from compiler to compiler and OS to OS.

Ordinarily, the specific variable addresses that the compiler chooses are of little interest. However, in this class, it is often useful to print out and examine variable layouts, particularly on the stack. Worse, the "cook books" that many students choose to use to complete the lab assignments rely on an understanding of these variable layouts and the compiler's control over them.

As an example, consider the program argtest which prints the addresses of argc and argv on the stack.


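#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* strlen() */
#include <unistd.h>     /* write() */

int main(int argc, char **argv, char **envp)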
{
        int i;
        char buf[256];


        sprintf(buf, "&argc is -->%u<--\n", &argc);
        write(1, buf, strlen(buf));
        sprintf(buf, "argc is -->%u<--\n", argc);
        write(1, buf, strlen(buf));
        sprintf(buf, "argv is -->%u<--\n", argv);
        write(1, buf, strlen(buf));
        sprintf(buf, "envp is -->%u<--\n", envp);
        write(1, buf, strlen(buf));
        for (i=0; i < argc; i++) {
                sprintf(buf, "argv[%d] is (%u) -->%s<--\n", i, argv[i], argv[i]);
                write(1, buf, strlen(buf));
        }
        if (envp != NULL) {
          sprintf(buf, "envp[0] is -->%s<--\n", envp[0]);
          write(1, buf, strlen(buf));
        }
        exit(&argc);
}

Here is the output from a correct solution

&argc is -->1048472<--
argc is -->4<--
argv is -->1048520<--
envp is -->0<--
argv[0] is (1048556) -->argtest<--
argv[1] is (1048551) -->Rex,<--
argv[2] is (1048548) -->my<--
argv[3] is (1048543) -->man!<--

and here is the output shown in one of the cook books


&argc is -->1048472<--
argc is -->4<--
argv is -->1048520<--
envp is -->0<--
argv[0] is (1048540) -->argtest<--
argv[1] is (1048548) -->Rex,<--
argv[2] is (1048553) -->my<--
argv[3] is (1048556) -->man!<--

Are they both correct?

Turns out that they are. The C compiler requires that the address of argc and the address of argv[0] be in certain locations on the stack, and that argv be an array that contains argc+1 elements (the last element being NULL). However, once it can locate argc and argv[0], it doesn't much care where the strings that the argv elements point to are stored.

Worse, the order in memory where the strings are stored also doesn't matter. Each string must be stored contiguously and must be NULL terminated, and the first string ("argtest" in this example) must have its address in argv[0], the second string ("Rex,") must have its address in argv[1], and so on, but the strings themselves can be anywhere.

The first solution and second solution, then, simply differ in the order in which they chose to lay out the strings. Stacks in Linux grow from high addresses down to low addresses. In solution 1, the first argument (the string pointed to by argv[0]) starts at a higher address (1048556 in the example) than the second argument (1048551 in the example), and so on, with each successive argument starting at a lower address than the one before it. If you count the characters and the NULLs you'll see that they are all next to each other in memory. Further, the address of argv[0] (listed as argv in the output) is lower than the lowest address shown for any string. Thus you can deduce that the strings are on the stack at higher addresses than argv and that the arguments are placed on the stack in reverse order relative to the addresses.

In the second example from the cook book, argc and argv are in the same place, but the solution puts the first argument at a lower address than the second argument, and so on. Thus in the second solution, the lower arguments occupy lower addresses on the stack (not higher ones as in the first solution).