CS170: Lab 1 - Helpful Hints from your Aunt Heloise

CS170: Lab 1 - Threaded Matrix Multiply and How I Went About It

You may have a good idea of how you plan to implement the code necessary to complete Lab 1. If so, then you should use your idea. There are many ways to accomplish this lab and as long as your solution conforms to the requirements, it will be a fine solution.

However if the assignment seems a little daunting, then perhaps a discussion of the implementation strategy I used will be helpful.

Understand the Problem

The first thing I did was to read up on matrix multiply. You should do this too. Like me, you have probably seen the algorithm some time in the past (in my case it has been more than a decade) and you may have even coded it before. Still, the devil is often in the details. Knowing what the code must do ahead of time is really the place to start.

In that vein, you should also read the text of the assignment carefully. And then, you should read it again -- carefully. Understanding the full assignment completely makes it easier to see how to plan and execute it. If you try and start before you know how it will end, you may wind up coding yourself into a corner. I've made this mistake more times than I can count when under a deadline. You should avoid this common pitfall by having a complete grasp of what is required before you begin and that grasp will only come from reading the assignment CAREFULLY.

Then, before you begin, read the lecture notes from the class on pthreads -- again, carefully. You will need to understand how pthreads creates threads and synchronizes between them. If you don't quite get how that works, read through the code examples, keep reading, and do some research until you are comfortable with the basics.

Make a Staged Plan

The key to this lab (and most of the other labs) is not to try and code everything at the beginning. Yes -- I know. The strategy of writing everything first and then debugging might lead you to believe that you will get more partial credit since you can always claim it was "almost working." Unfortunately, your instructor doesn't quite buy this argument in the same way that the TAs won't buy it, your future boss won't buy it, and your co-workers won't buy it when you are late with your part of a joint project. In this class, you will be much better off with working code than with a file full of C programming that you've crammed in at the last minute.

Therefore, your best strategy is to try and get the program working in stages. To do so, you are going to want to write modules and test routines at the same time. Each time you add a feature you should also write a function to test that feature.

Believe me. This is the way to become comfortable with large-scale projects that need to be completed on a deadline. You will go far if you can master this skill.

For the matrix multiply code I decomposed the development (including test code development) into the following stages (to be completed in order):

1. write the I/O routines
2. write matrix multiply without threads
3. convert this non-threaded matrix multiply code to work with a single thread
4. convert the single-thread version to do a strip decomposition with multiple threads
5. convert the strip-decomposition to do rectangular decomposition (the extra credit solution)

Each stage involves writing the code and one or more test codes and I didn't move on to a higher numbered stage before finishing a previous stage.

Here is the reasoning I used and some of the testing I did.

Writing the I/O routines

First, you need the I/O routines no matter what you do. Trying to retrofit them at the end sounds safe but in my experience you can introduce a whole bunch of nuisance bugs right before the deadline when you leave I/O and format specification to the last minute.

You will need (at least) one routine to read a matrix in and another to print a matrix out. You might need other routines for debugging, but I managed the assignment with just those two.

Writing the I/O routines first forces you to think about your data structures. If you followed my advice so far, you will notice that a matrix can be represented as a data structure with rows, columns, and space to hold the elements in row-major order. For example,

struct matrix
{
	int rows;
	int columns;
	double *data;
};

Notice that to create a matrix using this structure you will need to use malloc() to allocate the space you need for the data dynamically. Similarly, when you deallocate a matrix, you'll need to free that space.

The way I handle this situation in a C program is to write a constructor and a destructor for each of my internal data types. Thus I wrote a routine to allocate a matrix and another to free one that I use any place in the code where I need a matrix data structure.
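As a minimal sketch of that constructor/destructor pair (the names matrix_alloc, matrix_free, and matrix_at are my own placeholders, not names the assignment requires), something like the following would work with the structure shown above:

```c
#include <assert.h>
#include <stdlib.h>

struct matrix
{
	int rows;
	int columns;
	double *data;	/* rows * columns elements in row-major order */
};

/*
 * Constructor (hypothetical name): returns a zero-filled matrix, or NULL
 * if the dimensions are not positive or the allocation fails.
 */
struct matrix *matrix_alloc(int rows, int columns)
{
	struct matrix *m;

	if (rows <= 0 || columns <= 0)
		return NULL;

	m = malloc(sizeof(*m));
	if (m == NULL)
		return NULL;

	m->data = calloc((size_t)rows * (size_t)columns, sizeof(double));
	if (m->data == NULL) {
		free(m);
		return NULL;
	}
	m->rows = rows;
	m->columns = columns;
	return m;
}

/* Destructor (hypothetical name): safe to call with NULL. */
void matrix_free(struct matrix *m)
{
	if (m == NULL)
		return;
	free(m->data);
	free(m);
}

/* Row-major accessor: element (i, j) lives at data[i * columns + j]. */
double *matrix_at(struct matrix *m, int i, int j)
{
	assert(i >= 0 && i < m->rows && j >= 0 && j < m->columns);
	return &m->data[(size_t)i * m->columns + j];
}
```

Funneling every allocation through one pair of routines means there is exactly one place to look when a leak or a bad dimension shows up.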

In my code, the read routine takes a file name and returns a pointer to a matrix (my own data structure) and my print routine is a void function that takes a matrix as an argument and prints it to stdout. If the read fails to parse the input file (for any reason) it prints a message and returns NULL. Any routine calling the read routine needs to test for NULL to know if an error has occurred.

I tested the I/O routines by

generating a random file using the program provided in the assignment write up
reading it into a matrix data structure
printing out that data structure in the required format
comparing the print out (in a file) to the input file using the Linux diff command.

When you can read a matrix text file into a data structure that can be used for computation and you can print it back out again exactly as you read it in you have solved three problems. First, you know you can read and parse the input format, second you know you can generate output in the right format, and third you know you have an accurate internal matrix representation to use for computation.

You should also test error cases. Make sure that your read routine can detect when the file is corrupt or incomplete. What happens if the row or column number is negative? What happens if your row and column specifications say there should be 100 elements but there aren't enough? Too many?
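One way to catch the "not enough" and "too many" cases is to count elements as you read them. The sketch below assumes the elements are plain whitespace-separated numbers, which may not match the exact file format in the assignment write-up, so treat it as an illustration of the checks rather than a drop-in parser. The name read_elements and its return codes are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read exactly `expected` whitespace-separated
 * numbers from fp into out.
 * Returns  0 on success,
 *         -1 if there are too few elements or one fails to parse,
 *         -2 if anything is left in the file after the last element.
 */
int read_elements(FILE *fp, double *out, long expected)
{
	long i;
	double extra;

	assert(fp != NULL && out != NULL);

	if (expected <= 0)
		return -1;

	for (i = 0; i < expected; i++) {
		if (fscanf(fp, "%lf", &out[i]) != 1)
			return -1;	/* ran out of data or hit garbage */
	}

	/* after the last element only whitespace and EOF are acceptable */
	if (fscanf(fp, "%lf", &extra) != EOF)
		return -2;

	return 0;
}
```

The row and column counts should be checked for being positive before this is called, so that the expected element count is meaningful.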

Coding the Matrix Multiply Algorithm

Next, I wrote a simple matrix multiply function that uses my internal matrix data structure. It takes two arguments (an A matrix and a B matrix) and returns the product matrix (or NULL if there is an error).
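For reference, here is a sketch of what such a function can look like with the row-major structure from earlier. The name matrix_multiply is a placeholder, and the caller is assumed to free the returned product.

```c
#include <assert.h>
#include <stdlib.h>

struct matrix
{
	int rows;
	int columns;
	double *data;	/* row-major */
};

/*
 * Returns a newly allocated product A * B, or NULL if the inner
 * dimensions disagree or memory runs out.  The caller frees the result.
 */
struct matrix *matrix_multiply(const struct matrix *A, const struct matrix *B)
{
	struct matrix *C;
	int i, j, k;

	if (A == NULL || B == NULL || A->columns != B->rows)
		return NULL;
	assert(A->data != NULL && B->data != NULL);

	C = malloc(sizeof(*C));
	if (C == NULL)
		return NULL;
	C->rows = A->rows;
	C->columns = B->columns;
	C->data = malloc((size_t)C->rows * (size_t)C->columns * sizeof(double));
	if (C->data == NULL) {
		free(C);
		return NULL;
	}

	for (i = 0; i < A->rows; i++) {
		for (j = 0; j < B->columns; j++) {
			double sum = 0.0;

			/* dot product of row i of A with column j of B */
			for (k = 0; k < A->columns; k++)
				sum += A->data[i * A->columns + k] *
				       B->data[k * B->columns + j];
			C->data[i * C->columns + j] = sum;
		}
	}
	return C;
}
```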

I tested this function in two ways. First, I crafted (by hand) a few small test matrix files in the format recognized by the I/O routines. Then (again by hand) I worked out what the matrix product of these matrices should be and compared that to a print out of the product.

I also wrote the argument parsing code at this stage that I used for the other stages. Thus I could run

./simple-matrix-multiply -a a-matrix-file.txt -b b-matrix-file.txt

and get a printout of the product in the specified format. That argument parsing code I reused throughout the rest of the assignment.
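A sketch of argument parsing for the -a and -b flags shown above, using the POSIX getopt() routine. The struct options type and the parse_args name are placeholders of mine; your own code may organize this differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical container for the parsed command line. */
struct options
{
	const char *a_file;	/* argument of -a */
	const char *b_file;	/* argument of -b */
};

/* Returns 0 on success, -1 on an unknown flag or a missing file name. */
int parse_args(int argc, char *argv[], struct options *opts)
{
	int c;

	assert(opts != NULL);
	opts->a_file = NULL;
	opts->b_file = NULL;
	optind = 1;	/* restart the scan so this can be called again */

	while ((c = getopt(argc, argv, "a:b:")) != -1) {
		switch (c) {
		case 'a':
			opts->a_file = optarg;
			break;
		case 'b':
			opts->b_file = optarg;
			break;
		default:
			return -1;	/* getopt already printed a message */
		}
	}

	if (opts->a_file == NULL || opts->b_file == NULL)
		return -1;
	return 0;
}
```

Adding a thread-count flag later is then a matter of extending the option string and adding one more case.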

I also ran a "leak test" at this stage. The matrix data structure is dynamically allocated (it needs to be given that its size comes from the file). In any C program you need to be very careful about making sure any memory that is allocated is ultimately deallocated. To do this, I wrote a program with a loop that goes forever. In that loop I

allocate two matrices of fairly large size
fill them in with random numbers
call the matrix multiply function to get the product
free the two matrices and the product
repeat

Notice that this program does no input or output. I run it and while it is running, in another window I run the Linux utility top and watch the RSS value. If it goes up continuously (even slowly) the code has a memory leak. Fix these early as they are hard to fix late (especially in a threaded code).
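The loop described above might look roughly like this. All of the function names are placeholders, and the matrix routines are repeated in a compact form only so the sketch stands alone; in practice you would call your own.

```c
#include <assert.h>
#include <stdlib.h>

struct matrix
{
	int rows;
	int columns;
	double *data;
};

struct matrix *matrix_alloc(int rows, int columns)
{
	struct matrix *m = malloc(sizeof(*m));

	if (m == NULL)
		return NULL;
	m->rows = rows;
	m->columns = columns;
	m->data = malloc((size_t)rows * (size_t)columns * sizeof(double));
	if (m->data == NULL) {
		free(m);
		return NULL;
	}
	return m;
}

void matrix_free(struct matrix *m)
{
	if (m != NULL) {
		free(m->data);
		free(m);
	}
}

struct matrix *matrix_multiply(const struct matrix *A, const struct matrix *B)
{
	struct matrix *C;
	int i, j, k;

	if (A->columns != B->rows)
		return NULL;
	C = matrix_alloc(A->rows, B->columns);
	if (C == NULL)
		return NULL;
	for (i = 0; i < A->rows; i++) {
		for (j = 0; j < B->columns; j++) {
			double sum = 0.0;

			for (k = 0; k < A->columns; k++)
				sum += A->data[i * A->columns + k] *
				       B->data[k * B->columns + j];
			C->data[i * C->columns + j] = sum;
		}
	}
	return C;
}

static void fill_random(struct matrix *m)
{
	int i, n = m->rows * m->columns;

	for (i = 0; i < n; i++)
		m->data[i] = (double)rand() / RAND_MAX;
}

/* One round: allocate, fill, multiply, free.  Returns 0 or -1. */
int leak_test_round(int n)
{
	struct matrix *A, *B, *C = NULL;
	int ok = 0;

	assert(n > 0);
	A = matrix_alloc(n, n);
	B = matrix_alloc(n, n);
	if (A != NULL && B != NULL) {
		fill_random(A);
		fill_random(B);
		C = matrix_multiply(A, B);
		ok = (C != NULL);
	}
	matrix_free(A);
	matrix_free(B);
	matrix_free(C);
	return ok ? 0 : -1;
}

/* rounds < 0 means loop forever; returns the number of rounds completed. */
long leak_test(int n, long rounds)
{
	long done = 0;

	while (rounds < 0 || done < rounds) {
		if (leak_test_round(n) != 0)
			break;
		done++;
	}
	return done;
}
```

Calling leak_test(500, -1) from main() gives the run-forever behavior; the size 500 is an arbitrary choice for "fairly large". Watch the process in top while it runs.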

Coding a single-thread version of matrix multiply

My next step was to create a version of matrix multiply that would spawn the actual product computation in a thread. The purpose here is to work out how to pass the needed arguments to the thread and to get the synchronization between the master thread and worker threads set up.

Rather than modify the previous code, I made a copy and used it as a template for the new code. That way I could compare the output of a simple, unthreaded matrix multiply with that of one that does the multiply in a thread.

I also changed the matrix multiply function interface. Rather than have it allocate the C matrix, I changed it to take the C matrix as an argument. Then, the master thread needs to allocate all three matrices and to pass them in (via an argument structure) to the worker thread. Here is an example of my argument structure:

struct thread_args
{
	int id;			/* sequential thread ID */
	struct matrix *A;	/* A matrix */
	struct matrix *B;	/* B matrix */
	struct matrix *C;	/* C matrix holds the product */
};

The master thread then does the following

read in A matrix
read in B matrix
allocate C matrix (of the correct size)
allocate a thread_args structure
fill in the fields of the thread_args structure
pthread_create a thread that takes the thread_args structure
pthread_join with that thread
print out the C matrix
free the thread_args structure and the A, B, and C matrices
exit

and the worker thread does

unmarshal the arguments (A, B, and C matrix)
do error checking
do matrix multiply of A * B putting result in C
return
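The create/join handshake in the two lists above can be sketched as follows. The function names are placeholders, and for brevity this version keeps the thread_args structure on the master's stack rather than allocating it; that is safe only because the master joins before returning. File I/O and printing are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct matrix
{
	int rows;
	int columns;
	double *data;
};

struct thread_args
{
	int id;			/* sequential thread ID */
	struct matrix *A;	/* A matrix */
	struct matrix *B;	/* B matrix */
	struct matrix *C;	/* C matrix holds the product */
};

/* Worker: returns NULL on success, a non-NULL pointer on error. */
static void *multiply_worker(void *arg)
{
	struct thread_args *ta = arg;
	struct matrix *A, *B, *C;
	int i, j, k;

	assert(ta != NULL);

	/* unmarshal the arguments */
	A = ta->A;
	B = ta->B;
	C = ta->C;

	/* error checking: the dimensions must be compatible */
	if (A->columns != B->rows ||
	    C->rows != A->rows || C->columns != B->columns)
		return arg;

	for (i = 0; i < A->rows; i++) {
		for (j = 0; j < B->columns; j++) {
			double sum = 0.0;

			for (k = 0; k < A->columns; k++)
				sum += A->data[i * A->columns + k] *
				       B->data[k * B->columns + j];
			C->data[i * C->columns + j] = sum;
		}
	}
	return NULL;
}

/* Master side: one create, one join.  Returns 0 on success, -1 on error. */
int multiply_in_thread(struct matrix *A, struct matrix *B, struct matrix *C)
{
	struct thread_args ta;
	pthread_t tid;
	void *status;

	ta.id = 0;
	ta.A = A;
	ta.B = B;
	ta.C = C;

	if (pthread_create(&tid, NULL, multiply_worker, &ta) != 0)
		return -1;
	if (pthread_join(tid, &status) != 0)
		return -1;
	return status == NULL ? 0 : -1;
}
```

On most Linux systems this needs to be compiled with the -pthread option.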

For testing, I compared the output of this program to the output of the simple, unthreaded matrix multiply in the previous step. The results should match exactly. If they don't, there is an error.

Coding the Full Credit Threaded Version

At this point, I had the following working and tested pretty thoroughly:

I/O routines for converting matrix files to an internal matrix data structure and back again
a simple, unthreaded matrix multiply that could take inputs from files (using the I/O routines) and generate a matrix product that it prints out
a slightly more complicated version that uses a single thread to compute the product and that puts this product in a C matrix that the master thread allocates and frees

and I had defined an internal matrix structure and an argument structure for the single-thread version.

My next step was to use the single-threaded version as the basis for a multi-threaded version that does row-wise partitioning (i.e. a strip decomposition). To do so, I started by modifying the argument structure to include a starting row and a row count. Each thread, then, will start at the starting row specified for it and run the matrix multiply algorithm over that many consecutive rows to produce its part of the C matrix. It is the job of the master thread to specify the exact starting row and the exact number of rows each thread should use. Since my previous version computes all of the C matrix, the modification to the worker thread was to restrict it to work only on its "strip" of the C matrix defined by the starting row and the row count.

The master thread, then, had to be changed to do all of the partitioning. In particular it had to

take an additional argument from the command line indicating how many threads to use
divide the rows of the C matrix up by the number of threads
make sure that each row belongs to exactly one thread (taking care of the case where the number of threads does not divide the number of rows evenly)
make sure that the row counts across threads are as evenly balanced as possible (i.e. the remainder rows, if there are any, are distributed as evenly as possible among the threads)
allocate an argument structure for each thread
fill it in with that thread's specific information and pointers to the A, B, and C matrices
spawn each thread using pthread_create
join with each thread using pthread_join
check for any error conditions
print out the C matrix when all threads have completed
free all dynamically allocated data structures
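The row-balancing steps in the list above come down to a little integer arithmetic. One common way to do it (a sketch, with strip_bounds as a made-up name) is to give every thread rows / nthreads rows and hand one extra row to each of the first rows % nthreads threads:

```c
#include <assert.h>

/*
 * Thread `id` (0-based) of `nthreads` gets rows [*start, *start + *count).
 * Counts differ by at most one, and a thread may get zero rows when
 * there are more threads than rows.
 */
void strip_bounds(int rows, int nthreads, int id, int *start, int *count)
{
	int base, extra;

	assert(rows >= 0 && nthreads > 0 && id >= 0 && id < nthreads);

	base = rows / nthreads;		/* rows every thread gets */
	extra = rows % nthreads;	/* leftover rows to spread around */

	*count = base + (id < extra ? 1 : 0);
	*start = id * base + (id < extra ? id : extra);
}
```

For example, 100 rows over 7 threads gives two strips of 15 rows followed by five strips of 14 rows.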

It is possible to do a create followed immediately by a join, one at a time, but then there would be no parallel speed-up. Thus my code creates all threads before it joins with any threads.
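Here is a sketch of the create-everything-then-join-everything pattern with a strip decomposition. The function names and the start_row and row_count field names are my own placeholders, and the master is assumed to have already allocated C with the right dimensions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct matrix
{
	int rows;
	int columns;
	double *data;
};

struct thread_args
{
	int id;
	struct matrix *A;
	struct matrix *B;
	struct matrix *C;
	int start_row;	/* first row of this thread's strip */
	int row_count;	/* number of consecutive rows in the strip */
};

/* Each worker fills in only its own strip of C. */
static void *strip_worker(void *arg)
{
	struct thread_args *ta = arg;
	int i, j, k;

	assert(ta != NULL);
	for (i = ta->start_row; i < ta->start_row + ta->row_count; i++) {
		for (j = 0; j < ta->B->columns; j++) {
			double sum = 0.0;

			for (k = 0; k < ta->A->columns; k++)
				sum += ta->A->data[i * ta->A->columns + k] *
				       ta->B->data[k * ta->B->columns + j];
			ta->C->data[i * ta->C->columns + j] = sum;
		}
	}
	return NULL;
}

/* Returns 0 on success, -1 on bad arguments or a pthread failure. */
int multiply_strips(struct matrix *A, struct matrix *B, struct matrix *C,
		    int nthreads)
{
	pthread_t *tids;
	struct thread_args *args;
	int t, created = 0, rc = 0, base, extra, next = 0;

	if (nthreads <= 0 || A->columns != B->rows ||
	    C->rows != A->rows || C->columns != B->columns)
		return -1;

	tids = malloc((size_t)nthreads * sizeof(*tids));
	args = malloc((size_t)nthreads * sizeof(*args));
	if (tids == NULL || args == NULL) {
		free(tids);
		free(args);
		return -1;
	}

	base = C->rows / nthreads;
	extra = C->rows % nthreads;

	/* phase 1: create every thread before joining any of them */
	for (t = 0; t < nthreads; t++) {
		args[t].id = t;
		args[t].A = A;
		args[t].B = B;
		args[t].C = C;
		args[t].start_row = next;
		args[t].row_count = base + (t < extra ? 1 : 0);
		next += args[t].row_count;
		if (pthread_create(&tids[t], NULL, strip_worker, &args[t]) != 0) {
			rc = -1;
			break;
		}
		created++;
	}

	/* phase 2: wait for every thread that was actually started */
	for (t = 0; t < created; t++) {
		if (pthread_join(tids[t], NULL) != 0)
			rc = -1;
	}

	free(tids);
	free(args);
	return rc;
}
```

No locking is needed here because each element of C is written by exactly one thread and A and B are only read.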

I tested this version for correctness against the previous two versions using the test matrices. I also ran some speed-up tests and some error tests using random matrices (large and small).

I also added a print statement to the worker threads so each thread would print out exactly what elements it was producing in the C matrix. For small examples, I checked to make sure that each element is produced exactly once and that the threads were each computing a separate strip of the C matrix.

Finally I did some error testing to make sure that various error conditions were being handled correctly both in the worker threads and in the master thread. For example, I tested to make sure that the code worked when the number of rows in the C matrix is smaller than the number of threads specified on the command line. I also tested to make sure that I got the same (correct) answer when the number of threads is changed, thereby causing a different set of remainders when the row count is divided. For example, I generated a 100 x 100 C matrix and did a 10 thread run (each strip should be 10 rows wide). Then I did a 9, 8, 7, 6, 5, 4, 3, 2, and 1 thread run using the same input files. The times are different, but the answer needs to be exactly the same for each run.

Coding the Extra Credit Rectangle Version

Once the Full Credit version was working, I modified it to handle partitioning of both dimensions. The tricky parts here are to handle all of the remainder cases so that the rectangles are nearly the same area, that they fit together precisely to tessellate the C matrix, and that all threads get used.

Again this version has to get exactly the same answers as the previous three versions when run with different thread counts.

The difference between good and great

A good solution is one that gets full credit. A great solution is one that gets full credit and shows the craftsmanship in your work. Even after I had the solutions working I went back to make sure that they were clean and well written so that the TAs (who will use them to help you complete this assignment) could understand what they do. The more you can help someone reading your code understand your logic, the better your code is, even when it already works properly.