However, if the assignment seems a little daunting, then perhaps a discussion of the implementation strategy I used will be helpful.
In that vein, you should also read the text of the assignment carefully. And then, you should read it again -- carefully. Understanding the full assignment makes it easier to see how to plan and execute it. If you try to start before you know how it will end, you may wind up coding yourself into a corner. I've made this mistake more times than I can count when under a deadline. You should avoid this common pitfall by having a complete grasp of what is required before you begin, and that grasp will only come from reading the assignment CAREFULLY.
Then, before you begin, read the lecture notes from the class on pthreads -- again, carefully. You will need to understand how pthreads creates threads and synchronizes between them. If you don't quite get how that works, read through the code examples, and keep reading and doing some research until you are comfortable with the basics.
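If it helps, here is a minimal sketch of the basic create/join pattern (compile with -pthread); it isn't part of the assignment, just the idiom you will use over and over:

#include <pthread.h>
#include <stdio.h>

/* Worker: prints its argument and returns. */
void *hello(void *arg)
{
    int *id = (int *)arg;
    printf("hello from thread %d\n", *id);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int id = 1;

    /* Create one thread, then wait for it to finish. */
    if (pthread_create(&tid, NULL, hello, &id) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}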
Therefore, your best strategy is to try to get the program working in stages. To do so, you are going to want to write modules and test routines at the same time. Each time you add a feature you should also write a function to test that feature.
Believe me. This is the way to become comfortable with large-scale projects that need to be completed on a deadline. You will go far if you can master this skill.
For the matrix multiply code I decomposed the development (including test code development) into stages, to be completed in order. Here is the reasoning I used and some of the testing I did at each stage.
You will need (at least) one routine to read a matrix in and another to print a matrix out. You might need other routines for debugging, but I managed the assignment with just those two.
Writing the I/O routines first forces you to think about your data structures. If you followed my advice so far, you will notice that a matrix can be represented as a data structure with rows, columns, and space to hold the elements in row-major order. For example,
struct matrix {
    int rows;
    int columns;
    double *data;
};

Notice that to create a matrix using this structure you will need to use malloc() to allocate the space you need for the data dynamically. Similarly, when you deallocate a matrix, you'll need to free that space.
The way I handle this situation in a C program is to write a constructor and a destructor for each of my internal data types. Thus I wrote one routine to allocate a matrix and another to free one, and I use them anywhere in the code where I need a matrix data structure.
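As a sketch (the names and the error handling here are illustrative, not requirements), such a constructor/destructor pair might look like this:

#include <stdlib.h>

/* Allocate a rows x columns matrix, or return NULL on failure. */
struct matrix *matrix_alloc(int rows, int columns)
{
    struct matrix *m;

    if (rows <= 0 || columns <= 0)
        return NULL;

    m = malloc(sizeof(*m));
    if (m == NULL)
        return NULL;

    m->rows = rows;
    m->columns = columns;
    m->data = malloc((size_t)rows * (size_t)columns * sizeof(double));
    if (m->data == NULL) {
        free(m);
        return NULL;
    }
    return m;
}

/* Free a matrix created by matrix_alloc(). */
void matrix_free(struct matrix *m)
{
    if (m != NULL) {
        free(m->data);
        free(m);
    }
}

Every other routine that needs a matrix then goes through these two functions, which makes the leak testing described below much easier to reason about.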
In my code, the read routine takes a file name and returns a pointer to a matrix (my own data structure), and my print routine is a void function that takes a matrix as an argument and prints it to stdout. If the read routine fails to parse the input file (for any reason) it prints a message and returns NULL. Any routine calling the read routine needs to test for NULL to know whether an error has occurred.
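For concreteness, here is a sketch of what those two routines could look like, building on the matrix_alloc()/matrix_free() sketch above. The file layout assumed here (rows, then columns, then the elements, whitespace separated) is only a stand-in; use whatever format the assignment actually specifies.

#include <stdio.h>
#include <stdlib.h>

/* Read a matrix from a file; return NULL (after printing a message)
 * if the file cannot be parsed. */
struct matrix *read_matrix(const char *fname)
{
    FILE *fp;
    struct matrix *m;
    int rows, columns, i;

    fp = fopen(fname, "r");
    if (fp == NULL) {
        fprintf(stderr, "read_matrix: cannot open %s\n", fname);
        return NULL;
    }
    if (fscanf(fp, "%d %d", &rows, &columns) != 2 || rows <= 0 || columns <= 0) {
        fprintf(stderr, "read_matrix: bad header in %s\n", fname);
        fclose(fp);
        return NULL;
    }
    m = matrix_alloc(rows, columns);
    if (m == NULL) {
        fclose(fp);
        return NULL;
    }
    for (i = 0; i < rows * columns; i++) {
        if (fscanf(fp, "%lf", &m->data[i]) != 1) {
            fprintf(stderr, "read_matrix: %s is incomplete\n", fname);
            matrix_free(m);
            fclose(fp);
            return NULL;
        }
    }
    /* a real version should also reject trailing, extra elements */
    fclose(fp);
    return m;
}

/* Print a matrix to stdout, one row per line. */
void print_matrix(struct matrix *m)
{
    int i, j;

    for (i = 0; i < m->rows; i++) {
        for (j = 0; j < m->columns; j++)
            printf("%f ", m->data[i * m->columns + j]);
        printf("\n");
    }
}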
I tested the I/O routines by reading in small, hand-made matrix files and printing them back out, checking that what was printed matched the contents of the files.
You should also test error cases. Make sure that your read routine can detect when the file is corrupt or incomplete. What happens if the row or column count is negative? What happens if your row and column specifications say there should be 100 elements but the file has too few? Too many?
I tested the matrix multiply function in two ways. First, I crafted (by hand) a few small test matrix files in the format recognized by the I/O routines. Then (again by hand) I worked out what the matrix product of these matrices should be and compared that to a printout of the product.
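For reference, the simple, unthreaded multiply being tested here is just the classic triple loop; a sketch (with an illustrative name, and allocating the result itself, as this stage's version does) looks like this:

/* C = A * B for a simple, unthreaded multiply.  Returns a newly
 * allocated product matrix, or NULL if the shapes do not match or
 * allocation fails. */
struct matrix *matrix_multiply(struct matrix *A, struct matrix *B)
{
    struct matrix *C;
    int i, j, k;

    if (A == NULL || B == NULL || A->columns != B->rows)
        return NULL;

    C = matrix_alloc(A->rows, B->columns);
    if (C == NULL)
        return NULL;

    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->columns; j++) {
            double sum = 0.0;
            for (k = 0; k < A->columns; k++)
                sum += A->data[i * A->columns + k] * B->data[k * B->columns + j];
            C->data[i * C->columns + j] = sum;
        }
    }
    return C;
}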
At this stage I also wrote the argument parsing code that I reused throughout the rest of the assignment. Thus I could run

./simple-matrix-multiply -a a-matrix-file.txt -b b-matrix-file.txt

and get a printout of the product in the specified format.
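That parsing can be done with the standard getopt() routine; here is one possible sketch showing only the -a and -b flags from the example above (the threaded versions later also need a flag for the thread count):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char *a_file = NULL;
    char *b_file = NULL;
    int c;

    /* Parse -a <file> and -b <file>. */
    while ((c = getopt(argc, argv, "a:b:")) != -1) {
        switch (c) {
        case 'a':
            a_file = optarg;
            break;
        case 'b':
            b_file = optarg;
            break;
        default:
            fprintf(stderr, "usage: %s -a a-file -b b-file\n", argv[0]);
            exit(1);
        }
    }
    if (a_file == NULL || b_file == NULL) {
        fprintf(stderr, "usage: %s -a a-file -b b-file\n", argv[0]);
        exit(1);
    }

    /* ... read the matrices, multiply, print, and free them ... */
    return 0;
}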
I also ran a "leak test" at this stage. The matrix data structure is dynamically allocated (it needs to be given that its size comes from the file). In any C program you need to be very careful about making sure any memory that is allocated is ultimately deallocated. To do this, I wrote a program with a loop that goes forever. In that loop I
Rather than modify the previous code, I made a copy and used it as a template for the new code. That way I could compare the output of a simple, unthreaded matrix multiply with that of one that does the multiply in a thread.
I also changed the matrix multiply function interface. Rather than have it allocate the C matrix, I changed it to take the C matrix as an argument. Then, the master thread needs to allocate all three matrices and to pass them in (via an argument structure) to the worker thread. Here is an example of my argument structure:
struct thread_args {
    int id;              /* sequential thread ID */
    struct matrix *A;    /* A matrix */
    struct matrix *B;    /* B matrix */
    struct matrix *C;    /* C matrix holds the product */
};

The master thread then does the following:
read in A matrix
read in B matrix
allocate C matrix (of the correct size)
allocate a thread_args structure
fill in the fields of the thread_args structure
pthread_create a thread that takes the thread_args structure
pthread_join with that thread
print out the C matrix
free A, B, and C matrix
exit

and the worker thread does:
unmarshal the arguments (A, B, and C matrix)
do error checking
do matrix multiply of A * B putting result in C
return

For testing, I compared the output of this program to the output of the simple, unthreaded matrix multiply in the previous step. The results should match exactly. If they don't, there is an error.
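Here is a sketch of what that worker might look like, assuming the thread_args structure above and a multiply routine whose interface has been changed, as just described, to take the C matrix as an argument (the names are illustrative):

#include <pthread.h>
#include <stddef.h>

/* Assumed: a multiply routine that fills in an already-allocated C. */
void multiply_into(struct matrix *A, struct matrix *B, struct matrix *C);

void *worker(void *arg)
{
    struct thread_args *targs = (struct thread_args *)arg;

    /* Unmarshal the arguments and do some error checking. */
    if (targs == NULL || targs->A == NULL || targs->B == NULL || targs->C == NULL)
        return NULL;
    if (targs->A->columns != targs->B->rows)
        return NULL;

    /* C = A * B */
    multiply_into(targs->A, targs->B, targs->C);
    return NULL;
}

The master thread simply fills in a thread_args structure, passes a pointer to it to pthread_create(), and waits for the worker with pthread_join(), exactly as in the outline above.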
My next step was to use the single-threaded version as the basis for a multi-threaded version that does row-wise partitioning (i.e., a strip decomposition). To do so, I started by modifying the argument structure to include a starting row and a row count. Each thread, then, starts at the starting row specified for it and runs the matrix multiply algorithm for its count of consecutive rows of the C matrix. It is the job of the master thread to specify the exact starting row and the exact number of rows each thread should use. Since my previous version computed all of the C matrix, the modification to the worker thread was to restrict it to work only on its "strip" of the C matrix, defined by the starting row and the row count.
The master thread, then, had to be changed to do all of the partitioning. In particular it had to determine the number of worker threads, compute a starting row and a row count for each one (spreading any leftover rows around when the rows do not divide evenly), fill in one thread_args structure per thread, pthread_create all of the threads, and then pthread_join with each of them before printing and freeing the matrices.
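Here is a sketch of that partitioning logic, assuming the thread_args structure has grown start_row and row_count fields (the extension described above) and a worker that only computes its own strip; the function name and the way leftover rows are handed out are illustrative choices:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *worker(void *arg);   /* the strip worker described above */

/* Split the rows of C across nthreads workers, run them, and wait. */
void partition_and_run(struct matrix *A, struct matrix *B, struct matrix *C,
                       int nthreads)
{
    pthread_t *tids;
    struct thread_args *targs;
    int base, extra, row, i;

    /* Never create more threads than there are rows to compute. */
    if (nthreads > C->rows)
        nthreads = C->rows;
    if (nthreads < 1)
        nthreads = 1;

    tids = malloc(nthreads * sizeof(*tids));
    targs = malloc(nthreads * sizeof(*targs));
    if (tids == NULL || targs == NULL)
        exit(1);

    base = C->rows / nthreads;     /* rows every thread gets */
    extra = C->rows % nthreads;    /* leftover rows to spread around */

    for (row = 0, i = 0; i < nthreads; i++) {
        targs[i].id = i;
        targs[i].A = A;
        targs[i].B = B;
        targs[i].C = C;
        targs[i].start_row = row;
        targs[i].row_count = base + (i < extra ? 1 : 0);
        row += targs[i].row_count;
        if (pthread_create(&tids[i], NULL, worker, &targs[i]) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            exit(1);
        }
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    free(targs);
}

Dividing the rows this way guarantees that every row of C is assigned to exactly one thread no matter what the remainder is, which is exactly the property the tests below check for.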
I tested this version for correctness against the previous two versions using the test matrices. I also ran some speed-up tests and some error tests using random matrices (large and small).
I also added a print statement to the worker threads so each thread would print out exactly which elements it was producing in the C matrix. For small examples, I checked to make sure that each element was produced exactly once and that the threads were each computing a separate strip of the C matrix.
Finally I did some error testing to make sure that various error conditions were being handled correctly both in the worker threads and in the master thread. For example, I tested to make sure that the code worked when the number of rows in the C matrix is smaller than the number of threads specified on the command line. I also tested to make sure that I got the same (correct) answer when the number of threads is changed, thereby causing a different set of remainders when the row count is divided. For example, I generated a 100 x 100 C matrix and did a 10 thread run (each strip should be 10 rows wide). Then I did a 9, 8, 7, 6, 5, 4, 3, 2, and 1 thread run using the same input files. The times are different, but the answer needs to be exactly the same for each run.
Again, this version has to get exactly the same answers as the previous three versions when run with different thread counts.