CG is a spiffy algorithm for solving a system of linear equations of the form Ax = b for x, where A is a known (sparse) matrix of coefficients, b is a known right-hand-side vector, and x is an unknown vector. CG is newer and usually a lot faster than the Jacobi relaxation or SOR methods described in the book, provided the matrix A is symmetric and has all positive eigenvalues (that is, provided A is symmetric positive definite). The theory behind CG, while mathematically very pretty, is too far afield from this course for us to go into. (If you want to go into the theory, here is a good description.)
All you need to know for this project is what the algorithm does, which is pretty simple. CG is an iterative method: it starts out with a guess at the value of x (usually the vector of all zeros), and then it does "iterations", one after another, improving the guess at each one until the guess is close enough. The main work of an iteration is one multiplication of a (dense) vector by the (sparse) matrix A, plus two dot products of pairs of vectors. The algorithm (and its data structures and methods) is outlined in these slides. Also, there is a sequential Matlab code for CG here.
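To make the shape of an iteration concrete, here is a minimal sketch of the loop in C, following the structure of the Matlab cgsolve.m. It is only a sketch under my own naming (cg_solve, matvec, dot are placeholders, not course-supplied code), and the dense matvec is just a stand-in; your real code will use a sparse one. Note where the one matvec and the two dot products per iteration show up.

    /* Minimal CG sketch.  The dense matvec is for illustration only;
       all names here are placeholders, not part of any handout. */
    #include <math.h>
    #include <stdlib.h>

    static void matvec(int n, const double *A, const double *d, double *Ad) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) sum += A[i*n + j] * d[j];
            Ad[i] = sum;
        }
    }

    static double dot(int n, const double *u, const double *v) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += u[i] * v[i];
        return sum;
    }

    void cg_solve(int n, const double *A, const double *b, double *x,
                  int maxiter, double tol) {
        double *r  = malloc(n * sizeof *r);   /* residual b - A*x */
        double *d  = malloc(n * sizeof *d);   /* search direction */
        double *Ad = malloc(n * sizeof *Ad);  /* A times d        */
        for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; d[i] = b[i]; }
        double rtr = dot(n, r, r);
        double normb = sqrt(dot(n, b, b));
        for (int it = 0; it < maxiter && sqrt(rtr) > tol * normb; it++) {
            matvec(n, A, d, Ad);                 /* the one matvec   */
            double alpha = rtr / dot(n, d, Ad);  /* dot product #1   */
            for (int i = 0; i < n; i++) x[i] += alpha * d[i];
            for (int i = 0; i < n; i++) r[i] -= alpha * Ad[i];
            double rtrold = rtr;
            rtr = dot(n, r, r);                  /* dot product #2   */
            double beta = rtr / rtrold;
            for (int i = 0; i < n; i++) d[i] = r[i] + beta * d[i];
        }
        free(r); free(d); free(Ad);
    }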
The matvec routine is partly an exercise in data structures for irregular, sparse objects. You'll use a "Compressed Sparse Row" data structure that can store any sparse matrix in space proportional to the number of nonzero elements in the matrix. A description of the data structure is here.
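In case it helps to see it spelled out, here is one common C layout for CSR, together with the matvec over it. This is a sketch: the field names (rowstart, colindex, value) are my own choices, so follow the linked description for the exact layout the course expects.

    #include <stdlib.h>

    typedef struct {
        int n;          /* number of rows                                  */
        int nnz;        /* number of stored nonzeros                       */
        int *rowstart;  /* row i's nonzeros are entries rowstart[i] ..
                           rowstart[i+1]-1 of the arrays below; length n+1 */
        int *colindex;  /* column of each nonzero; length nnz              */
        double *value;  /* numerical value of each nonzero; length nnz     */
    } csr_matrix;

    /* y = A*x for a CSR matrix: time (and space) proportional to nnz. */
    void csr_matvec(const csr_matrix *A, const double *x, double *y) {
        for (int i = 0; i < A->n; i++) {
            double sum = 0.0;
            for (int k = A->rowstart[i]; k < A->rowstart[i+1]; k++)
                sum += A->value[k] * x[A->colindex[k]];
            y[i] = sum;
        }
    }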
There is a sequential Matlab code on the course web site. The Matlab code includes a harness with a data generator and validator, which you can use to check the correctness of your code. The data generator can produce an identity matrix, or the matrix whose graph is a regular k-by-k-by-k grid corresponding to the discretized Poisson equation in three dimensions. The Matlab routine "cgsolve.m" is the conjugate gradient solver. The sparse matvec is built into Matlab; it is just the line "Ad = A*d;"
In this option, your goal is to get CG to run efficiently on the very largest matrix you can, just like the bone density and earthquake simulations we discussed in class. You don't have to use the actual matrices from those simulations; you can generate synthetic data as described below, instead, to test your code.
In distributed memory, each processor will store some of the rows of the matrix; with p processors and an n-by-n matrix, each processor will store about n/p rows. Each processor will use the Compressed Sparse Row data structure for its own rows. You may want to rearrange the data, so that the rows on one processor correspond to mesh points near each other, in order to make the sequence of matvecs faster.
You will generate the matrix data distributed over the processors by rows, with n/p rows of an n-by-n matrix on each of p processors. Generating realistic data in parallel can be a bit tricky. You should first implement and test the trivial case of a diagonal matrix, for example with A(i,i) = i for all i, just as a reality check. Then you should write a parallel generator for the "model problem", which is the k-by-k 2D grid of n = k^2 equations in n = k^2 unknowns. Every row of the matrix has a +4 on the diagonal, and most of the rows also have four -1's: one place before and one place after the diagonal, and k places before and k places after the diagonal. However, the k equations for points on the leftmost side of the grid are missing the -1 one place before the diagonal; the k equations for the rightmost points are missing the -1 one place after the diagonal; and the k equations for top points and the k equations for bottom points are missing the -1 k places before or k places after the diagonal, respectively. It's not too hard to write code for each processor to generate its own n/p rows of this matrix, checking the special cases, but you'll want to debug carefully and print lots of examples to make sure you're getting the right matrix. You can assume that k is divisible by p if that makes it easier. Also, you're welcome to come talk to us about various ways of getting sample data for this project.
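For what it's worth, here is a sketch of how each processor might generate its own block of rows of the model problem in the CSR structure above. It numbers rows from 0 (the prose above is 1-based) and assumes p divides n = k^2, which holds whenever p divides k; the function and variable names are mine, not part of any handout.

    #include <stdlib.h>

    /* Generate this processor's n/p consecutive rows of the k-by-k
       model problem, in CSR form (csr_matrix as sketched earlier). */
    csr_matrix generate_model_rows(int k, int rank, int p) {
        int n = k * k;
        int nrows = n / p;               /* assumes p divides n = k*k  */
        int first = rank * nrows;        /* first global row I own     */
        csr_matrix A;
        A.n = nrows;
        A.rowstart = malloc((nrows + 1) * sizeof(int));
        A.colindex = malloc(5 * nrows * sizeof(int));  /* <= 5 nonzeros per row */
        A.value    = malloc(5 * nrows * sizeof(double));
        int nnz = 0;
        for (int r = 0; r < nrows; r++) {
            int i = first + r;           /* global row number          */
            A.rowstart[r] = nnz;
            if (i >= k)         { A.colindex[nnz] = i - k; A.value[nnz++] = -1.0; } /* not in first grid row */
            if (i % k != 0)     { A.colindex[nnz] = i - 1; A.value[nnz++] = -1.0; } /* not leftmost          */
            A.colindex[nnz] = i;  A.value[nnz++] = 4.0;                             /* diagonal              */
            if (i % k != k - 1) { A.colindex[nnz] = i + 1; A.value[nnz++] = -1.0; } /* not rightmost         */
            if (i < n - k)      { A.colindex[nnz] = i + k; A.value[nnz++] = -1.0; } /* not in last grid row  */
        }
        A.rowstart[nrows] = nnz;
        A.nnz = nnz;
        return A;
    }

A quick sanity check when you print examples: the four corner rows should have 3 nonzeros, the other edge rows 4, and interior rows 5.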
The data distribution and communication for this problem are up to you. You will probably want to think about not communicating more of the vector than necessary in the matvec. You may want to experiment with rearranging the rows and columns of the matrix to reduce the amount of communication; for fairly small matrices this is unlikely to help, but it might pay off for the biggest matrices you can run. You should be able to do pretty large problems on a parallel machine; on my laptop, the sequential Matlab code takes about 2 minutes for a 1,000,000-by-1,000,000 matrix.
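One part of the communication is the same no matter how you distribute the data: the two dot products in each iteration each need a global reduction across all the processors. Here is a minimal sketch with MPI; MPI_Allreduce is the standard MPI call for this, and the function name parallel_dot is my own.

    #include <mpi.h>

    /* Dot product of two distributed vectors: each processor sums
       over its own nlocal elements, then all processors combine. */
    double parallel_dot(int nlocal, const double *u, const double *v) {
        double local = 0.0, global;
        for (int i = 0; i < nlocal; i++) local += u[i] * v[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }

The matvec is the harder part: each processor needs just the entries of the vector that its rows' nonzero columns touch, which for the model problem distributed by row blocks means exchanging k boundary values with each neighboring processor.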
For this option, you'll run on the processors of a single node, using shared memory. I recommend writing your program in Cilk, but you may also explore using OpenMP, Intel Threading Building Blocks, or another C-based shared-memory language.
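To give a feel for what this looks like, here is a sketch of the shared-memory matvec: since every row of the output is independent, the outer loop over rows parallelizes directly. This version uses Cilk's cilk_for and reuses the csr_matrix struct sketched in Option 1; an OpenMP version would instead put "#pragma omp parallel for" on the same loop.

    #include <cilk/cilk.h>

    /* Parallel y = A*x over CSR rows (csr_matrix as sketched earlier).
       Each iteration writes only y[i], so the rows are independent. */
    void csr_matvec_par(const csr_matrix *A, const double *x, double *y) {
        cilk_for (int i = 0; i < A->n; i++) {
            double sum = 0.0;
            for (int k = A->rowstart[i]; k < A->rowstart[i+1]; k++)
                sum += A->value[k] * x[A->colindex[k]];
            y[i] = sum;
        }
    }

Note that the parallelism here is only as balanced as the rows are; see the point about irregular row lengths below.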
In this option, the original data is actually an undirected graph. The matrix is the so-called Laplacian matrix of the graph, which is constructed as follows. Suppose the graph has n vertices, numbered 1 through n. The Laplacian matrix A has the vertex degrees on the diagonal, so A(i,i) is the number of edges incident on vertex i. The off-diagonal elements are all either -1 or 0, and show where the edges of the graph are: If (i,j) is an edge of the graph, then A(i,j) = A(j,i) = -1; otherwise, A(i,j) is zero. Here's a paper with a lot of good information (more than you'll need) on graph Laplacians.
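If you build the Laplacian yourself from an edge list, here is one way it might look, again in the CSR structure from Option 1. This is a sketch under my own naming; it numbers vertices from 0 (the text above uses 1 through n) and assumes a simple graph with no self-loops or duplicate edges.

    #include <stdlib.h>

    /* Build the graph Laplacian in CSR form from an edge list:
       A(i,i) = degree of vertex i, A(i,j) = A(j,i) = -1 for each edge. */
    csr_matrix laplacian_from_edges(int n, int nedges, const int (*edge)[2]) {
        int *deg = calloc(n, sizeof(int));
        for (int e = 0; e < nedges; e++) { deg[edge[e][0]]++; deg[edge[e][1]]++; }

        csr_matrix A;
        A.n = n;
        A.nnz = n + 2 * nedges;      /* one diagonal per row, two -1's per edge */
        A.rowstart = malloc((n + 1) * sizeof(int));
        A.colindex = malloc(A.nnz * sizeof(int));
        A.value    = malloc(A.nnz * sizeof(double));
        A.rowstart[0] = 0;
        for (int i = 0; i < n; i++) A.rowstart[i+1] = A.rowstart[i] + deg[i] + 1;

        /* Put each row's diagonal first, then append the -1's edge by edge. */
        int *next = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) {
            A.colindex[A.rowstart[i]] = i;
            A.value[A.rowstart[i]] = (double) deg[i];
            next[i] = A.rowstart[i] + 1;
        }
        for (int e = 0; e < nedges; e++) {
            int i = edge[e][0], j = edge[e][1];
            A.colindex[next[i]] = j; A.value[next[i]++] = -1.0;
            A.colindex[next[j]] = i; A.value[next[j]++] = -1.0;
        }
        free(deg); free(next);
        return A;
    }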
There are a number of differences between this version of CG and the one in Option 1. One is, of course, that the whole matrix lives in a single data structure. That makes things easier to program, but maybe harder to get good performance. An alternative to the Compressed Sparse Row data structure is a Compressed Sparse Blocks data structure, as described in this paper, with one implementation here.
Another difference in this version is that the matrix will be very irregularly structured; different rows will have very different numbers of nonzeros in them, depending entirely on what the input graph looks like.
A third difference is a mathematical detail that you'll have to handle in your code: the Laplacian matrix is actually singular (because its rows all sum to zero). Therefore you can only solve Ax = b if the vector b is in the range of the matrix, which means that the elements of b also have to add up to zero. In CG, all the intermediate vectors (the x's, r's, and d's) need to add up to zero too. Floating-point round-off error can make them creep away from zero, so you'll need to correct them as you compute them. The Matlab statement x = x - sum(x)/n projects the vector x orthogonally onto the subspace of vectors whose elements sum to zero; you'll need to do this operation in your parallel code.
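In shared memory this projection is just a reduction to get the sum, followed by a vector update. Here is a minimal sketch in C with OpenMP (one of the shared-memory options mentioned above); the function name is mine.

    #include <omp.h>

    /* x = x - sum(x)/n: project x onto the zero-sum vectors. */
    void project_zero_sum(int n, double *x) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) sum += x[i];
        double mean = sum / n;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) x[i] -= mean;
    }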
There are various ways to get sample graphs to try out your code on, including the RMAT generator and the graphs in the Florida Sparse Matrix collection. To keep things simpler numerically, you should use only connected graphs; if you have a graph that isn't connected, you can (offline) find its largest connected component and just use that to create a Laplacian matrix.
There's a contest called the Laplacian World Championships for solving linear systems on Laplacian matrices. Entries will be due about a year from now, so if you want to enter you have plenty of time (but you have to finish your final project by the end of March!). There's a talk on the Laplacian World Championships here.