For your MPI implementation (which should run either on a single Triton node or across multiple nodes), you should experiment with different data layouts and configurations of ghost cells to find which is most efficient for large boards. You should report on the scaling of your MPI code both when running only on the cores of a single Triton node (where you can compare it directly to Cilk++) and when running larger boards on more cores across multiple Triton nodes.
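One simple way to collect scaling numbers is to time the generation loop with MPI_Wtime and report the maximum time over all ranks. Here is a rough sketch; the routine run_generations is a stand-in for your own update loop, not something provided:

    // Sketch: timing the generation loop for scaling experiments.
    #include <mpi.h>
    #include <cstdio>

    void run_generations(int steps);   // stand-in for your own Life update loop

    void time_generations(int steps) {
        MPI_Barrier(MPI_COMM_WORLD);                 // start all ranks together
        double t0 = MPI_Wtime();
        run_generations(steps);
        double elapsed = MPI_Wtime() - t0;

        double slowest;                              // the slowest rank sets the wall-clock time
        MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("%d generations took %.3f seconds\n", steps, slowest);
    }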
For Cilk++ (on the cores of a single Triton node), the efficiency issues are more subtle, since the role of data locality is complicated when all the processors share the same main memory. Do whatever experiments and comparisons you can, and write a report giving your conclusions.
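For reference, the heart of the Cilk++ version is just a parallel loop over the rows of the board. Something along the following lines would do; the storage layout and the helper count_neighbors (sketched below, with the discussion of the torus) are your choice, not a required interface:

    // Sketch: one generation in Cilk++, parallelized over rows.
    // With Intel Cilk Plus you would #include <cilk/cilk.h>; with the
    // original Cilk++ compiler cilk_for is a built-in keyword.
    int count_neighbors(const char* board, int n, int i, int j);   // wraparound count, sketched below

    void step(const char* cur, char* next, int n) {
        cilk_for (int i = 0; i < n; ++i) {           // rows can be updated independently
            for (int j = 0; j < n; ++j) {
                int alive = count_neighbors(cur, n, i, j);
                next[i*n + j] = cur[i*n + j] ? (alive == 2 || alive == 3)
                                             : (alive == 3);
            }
        }
    }

One locality effect worth measuring: with a row-major layout, each Cilk++ worker reads and writes contiguous stretches of memory, so the way rows are divided among workers interacts with cache behavior.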
As originally defined, Life takes place on an infinite two-dimensional grid of squares, each of which is either empty or occupied by an organism. Each grid square has eight neighbors: two horizontal, two vertical, and four diagonal. Time moves in discrete steps called "generations". At each generation, organisms are born, live, and die according to the following rules.
One immediate question is how to simulate an infinite grid of cells with a finite computer. For this project you will use a finite n-by-n array of cells, with n as large as possible, and you will wrap the grid around at the top, bottom, and sides, forming a torus. That is, you will consider the rightmost grid squares to have the leftmost grid squares as their right-hand neighbors, and similarly the top squares to be neighbors of the bottom squares. In the lingo of partial differential equations, you are imposing "periodic boundary conditions".
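To make the wraparound concrete, here is one way to count a cell's live neighbors on the torus. This is a sketch only; the row-major layout and the function name are a choice, not a requirement:

    // Sketch: counting live neighbors on an n-by-n torus.
    // board is stored row-major, one char per cell, 1 = occupied.
    int count_neighbors(const char* board, int n, int i, int j) {
        int alive = 0;
        for (int di = -1; di <= 1; ++di) {
            for (int dj = -1; dj <= 1; ++dj) {
                if (di == 0 && dj == 0) continue;    // a cell is not its own neighbor
                int ii = (i + di + n) % n;           // wrap top and bottom
                int jj = (j + dj + n) % n;           // wrap left and right
                alive += board[ii*n + jj];
            }
        }
        return alive;
    }

Taking a remainder for every cell is the clearest way to express the wraparound, but not the fastest; keeping copies of the boundary rows and columns, as the ghost-cell scheme described below does, avoids the modulus in the inner loop.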
Life is pretty simple to write as a sequential program; there's a Matlab code here. (The Matlab code includes a harness with a data generator and validator; you should use it to check the correctness of your code.)
Making Life run efficiently in distributed memory in MPI is, not surprisingly, a matter of two things: data distribution and communication. For data distribution, you will probably first distribute the array of cells across the processors by rows. At least in theory, a two-dimensional block distribution might do less communication for large sizes, since each processor then exchanges cells along a shorter boundary relative to the area it owns. You will probably want to do experiments to see whether this is true in practice.
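For the row distribution, a common starting point is to give each rank a contiguous block of rows, with the first few ranks holding one extra row when the number of ranks does not divide n. A sketch (the names are a suggestion only):

    // Sketch: block-of-rows ownership for an n-row board on p ranks.
    // Rank r owns global rows [row_begin, row_end).
    void my_rows(int n, int p, int r, int* row_begin, int* row_end) {
        int base  = n / p;
        int extra = n % p;                           // the first `extra` ranks get one more row
        *row_begin = r * base + (r < extra ? r : extra);
        *row_end   = *row_begin + base + (r < extra ? 1 : 0);
    }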
For communication, you may want to consider so-called "ghost cells", in which the part of the array assigned to each processor includes a copy of the first layer of cells assigned to each adjacent processor.
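With a row distribution and one layer of ghost rows above and below the local block, the exchange at each generation can be written as a pair of MPI_Sendrecv calls. Here is a rough sketch; it assumes the local array holds local_n real rows in rows 1 through local_n, with ghost rows 0 and local_n+1, and those conventions are a choice, not a requirement:

    // Sketch: exchanging one layer of ghost rows around the torus.
    // local holds (local_n + 2) rows of n chars each, row-major;
    // rows 0 and local_n+1 are the ghost copies.
    #include <mpi.h>

    void exchange_ghosts(char* local, int local_n, int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int up   = (rank - 1 + size) % size;         // the torus wraps rank 0 around to the last rank
        int down = (rank + 1) % size;

        // Send my first real row up; receive the row below my block into the bottom ghost row.
        MPI_Sendrecv(&local[1 * n],             n, MPI_CHAR, up,   0,
                     &local[(local_n + 1) * n], n, MPI_CHAR, down, 0,
                     comm, MPI_STATUS_IGNORE);
        // Send my last real row down; receive the row above my block into the top ghost row.
        MPI_Sendrecv(&local[local_n * n],       n, MPI_CHAR, down, 1,
                     &local[0],                 n, MPI_CHAR, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }

With this in place, each rank can update all of its real rows using only local data; a two-dimensional block distribution needs the analogous exchange on all four sides (plus the corners).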
A more sophisticated version of Life might exploit the sparsity of the array, avoiding computation and storage for at least some of the large areas of empty cells. You can also imagine doing load-balancing dynamically, where every now and then you redistribute the data so that every processor is doing about the same amount of work. (This is tricky because it takes a lot of expensive communication to redistribute the data.) You don't have to experiment with sparsity and load balancing for this project, but you may if you're looking for something extra to do.