For your MPI implementation (which should run either on a single Triton node or across multiple nodes), you should experiment with different data layouts and configurations of ghost cells to find which is most efficient for large boards. You should report on the scaling of your MPI code both when running only on the cores of a single Triton node (where you can compare it directly to Cilk++) and when running larger boards on more cores across multiple Triton nodes.
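One simple way to collect scaling numbers is to time the generation loop with MPI_Wtime and report the maximum time over all ranks. Here is a rough sketch; the routine run_generations is a stand-in for your own update loop, not something provided:

    // Sketch: timing the generation loop for scaling experiments.
    #include <mpi.h>
    #include <cstdio>

    void run_generations(int steps);   // stand-in for your own Life update loop

    void time_generations(int steps) {
        MPI_Barrier(MPI_COMM_WORLD);                 // start all ranks together
        double t0 = MPI_Wtime();
        run_generations(steps);
        double elapsed = MPI_Wtime() - t0;

        double slowest;                              // the slowest rank sets the wall-clock time
        MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("%d generations took %.3f seconds\n", steps, slowest);
    }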
For Cilk++ (on the cores of a single Triton node), the efficiency issues are more subtle, since the role of data locality is complicated when all the processors share the same main memory. Do whatever experiments and comparisons you can, and write a report giving your conclusions.
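For reference, the heart of the Cilk++ version is just a parallel loop over the rows of the board. Something along the following lines would do; the storage layout and the helper count_neighbors (sketched below, with the discussion of the torus) are your choice, not a required interface:

    // Sketch: one generation in Cilk++, parallelized over rows.
    // With Intel Cilk Plus you would #include <cilk/cilk.h>; with the
    // original Cilk++ compiler cilk_for is a built-in keyword.
    int count_neighbors(const char* board, int n, int i, int j);   // wraparound count, sketched below

    void step(const char* cur, char* next, int n) {
        cilk_for (int i = 0; i < n; ++i) {           // rows can be updated independently
            for (int j = 0; j < n; ++j) {
                int alive = count_neighbors(cur, n, i, j);
                next[i*n + j] = cur[i*n + j] ? (alive == 2 || alive == 3)
                                             : (alive == 3);
            }
        }
    }

One locality effect worth measuring: with a row-major layout, each Cilk++ worker reads and writes contiguous stretches of memory, so the way rows are divided among workers interacts with cache behavior.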
As originally defined, Life takes place on an infinite two-dimensional grid of squares, each of which is either empty or occupied by an organism. Each grid square has eight neighbors: two horizontal, two vertical, and four diagonal. Time moves in discrete steps called "generations". At each generation, organisms are born, live, and die according to the following rules.
One immediate question is how to simulate an infinite grid of cells with a finite computer. For this project you will use a finite n-by-n array of cells, with n as large as possible, and you will wrap the grid around at the top, bottom, and sides, forming a torus. That is, you will consider the rightmost grid squares to have the leftmost grid squares as their right-hand neighbors, and similarly the top squares to be neighbors of the bottom squares. In the lingo of partial differential equations, you are imposing "periodic boundary conditions".
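To make the wraparound concrete, here is one way to count a cell's live neighbors on the torus. This is a sketch only; the row-major layout and the function name are a choice, not a requirement:

    // Sketch: counting live neighbors on an n-by-n torus.
    // board is stored row-major, one char per cell, 1 = occupied.
    int count_neighbors(const char* board, int n, int i, int j) {
        int alive = 0;
        for (int di = -1; di <= 1; ++di) {
            for (int dj = -1; dj <= 1; ++dj) {
                if (di == 0 && dj == 0) continue;    // a cell is not its own neighbor
                int ii = (i + di + n) % n;           // wrap top and bottom
                int jj = (j + dj + n) % n;           // wrap left and right
                alive += board[ii*n + jj];
            }
        }
        return alive;
    }

Taking a remainder for every cell is the clearest way to express the wraparound, but not the fastest; keeping copies of the boundary rows and columns, as the ghost-cell scheme described below does, avoids the modulus in the inner loop.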
Life is pretty simple to write as a sequential program; there's a Matlab code here. (The Matlab code includes a harness with a data generator and validator; you should use it to check the correctness of your code.)
Making Life run efficiently in distributed memory in MPI is, not surprisingly, a matter of two things: data distribution and communication. For data distribution, you will probably first distribute the array of cells across the processors by rows. At least in theory, a two-dimensional block distribution might do less communication for large sizes, since each processor then exchanges cells along a shorter boundary relative to the area it owns. You will probably want to do experiments to see whether this is true in practice.
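For the row distribution, a common starting point is to give each rank a contiguous block of rows, with the first few ranks holding one extra row when the number of ranks does not divide n. A sketch (the names are a suggestion only):

    // Sketch: block-of-rows ownership for an n-row board on p ranks.
    // Rank r owns global rows [row_begin, row_end).
    void my_rows(int n, int p, int r, int* row_begin, int* row_end) {
        int base  = n / p;
        int extra = n % p;                           // the first `extra` ranks get one more row
        *row_begin = r * base + (r < extra ? r : extra);
        *row_end   = *row_begin + base + (r < extra ? 1 : 0);
    }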
For communication, you may want to consider so-called "ghost cells", in which the part of the array assigned to each processor includes a copy of the first layer of cells assigned to each adjacent processor.
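With a row distribution and one layer of ghost rows above and below the local block, the exchange at each generation can be written as a pair of MPI_Sendrecv calls. Here is a rough sketch; it assumes the local array holds local_n real rows in rows 1 through local_n, with ghost rows 0 and local_n+1, and those conventions are a choice, not a requirement:

    // Sketch: exchanging one layer of ghost rows around the torus.
    // local holds (local_n + 2) rows of n chars each, row-major;
    // rows 0 and local_n+1 are the ghost copies.
    #include <mpi.h>

    void exchange_ghosts(char* local, int local_n, int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int up   = (rank - 1 + size) % size;         // the torus wraps rank 0 around to the last rank
        int down = (rank + 1) % size;

        // Send my first real row up; receive the row below my block into the bottom ghost row.
        MPI_Sendrecv(&local[1 * n],             n, MPI_CHAR, up,   0,
                     &local[(local_n + 1) * n], n, MPI_CHAR, down, 0,
                     comm, MPI_STATUS_IGNORE);
        // Send my last real row down; receive the row above my block into the top ghost row.
        MPI_Sendrecv(&local[local_n * n],       n, MPI_CHAR, down, 1,
                     &local[0],                 n, MPI_CHAR, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }

With this in place, each rank can update all of its real rows using only local data; a two-dimensional block distribution needs the analogous exchange on all four sides (plus the corners).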
A more sophisticated version of Life might exploit the sparsity of the array, avoiding computation and storage for at least some of the large areas of empty cells. You can also imagine doing load-balancing dynamically, where every now and then you redistribute the data so that every processor is doing about the same amount of work. (This is tricky because it takes a lot of expensive communication to redistribute the data.) You don't have to experiment with sparsity and load balancing for this project, but you may if you're looking for something extra to do.