240A Winter 2016 HW2 GPU and SPMD programming

Update on 2/19: added the number of communication steps in Q2.
Update on 2/20: added the compilation instructions and updated the number reporting.

This exercise is intended to help you understand GPU programming and to design an implementation of global reduction across MPI processes.

Q1: Running Sample GPU code at Comet

What to do

  • Run the sample dotprod.cu and report the problem size, the execution time, and the megaflop rate achieved.

  • Explain how many GPU threads are used during the execution of this program, and explain the values of the partial_c array in the main function after the dot() function has been executed by the GPU threads. Give a formula for the value of partial_c[i], where 0 <= i < blocksPerGrid (see the sketch after this list).
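
For reference, here is a minimal sketch of the kernel structure in question, written from the commonly distributed CUDA-by-Example version of dotprod.cu; the names (N, threadsPerBlock, blocksPerGrid, cache, partial_c) and the simplified managed-memory host driver are assumptions and may differ from the exact file provided on Comet.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch of the dotprod.cu structure: each block writes one partial sum.
    const int N = 1 << 20;
    const int threadsPerBlock = 256;
    const int blocksPerGrid = 32;

    __global__ void dot(const float *a, const float *b, float *partial_c) {
        __shared__ float cache[threadsPerBlock];
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int cacheIndex = threadIdx.x;

        // Each thread accumulates a strided slice of the element-wise products;
        // blocksPerGrid * threadsPerBlock GPU threads are launched in total.
        float temp = 0.0f;
        while (tid < N) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[cacheIndex] = temp;
        __syncthreads();

        // Tree summation within the block: the number of active threads is
        // halved at every step, i.e. log2(threadsPerBlock) steps.
        for (int i = blockDim.x / 2; i > 0; i /= 2) {
            if (cacheIndex < i)
                cache[cacheIndex] += cache[cacheIndex + i];
            __syncthreads();
        }

        // Thread 0 of the block writes the block's partial sum, so
        // partial_c[blockIdx.x] is the sum of a[t]*b[t] over all indices t
        // that were assigned to this block.
        if (cacheIndex == 0)
            partial_c[blockIdx.x] = cache[0];
    }

    int main() {
        float *a, *b, *partial_c;
        cudaMallocManaged(&a, N * sizeof(float));
        cudaMallocManaged(&b, N * sizeof(float));
        cudaMallocManaged(&partial_c, blocksPerGrid * sizeof(float));
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        dot<<<blocksPerGrid, threadsPerBlock>>>(a, b, partial_c);
        cudaDeviceSynchronize();

        // Final reduction of the blocksPerGrid partial sums on the CPU.
        float c = 0.0f;
        for (int i = 0; i < blocksPerGrid; i++) c += partial_c[i];
        printf("dot product = %f (expected %f)\n", c, 2.0f * N);

        cudaFree(a); cudaFree(b); cudaFree(partial_c);
        return 0;
    }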

Q2: SPMD code to implement MPI reduction function

Q2.1 Write pseudo code to implement the following MPI reduce primitive using send/receive operations on a tree-based summation structure, similar to the one used in the dotprod.cu GPU code above. The MPI lecture slides also illustrate such a structure. Explain the number of parallel communication steps required to complete this global reduction.

    
    MPI_Reduce(
        void* send_data,
        void* recv_data,
        int count,
        MPI_Datatype datatype,
        MPI_Op op,
        int root,
        MPI_Comm communicator)
    
You can ignore the MPI communicator and assume that the operator op is global summation, count is 1, the datatype is an integer, the number of processes is a power of 2, and root is 0. Only send/receive operations can be used for inter-process communication.
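
For reference, here is a minimal sketch (in C with MPI) of the recursive-halving communication pattern that such a tree-based summation can follow under the stated assumptions (count is 1, integer data, root 0, a power-of-two number of processes, and only send/receive operations); it finishes in log2(P) parallel communication steps. The local value each process contributes is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: tree-based (recursive-halving) integer sum reduction to rank 0,
       assuming the number of processes P is a power of two. */
    int main(int argc, char *argv[]) {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int sum = rank + 1;    /* illustrative local contribution */

        /* log2(P) parallel steps: at step s, every rank that is a multiple of
           2*s receives a partial sum from rank + s and accumulates it; the
           sending rank is then finished. */
        for (int s = 1; s < P; s *= 2) {
            if (rank % (2 * s) == 0) {
                int recv_val;
                MPI_Recv(&recv_val, 1, MPI_INT, rank + s, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += recv_val;
            } else if (rank % (2 * s) == s) {
                MPI_Send(&sum, 1, MPI_INT, rank - s, 0, MPI_COMM_WORLD);
                break;         /* this rank has handed off its partial sum */
            }
        }

        if (rank == 0)
            printf("global sum = %d\n", sum);   /* expected P*(P+1)/2 here */

        MPI_Finalize();
        return 0;
    }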

Q2.2: Global reduction with a topology constraint. We change the requirement of Q2.1 so that the underlying network connecting the machines is a 2^k * 2^k mesh. MPI process n runs on the machine numbered (i, j) in this mesh, where n = i*2^k + j. Notice that there is a total of 2^(2k) machines.

You are only allowed to send a message between two machines that are directly connected in the above mesh. Write pseudo SPMD code to implement MPI_Reduce under such a topology requirement so that the number of parallel communication steps required is within 2*2^k.
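
One arrangement that meets this bound: partial sums first flow along each row toward column 0 (all rows in parallel), and then along column 0 toward machine (0,0), for (2^k - 1) + (2^k - 1) parallel communication steps, which is within 2*2^k. The sketch below (C with MPI, in the same style as the Q2.1 sketch) assumes P = 2^(2k) processes with rank n placed on machine (i, j) where n = i*2^k + j, uses only nearest-neighbor sends, and again uses an illustrative local value; the name SIDE (for 2^k) is an assumption.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: sum reduction to rank 0 on a 2^k x 2^k mesh, where rank n runs
       on machine (i, j) with n = i*SIDE + j and messages may only travel
       between mesh neighbors.  Phase 1 sums each row into column 0
       (SIDE-1 steps, all rows in parallel); phase 2 sums column 0 into
       machine (0,0) (another SIDE-1 steps). */
    int main(int argc, char *argv[]) {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int SIDE = 1;
        while (SIDE * SIDE < P) SIDE *= 2;      /* SIDE = 2^k when P = 2^(2k) */

        int i = rank / SIDE, j = rank % SIDE;   /* mesh coordinates (i, j)    */
        int sum = rank + 1;                     /* illustrative local value   */
        int tmp;

        /* Phase 1: accumulate right-to-left along each row into column 0. */
        if (j < SIDE - 1) {                     /* receive from right neighbor */
            MPI_Recv(&tmp, 1, MPI_INT, i * SIDE + (j + 1), 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += tmp;
        }
        if (j > 0)                              /* forward to left neighbor    */
            MPI_Send(&sum, 1, MPI_INT, i * SIDE + (j - 1), 0, MPI_COMM_WORLD);

        /* Phase 2: accumulate bottom-to-top along column 0 into machine (0,0). */
        if (j == 0) {
            if (i < SIDE - 1) {                 /* receive from neighbor below */
                MPI_Recv(&tmp, 1, MPI_INT, (i + 1) * SIDE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += tmp;
            }
            if (i > 0)                          /* forward to neighbor above   */
                MPI_Send(&sum, 1, MPI_INT, (i - 1) * SIDE, 0, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("global sum = %d\n", sum);   /* expected P*(P+1)/2 here */

        MPI_Finalize();
        return 0;
    }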

Additional reference