This exercise is intended to help you understand GPU programming and to design an implementation of global reduction across MPI processes.
Q1: Running Sample GPU code at Comet
module load cuda
make

To run, submit the execution job through sbatch:
make submitdotprod
What to do
Examine partial_c in the main function after the dot() function has been executed by GPU threads. List a formula for the value of partial_c[i], where 0 <= i < blocksPerGrid.
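For context, dot-product kernels of this kind typically follow the structure of the dot-product example in CUDA by Example, where each thread block writes one partial sum into partial_c. The sketch below is only an assumed reconstruction of that structure; the constants N and threadsPerBlock are illustrative and are not taken from the course's dotprod.cu.

// Sketch of a standard per-block dot-product reduction (assumed structure;
// the course's dotprod.cu may differ in names and sizes).
#define N               (32 * 1024)     // assumed vector length
#define threadsPerBlock 256             // assumed block size

__global__ void dot(float *a, float *b, float *partial_c) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread accumulates a grid-strided subset of the products a[i]*b[i].
    float temp = 0.0f;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid  += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();

    // Tree-based reduction within the block: halve the active threads each step.
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }

    // Thread 0 of each block writes that block's partial sum.
    if (threadIdx.x == 0)
        partial_c[blockIdx.x] = cache[0];
}

Under this assumed structure, each entry partial_c[i] holds the sum of the products a[j]*b[j] over the indices j assigned to block i; Q1 asks you to state that formula precisely for the actual dotprod.cu.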
Q2: SPMD code to implement the MPI reduction function
Q2.1: Write pseudocode to implement the following MPI reduce primitive using send/receive operations, based on a tree-based summation structure similar to the one used in the dotprod.cu GPU code above. The MPI lecture slides also illustrate such a structure. Explain the number of parallel communication steps required to complete this global reduction.
MPI_Reduce(
    void*        send_data,
    void*        recv_data,
    int          count,
    MPI_Datatype datatype,
    MPI_Op       op,
    int          root,
    MPI_Comm     communicator)
You can ignore the MPI communicator and assume that the operator op is global summation, count is 1, the datatype is integer, the number of processes is a power of 2, and the root is 0. Only send/receive operations can be used for process communication.
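For reference only (this is not the required answer), below is a minimal SPMD sketch of one possible binomial-tree summation under these assumptions, using only MPI_Send/MPI_Recv; the helper name tree_reduce_sum is hypothetical.

#include <mpi.h>

/* Minimal sketch: tree-based summation to rank 0 using only send/receive.
   Assumes the number of processes P is a power of 2, count == 1, op == sum. */
int tree_reduce_sum(int my_value, MPI_Comm comm) {
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    int sum = my_value;
    /* In the step with distance "stride", ranks that are multiples of 2*stride
       receive from the partner at rank+stride; the partner sends and drops out. */
    for (int stride = 1; stride < P; stride *= 2) {
        if (rank % (2 * stride) == 0) {
            int partner_value;
            MPI_Recv(&partner_value, 1, MPI_INT, rank + stride, 0,
                     comm, MPI_STATUS_IGNORE);
            sum += partner_value;
        } else if (rank % (2 * stride) == stride) {
            MPI_Send(&sum, 1, MPI_INT, rank - stride, 0, comm);
            break;  /* this rank has passed its partial sum up the tree */
        }
    }
    return sum;  /* the global sum is meaningful only at rank 0 (the root) */
}

With P processes, this pattern completes in log2(P) parallel communication steps, mirroring the depth of the shared-memory reduction tree in the dotprod.cu kernel.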
Q2.2: Global reduction with a topology constraint. We change the requirements of Q2.1 so that (1) the underlying network connecting the machines is a 2^k * 2^k mesh, and (2) a message may only be sent between two machines that are directly connected to each other in this mesh. MPI process ID n runs on the machine numbered (i, j) in the mesh, where n = i*2^k + j; notice there is a total of 2^(2k) machines. Write pseudo SPMD code to implement MPI_Reduce under this topology requirement such that the number of parallel communication steps required is within 2*2^k.
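One possible approach that stays within the 2*2^k bound, sketched below purely as an illustration (the helper name mesh_reduce_sum and the parameter k are assumptions, not part of the assignment), is to first reduce every row into column 0 using nearest-neighbor messages and then reduce column 0 into machine (0,0).

#include <mpi.h>

/* Minimal sketch of mesh-constrained reduction to process 0 (machine (0,0)).
   Assumes a 2^k x 2^k mesh and process n on machine (i, j) with n = i*2^k + j;
   every message travels only between directly connected machines. */
int mesh_reduce_sum(int my_value, int k, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    int side = 1 << k;           /* mesh is side x side */
    int i = rank / side;         /* row index    */
    int j = rank % side;         /* column index */
    int sum = my_value;
    int tmp;

    /* Phase 1: every row reduces from right to left into column 0.
       Machine (i, j) first accumulates from its right neighbor (i, j+1),
       then forwards to its left neighbor (i, j-1).  At most 2^k - 1 steps. */
    if (j < side - 1) {
        MPI_Recv(&tmp, 1, MPI_INT, rank + 1, 0, comm, MPI_STATUS_IGNORE);
        sum += tmp;
    }
    if (j > 0) {
        MPI_Send(&sum, 1, MPI_INT, rank - 1, 0, comm);
        return sum;              /* machines outside column 0 are done */
    }

    /* Phase 2: column 0 reduces from bottom to top into machine (0,0).
       A same-column neighbor differs by "side" in rank.  At most 2^k - 1 steps. */
    if (i < side - 1) {
        MPI_Recv(&tmp, 1, MPI_INT, rank + side, 0, comm, MPI_STATUS_IGNORE);
        sum += tmp;
    }
    if (i > 0)
        MPI_Send(&sum, 1, MPI_INT, rank - side, 0, comm);

    return sum;                  /* the global sum is valid only at rank 0 */
}

Each phase needs at most 2^k - 1 nearest-neighbor communication steps, so the total is 2*(2^k - 1), which is within the required 2*2^k.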