240A Winter 2016 HW2 GPU and SPMD programming

Update on 2/19: added the number of communication steps in Q2.
Update on 2/20: added the compilation instructions and updated the number reporting.

This exercise is intended to help you understand GPU programming and to design an implementation of global reduction across MPI processes.

Q1: Running Sample GPU code at Comet

What to do

  • Run the sample dotprod.cu and report the problem size, the execution time, and the megaflop rate achieved.

  • Explain how many GPU threads are used during the execution of this program, and explain the values of the partial_c array in the main function after the dot() function has been executed by the GPU threads. Give a formula for the value of partial_c[i], where 0 <= i < blocksPerGrid (see the sketch after this list).
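
For reference, here is a minimal sketch of the kernel structure in question, written from the commonly distributed CUDA-by-Example version of dotprod.cu; the names (N, threadsPerBlock, blocksPerGrid, cache, partial_c) and the simplified managed-memory host driver are assumptions and may differ from the exact file provided on Comet.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch of the dotprod.cu structure: each block writes one partial sum.
    const int N = 1 << 20;
    const int threadsPerBlock = 256;
    const int blocksPerGrid = 32;

    __global__ void dot(const float *a, const float *b, float *partial_c) {
        __shared__ float cache[threadsPerBlock];
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int cacheIndex = threadIdx.x;

        // Each thread accumulates a strided slice of the element-wise products;
        // blocksPerGrid * threadsPerBlock GPU threads are launched in total.
        float temp = 0.0f;
        while (tid < N) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[cacheIndex] = temp;
        __syncthreads();

        // Tree summation within the block: the number of active threads is
        // halved at every step, i.e. log2(threadsPerBlock) steps.
        for (int i = blockDim.x / 2; i > 0; i /= 2) {
            if (cacheIndex < i)
                cache[cacheIndex] += cache[cacheIndex + i];
            __syncthreads();
        }

        // Thread 0 of the block writes the block's partial sum, so
        // partial_c[blockIdx.x] is the sum of a[t]*b[t] over all indices t
        // that were assigned to this block.
        if (cacheIndex == 0)
            partial_c[blockIdx.x] = cache[0];
    }

    int main() {
        float *a, *b, *partial_c;
        cudaMallocManaged(&a, N * sizeof(float));
        cudaMallocManaged(&b, N * sizeof(float));
        cudaMallocManaged(&partial_c, blocksPerGrid * sizeof(float));
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        dot<<<blocksPerGrid, threadsPerBlock>>>(a, b, partial_c);
        cudaDeviceSynchronize();

        // Final reduction of the blocksPerGrid partial sums on the CPU.
        float c = 0.0f;
        for (int i = 0; i < blocksPerGrid; i++) c += partial_c[i];
        printf("dot product = %f (expected %f)\n", c, 2.0f * N);

        cudaFree(a); cudaFree(b); cudaFree(partial_c);
        return 0;
    }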

Q2: SPMD code to implement MPI reduction function

Q2.1 Write pseudo code to implement the following MPI reduce primitive using send/receive operations on a tree-based summation structure, similar to the one used in the dotprod.cu GPU code above. The MPI lecture slides also illustrate such a structure. Explain the number of parallel communication steps required to complete this global reduction.

    
    MPI_Reduce(
        void* send_data,
        void* recv_data,
        int count,
        MPI_Datatype datatype,
        MPI_Op op,
        int root,
        MPI_Comm communicator)
    
You can ignore the MPI communicator and assume that the operator op is global summation, count is 1, the datatype is an integer, the number of processes is a power of 2, and root is 0. Only send/receive operations can be used for inter-process communication.
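
For reference, here is a minimal sketch (in C with MPI) of the recursive-halving communication pattern that such a tree-based summation can follow under the stated assumptions (count is 1, integer data, root 0, a power-of-two number of processes, and only send/receive operations); it finishes in log2(P) parallel communication steps. The local value each process contributes is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: tree-based (recursive-halving) integer sum reduction to rank 0,
       assuming the number of processes P is a power of two. */
    int main(int argc, char *argv[]) {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int sum = rank + 1;    /* illustrative local contribution */

        /* log2(P) parallel steps: at step s, every rank that is a multiple of
           2*s receives a partial sum from rank + s and accumulates it; the
           sending rank is then finished. */
        for (int s = 1; s < P; s *= 2) {
            if (rank % (2 * s) == 0) {
                int recv_val;
                MPI_Recv(&recv_val, 1, MPI_INT, rank + s, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += recv_val;
            } else if (rank % (2 * s) == s) {
                MPI_Send(&sum, 1, MPI_INT, rank - s, 0, MPI_COMM_WORLD);
                break;         /* this rank has handed off its partial sum */
            }
        }

        if (rank == 0)
            printf("global sum = %d\n", sum);   /* expected P*(P+1)/2 here */

        MPI_Finalize();
        return 0;
    }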

Q2.2: Global reduction with a topology constraint. We change the requirement of Q2.1 so that the underlying network connecting the machines is a 2^k * 2^k mesh. MPI process n runs on the machine numbered (i, j) in this mesh, where n = i*2^k + j. Notice that there is a total of 2^(2k) machines.

You are only allowed to send a message between two machines that are directly connected in the above mesh. Write pseudo SPMD code to implement MPI_Reduce under such a topology requirement so that the number of parallel communication steps required is within 2*2^k.
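
One arrangement that meets this bound: partial sums first flow along each row toward column 0 (all rows in parallel), and then along column 0 toward machine (0,0), for (2^k - 1) + (2^k - 1) parallel communication steps, which is within 2*2^k. The sketch below (C with MPI, in the same style as the Q2.1 sketch) assumes P = 2^(2k) processes with rank n placed on machine (i, j) where n = i*2^k + j, uses only nearest-neighbor sends, and again uses an illustrative local value; the name SIDE (for 2^k) is an assumption.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: sum reduction to rank 0 on a 2^k x 2^k mesh, where rank n runs
       on machine (i, j) with n = i*SIDE + j and messages may only travel
       between mesh neighbors.  Phase 1 sums each row into column 0
       (SIDE-1 steps, all rows in parallel); phase 2 sums column 0 into
       machine (0,0) (another SIDE-1 steps). */
    int main(int argc, char *argv[]) {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int SIDE = 1;
        while (SIDE * SIDE < P) SIDE *= 2;      /* SIDE = 2^k when P = 2^(2k) */

        int i = rank / SIDE, j = rank % SIDE;   /* mesh coordinates (i, j)    */
        int sum = rank + 1;                     /* illustrative local value   */
        int tmp;

        /* Phase 1: accumulate right-to-left along each row into column 0. */
        if (j < SIDE - 1) {                     /* receive from right neighbor */
            MPI_Recv(&tmp, 1, MPI_INT, i * SIDE + (j + 1), 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += tmp;
        }
        if (j > 0)                              /* forward to left neighbor    */
            MPI_Send(&sum, 1, MPI_INT, i * SIDE + (j - 1), 0, MPI_COMM_WORLD);

        /* Phase 2: accumulate bottom-to-top along column 0 into machine (0,0). */
        if (j == 0) {
            if (i < SIDE - 1) {                 /* receive from neighbor below */
                MPI_Recv(&tmp, 1, MPI_INT, (i + 1) * SIDE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += tmp;
            }
            if (i > 0)                          /* forward to neighbor above   */
                MPI_Send(&sum, 1, MPI_INT, (i - 1) * SIDE, 0, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("global sum = %d\n", sum);   /* expected P*(P+1)/2 here */

        MPI_Finalize();
        return 0;
    }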

Additional reference