240A Winter 2016 HW1: Parallel matrix multiplication
You will port and parallelize code for matrix multiplication, a basic
building block in many scientific computations. The most naive code to
multiply two n-by-n square matrices is:
    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                C[i,j] = C[i,j] + A[i,k] * B[k,j]
            end
        end
    end
Initialize A[i,j] = i+j and B[i,j] = i*j.
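As a concrete starting point, here is a minimal C++ sketch of the naive kernel
and the initialization above, assuming 0-based indices and the column-major
storage described below (entry (i,j) at offset i+j*n). The actual
dgemm-naive.cpp in the tar file may differ in details.

    // Naive triple-loop multiply: C = C + A*B, column-major storage.
    void square_dgemm(int n, double *A, double *B, double *C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i + j*n] += A[i + k*n] * B[k + j*n];
    }

    // Initialization from the assignment: A[i,j] = i+j, B[i,j] = i*j,
    // and C cleared to zero so the accumulation starts from nothing.
    void init(int n, double *A, double *B, double *C)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                A[i + j*n] = i + j;
                B[i + j*n] = double(i) * j;
                C[i + j*n] = 0.0;
            }
    }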
There are 3 options to implement the sequential code:
1) The naive approach listed above.
2) A submatrix partitioning (blocked version).
3) A BLAS3 dgemm library function.
The sample C/C++ code for the above 3 options, with timing and a test driver,
is available from this tar file. These 3 options implement the core function
void square_dgemm( int n, double *A, double *B, double *C )
The matrices are stored in column-major order, i.e. entry C(i,j) is stored at C[i+j*n].
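For option 2, the idea is to compute the product tile by tile so that the
working set fits in cache. The following is only a sketch: BLOCK is a
hypothetical tuning parameter, and the dgemm-blocked.cpp in the tar file may
organize its loops differently.

    #include <algorithm>

    // Hypothetical tile size; tune so three BLOCK x BLOCK tiles fit in cache.
    const int BLOCK = 64;

    // Blocked multiply: C = C + A*B, column-major storage.
    void square_dgemm(int n, double *A, double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int ii = 0; ii < n; ii += BLOCK)
                    // One tile pair: C(ii:,jj:) += A(ii:,kk:) * B(kk:,jj:)
                    for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                        for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                            double b = B[k + j*n];
                            for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                                C[i + j*n] += A[i + k*n] * b;
                        }
    }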
The tar file includes:
- dgemm-naive.cpp: a naive implementation of matrix multiply using three loops.
- dgemm-blocked.cpp: a simple blocked implementation of matrix multiply.
- dgemm-blas.cpp: a wrapper for the BLAS3 dgemm library function.
- benchmark.cpp: the driver program that measures the runtime and verifies correctness.
You can call BLAS dgemm() using the Intel MKL library on the Comet cluster. A small
include-file change for using dgemm() is here.
A sample makefile for linking MKL is here.
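For reference, a wrapper in the style of option 3 might look like the
following. This is a sketch assuming MKL's standard CBLAS interface (mkl.h and
cblas_dgemm); the dgemm-blas.cpp in the tar file is the authoritative version.

    #include <mkl.h>  // provides cblas_dgemm, MKL's CBLAS interface

    // Wrapper: C = C + A*B for n x n column-major matrices, via MKL.
    void square_dgemm(int n, double *A, double *B, double *C)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,   // alpha, A, leading dimension of A
                         B, n,   // B, leading dimension of B
                    1.0, C, n);  // beta, C, leading dimension of C
    }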
What to do
- Port the code to Comet with the Intel MKL library.
Report the megaflops numbers for the above 3 options with n = 100, 200, 400, 800, and 1600 on one core.
- Parallelize the naive sequential program using OpenMP (a sketch follows this list).
Report megaflops numbers, parallel time, and speedup for n=1600 with 4, 8, 16, 24 cores.
- Parallelize the naive program using MPI, with process 0 collecting the final
results from all processes (a sketch follows this list).
Report megaflops numbers, parallel time, and speedup for n=1600 with 4, 8, 16, 32 processes (processors).
- Write optimized Pthreads code for parallel matrix multiplication
so that you obtain the "best" megaflops performance for n=1600 running on a
cluster node with 8 and 24 cores (a baseline sketch follows this list).
Report the megaflops numbers and parallel time achieved.
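For the OpenMP task, one common starting point is to parallelize the outer
column loop of the naive kernel; this is a minimal sketch, not the required
solution. (For reporting, the conventional flop count of an n-by-n multiply is
2n^3, so megaflops = 2n^3 / (10^6 * seconds).) Compile with -fopenmp (GCC) or
-qopenmp (Intel).

    #include <omp.h>

    // Naive kernel with the outer (column) loop split across threads.
    // Each thread writes a disjoint set of columns of C, so no locking is needed.
    void square_dgemm_omp(int n, double *A, double *B, double *C)
    {
        #pragma omp parallel for
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                double b = B[k + j*n];
                for (int i = 0; i < n; ++i)
                    C[i + j*n] += A[i + k*n] * b;
            }
    }

The thread count for the 4/8/16/24-core runs can be set with OMP_NUM_THREADS
or omp_set_num_threads().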
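For the MPI task, one simple decomposition gives each process a contiguous
block of columns of B and C (contiguous in column-major storage), broadcasts
A, and has process 0 gather the result, as the assignment requires. A sketch
under those assumptions; for brevity it assumes n is divisible by the number
of processes.

    #include <mpi.h>

    // A must be allocated (n*n doubles) on every rank; B and C are only
    // significant on rank 0, which scatters B and gathers C.
    void square_dgemm_mpi(int n, double *A, double *B, double *C)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int cols = n / nprocs;  // assumes n % nprocs == 0 for brevity

        // Everyone needs all of A; each rank gets its own columns of B.
        MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double *Bloc = new double[n*cols];
        double *Cloc = new double[n*cols]();  // zero-initialized
        MPI_Scatter(B, n*cols, MPI_DOUBLE, Bloc, n*cols, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        // Local naive multiply on the owned column block (column-major).
        for (int j = 0; j < cols; ++j)
            for (int k = 0; k < n; ++k) {
                double b = Bloc[k + j*n];
                for (int i = 0; i < n; ++i)
                    Cloc[i + j*n] += A[i + k*n] * b;
            }

        // Process 0 collects the final result from all processes.
        MPI_Gather(Cloc, n*cols, MPI_DOUBLE, C, n*cols, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);
        delete[] Bloc;
        delete[] Cloc;
    }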
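For the Pthreads task, the sketch below simply splits the columns of C across
threads. It is only a baseline: reaching the "best" megaflops will require
combining it with blocking, a cache-friendly loop order, and similar tuning.

    #include <pthread.h>

    struct Task { int n, j0, j1; double *A, *B, *C; };

    // Worker: naive multiply restricted to columns [j0, j1) of C.
    static void *worker(void *arg)
    {
        Task *t = (Task *)arg;
        int n = t->n;
        for (int j = t->j0; j < t->j1; ++j)
            for (int k = 0; k < n; ++k) {
                double b = t->B[k + j*n];
                for (int i = 0; i < n; ++i)
                    t->C[i + j*n] += t->A[i + k*n] * b;
            }
        return nullptr;
    }

    // Launch nthreads workers, each owning a contiguous range of columns.
    // No locking is needed because the column ranges are disjoint.
    void square_dgemm_pthreads(int n, double *A, double *B, double *C,
                               int nthreads)
    {
        pthread_t tid[64];   // supports up to 64 threads
        Task task[64];
        for (int t = 0; t < nthreads; ++t) {
            int j0 = n * t / nthreads;
            int j1 = n * (t + 1) / nthreads;
            task[t] = Task{ n, j0, j1, A, B, C };
            pthread_create(&tid[t], nullptr, worker, &task[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(tid[t], nullptr);
    }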
What to submit
- Submit a report along with the code.
For each problem, explain whether the performance numbers obtained are reasonable.
Explain your design and justify how your design/implementation optimizes the megaflops performance.
The report should also contain instructions on how to compile and how to test.
The code must contain a sampling mechanism to check that the multiplied results are correct (a sketch follows below).
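Because A[i,j] = i+j and B[i,j] = i*j, the exact product is known in closed
form: with 0-based indices, C[i,j] = sum_{k=0}^{n-1} (i+k)(k*j) = j*(i*S1 + S2),
where S1 = n(n-1)/2 and S2 = (n-1)n(2n-1)/6. A sampling check can therefore
compare a few random entries of the computed C against this formula. A minimal
sketch, assuming C started at zero and column-major storage:

    #include <cmath>
    #include <cstdio>
    #include <cstdlib>

    // Spot-check `samples` random entries of C against the closed form
    // implied by A[i,j] = i+j, B[i,j] = i*j.
    bool sample_check(int n, const double *C, int samples)
    {
        double S1 = (double)n * (n - 1) / 2.0;              // sum of k
        double S2 = (double)(n - 1) * n * (2*n - 1) / 6.0;  // sum of k^2
        for (int s = 0; s < samples; ++s) {
            int i = rand() % n, j = rand() % n;
            double expected = j * (i * S1 + S2);
            double got = C[i + j*n];  // column-major entry (i,j)
            if (std::fabs(got - expected) > 1e-6 * (std::fabs(expected) + 1.0)) {
                std::printf("Mismatch at (%d,%d): got %g, expected %g\n",
                            i, j, got, expected);
                return false;
            }
        }
        return true;
    }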
Reference links: