240A Winter 2016 HW1: Parallel matrix multiplication
You will port and parallelize code for matrix multiplication, a basic
building block in many scientific computations. The most naive code to
multiply two n-by-n square matrices is:
    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                C[i,j] = C[i,j] + A[i,k] * B[k,j]
            end
        end
    end
Initialize A[i,j] = i+j and B[i,j] = i*j.
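As a concrete starting point, here is a minimal C++ sketch of the naive kernel
and the initialization above, assuming 0-based indices and the column-major
storage described below (entry (i,j) at offset i+j*n). The actual
dgemm-naive.cpp in the tar file may differ in details.

    // Naive triple-loop multiply: C = C + A*B, column-major storage.
    void square_dgemm(int n, double *A, double *B, double *C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i + j*n] += A[i + k*n] * B[k + j*n];
    }

    // Initialization from the assignment: A[i,j] = i+j, B[i,j] = i*j,
    // and C cleared to zero so the accumulation starts from nothing.
    void init(int n, double *A, double *B, double *C)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                A[i + j*n] = i + j;
                B[i + j*n] = double(i) * j;
                C[i + j*n] = 0.0;
            }
    }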
There are 3 options to implement the sequential code:
1) The naive approach listed above.
2) A submatrix partitioning (blocked version).
3) A BLAS3 dgemm library function.
The sample C/C++ code for the above 3 options, with timing and a test driver,
is available from this tar file. These 3 options implement the core function
void square_dgemm( int n, double *A, double *B, double *C )
The matrices are stored in column-major order, i.e. entry C(i,j) is stored at C[i+j*n].
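For option 2, the idea is to compute the product tile by tile so that the
working set fits in cache. The following is only a sketch: BLOCK is a
hypothetical tuning parameter, and the dgemm-blocked.cpp in the tar file may
organize its loops differently.

    #include <algorithm>

    // Hypothetical tile size; tune so three BLOCK x BLOCK tiles fit in cache.
    const int BLOCK = 64;

    // Blocked multiply: C = C + A*B, column-major storage.
    void square_dgemm(int n, double *A, double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int ii = 0; ii < n; ii += BLOCK)
                    // One tile pair: C(ii:,jj:) += A(ii:,kk:) * B(kk:,jj:)
                    for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                        for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                            double b = B[k + j*n];
                            for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                                C[i + j*n] += A[i + k*n] * b;
                        }
    }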
The tar file includes:
- dgemm-naive.cpp: a naive implementation of matrix multiply using three loops.
- dgemm-blocked.cpp: a simple blocked implementation of matrix multiply.
- dgemm-blas.cpp: a wrapper for the BLAS3 dgemm library function.
- benchmark.cpp: the driver program that measures the runtime and verifies correctness.
You can call BLAS dgemm() using the Intel MKL library on the Comet cluster. A small
include-file change for using dgemm() is here.
A sample makefile for linking MKL is here.
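For reference, a wrapper in the style of option 3 might look like the
following. This is a sketch assuming MKL's standard CBLAS interface (mkl.h and
cblas_dgemm); the dgemm-blas.cpp in the tar file is the authoritative version.

    #include <mkl.h>  // provides cblas_dgemm, MKL's CBLAS interface

    // Wrapper: C = C + A*B for n x n column-major matrices, via MKL.
    void square_dgemm(int n, double *A, double *B, double *C)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,   // alpha, A, leading dimension of A
                         B, n,   // B, leading dimension of B
                    1.0, C, n);  // beta, C, leading dimension of C
    }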
What to do
- Port the code to Comet with the Intel MKL library.
Report the megaflops numbers for the above 3 options with n = 100, 200, 400, 800, and 1600 on one core.
- Parallelize the naive sequential program using OpenMP (a sketch follows this list).
Report megaflops numbers, parallel time, and speedup for n=1600 with 4, 8, 16, 24 cores.
- Parallelize the naive program using MPI, with process 0 collecting the final
results from all processes (a sketch follows this list).
Report megaflops numbers, parallel time, and speedup for n=1600 with 4, 8, 16, 32 processes (processors).
- Write optimized Pthreads code for parallel matrix multiplication
so that you obtain the "best" megaflops performance for n=1600 running on a
cluster node with 8 and 24 cores (a baseline sketch follows this list).
Report the megaflops numbers and parallel time achieved.
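For the OpenMP task, one common starting point is to parallelize the outer
column loop of the naive kernel; this is a minimal sketch, not the required
solution. (For reporting, the conventional flop count of an n-by-n multiply is
2n^3, so megaflops = 2n^3 / (10^6 * seconds).) Compile with -fopenmp (GCC) or
-qopenmp (Intel).

    #include <omp.h>

    // Naive kernel with the outer (column) loop split across threads.
    // Each thread writes a disjoint set of columns of C, so no locking is needed.
    void square_dgemm_omp(int n, double *A, double *B, double *C)
    {
        #pragma omp parallel for
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                double b = B[k + j*n];
                for (int i = 0; i < n; ++i)
                    C[i + j*n] += A[i + k*n] * b;
            }
    }

The thread count for the 4/8/16/24-core runs can be set with OMP_NUM_THREADS
or omp_set_num_threads().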
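For the MPI task, one simple decomposition gives each process a contiguous
block of columns of B and C (contiguous in column-major storage), broadcasts
A, and has process 0 gather the result, as the assignment requires. A sketch
under those assumptions; for brevity it assumes n is divisible by the number
of processes.

    #include <mpi.h>

    // A must be allocated (n*n doubles) on every rank; B and C are only
    // significant on rank 0, which scatters B and gathers C.
    void square_dgemm_mpi(int n, double *A, double *B, double *C)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int cols = n / nprocs;  // assumes n % nprocs == 0 for brevity

        // Everyone needs all of A; each rank gets its own columns of B.
        MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double *Bloc = new double[n*cols];
        double *Cloc = new double[n*cols]();  // zero-initialized
        MPI_Scatter(B, n*cols, MPI_DOUBLE, Bloc, n*cols, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        // Local naive multiply on the owned column block (column-major).
        for (int j = 0; j < cols; ++j)
            for (int k = 0; k < n; ++k) {
                double b = Bloc[k + j*n];
                for (int i = 0; i < n; ++i)
                    Cloc[i + j*n] += A[i + k*n] * b;
            }

        // Process 0 collects the final result from all processes.
        MPI_Gather(Cloc, n*cols, MPI_DOUBLE, C, n*cols, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);
        delete[] Bloc;
        delete[] Cloc;
    }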
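For the Pthreads task, the sketch below simply splits the columns of C across
threads. It is only a baseline: reaching the "best" megaflops will require
combining it with blocking, a cache-friendly loop order, and similar tuning.

    #include <pthread.h>

    struct Task { int n, j0, j1; double *A, *B, *C; };

    // Worker: naive multiply restricted to columns [j0, j1) of C.
    static void *worker(void *arg)
    {
        Task *t = (Task *)arg;
        int n = t->n;
        for (int j = t->j0; j < t->j1; ++j)
            for (int k = 0; k < n; ++k) {
                double b = t->B[k + j*n];
                for (int i = 0; i < n; ++i)
                    t->C[i + j*n] += t->A[i + k*n] * b;
            }
        return nullptr;
    }

    // Launch nthreads workers, each owning a contiguous range of columns.
    // No locking is needed because the column ranges are disjoint.
    void square_dgemm_pthreads(int n, double *A, double *B, double *C,
                               int nthreads)
    {
        pthread_t tid[64];   // supports up to 64 threads
        Task task[64];
        for (int t = 0; t < nthreads; ++t) {
            int j0 = n * t / nthreads;
            int j1 = n * (t + 1) / nthreads;
            task[t] = Task{ n, j0, j1, A, B, C };
            pthread_create(&tid[t], nullptr, worker, &task[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(tid[t], nullptr);
    }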
What to submit
- Submit a report along with the code.
For each problem, explain whether the performance numbers obtained are reasonable.
Explain your design and justify how your design/implementation optimizes the megaflops performance.
The report should also contain instructions on how to compile and how to test.
The code must contain a sampling mechanism to check that the multiplied results are correct (a sketch follows below).
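Because A[i,j] = i+j and B[i,j] = i*j, the exact product is known in closed
form: with 0-based indices, C[i,j] = sum_{k=0}^{n-1} (i+k)(k*j) = j*(i*S1 + S2),
where S1 = n(n-1)/2 and S2 = (n-1)n(2n-1)/6. A sampling check can therefore
compare a few random entries of the computed C against this formula. A minimal
sketch, assuming C started at zero and column-major storage:

    #include <cmath>
    #include <cstdio>
    #include <cstdlib>

    // Spot-check `samples` random entries of C against the closed form
    // implied by A[i,j] = i+j, B[i,j] = i*j.
    bool sample_check(int n, const double *C, int samples)
    {
        double S1 = (double)n * (n - 1) / 2.0;              // sum of k
        double S2 = (double)(n - 1) * n * (2*n - 1) / 6.0;  // sum of k^2
        for (int s = 0; s < samples; ++s) {
            int i = rand() % n, j = rand() % n;
            double expected = j * (i * S1 + S2);
            double got = C[i + j*n];  // column-major entry (i,j)
            if (std::fabs(got - expected) > 1e-6 * (std::fabs(expected) + 1.0)) {
                std::printf("Mismatch at (%d,%d): got %g, expected %g\n",
                            i, j, got, expected);
                return false;
            }
        }
        return true;
    }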
Reference links: