MPI is a message-passing standard widely used for developing high-performance parallel applications. This project addresses the performance portability of MPI code on shared-memory machines and on clusters of shared-memory machines.
Because of restrictions in the MPI computation model, conventional implementations on shared-memory machines map each MPI node to an OS process, which suffers serious performance degradation under multiprogramming, especially when the OS job scheduler employs a space/time-sharing policy. In this project, we study compile-time and run-time support for MPI using threads and demonstrate our optimization techniques on a large class of MPI programs written in C. The compile-time transformation adopts thread-specific data structures to eliminate global and static variables in C code. The run-time support includes an efficient point-to-point communication protocol based on a novel lock-free queue management scheme. Our experiments on an SGI Origin 2000 show that our MPI prototype, called TMPI, is competitive with SGI's native MPI implementation in a dedicated environment and has significant performance advantages, up to a 23-fold improvement, in a multiprogrammed environment. A paper on this topic appeared in PPoPP'99.
Initially we mapped each MPI node to a kernel thread. Kernel threads have a higher context-switch cost than user-level threads, which lengthens the spinning time required during MPI synchronization. Recently we have developed an adaptive two-level thread scheme that minimizes kernel-level context switches for MPI. This scheme also exposes thread-scheduling information at the user level, which allows us to design an adaptive event-waiting strategy that minimizes CPU spinning and exploits cache affinity during synchronization. Our experiments with synthetic workloads show that the MPI system based on these techniques has substantial performance advantages over both the previous version of TMPI and the SGI MPI implementation in multiprogrammed environments.
We have also investigated the design and implementation of a thread-based MPI system on SMP clusters. Our study indicates that with a proper design for threaded MPI execution, both point-to-point and collective communication performance can be improved substantially compared to a process-based MPI implementation in a cluster environment. The key optimizations we propose are hierarchy-aware communication algorithms for threaded MPI execution and a thread-safe network device abstraction that features an event-driven synchronization interface and separate point-to-point and collective communication channels. Event-driven synchronization among MPI nodes takes advantage of lightweight threads and eliminates the spinning overhead caused by busy polling. Channel separation allows a more flexible and efficient design of collective communication primitives.