Introduction to GPU Programming with CUDA

GPUs (Graphics Processing Units) are specialized pieces of hardware originally designed to store and modify images before they are sent to a display. Driving a display requires a lot of memory to hold each frame and a lot of bandwidth to send it to the screen. Over time, GPUs have assumed more and more responsibility. Today they offer a great deal of flexibility and enormous instruction throughput, which lets them solve many problems beyond displaying video or rendering a video game. Some potential applications include graph computations, deep learning, and numerical simulations.

To highlight the difference in compute power between a CPU and a GPU, compare two parts at the same price point:

CPU: Intel i7-8700K
    Price: $380
    Single-precision: 0.82 TFLOPS
    Double-precision: 0.41 TFLOPS
GPU: NVIDIA GTX 1070
    Price (MSRP): $380
    Single-precision: 5.7 TFLOPS
    Double-precision: 0.18 TFLOPS

Fig.1 - SMX Core

Architecture vs. Programming Model

Note that the CUDA physical architecture and the CUDA programming model are two different things.

The CUDA physical architecture describes how the chip is internally connected and how it executes instructions. These concepts are very similar to the material presented in CS154.

The CUDA programming model describes the API you can use to write code for the GPU. These concepts are very similar to material presented in CS140 (and CS170).

CUDA Architecture

Architecture Terminology

As an example of how these pieces fit together, the K80 GPU has 26 SMX units, each with 4 Warp Schedulers and 192 CUDA cores. This gives the GPU a combined 4992 CUDA cores, which theoretically allows it to execute 4992 threads at the same time! To reach this maximum computational speed, it keeps many more than 4992 threads in memory and only executes threads that are ready (not blocked on load/store operations). Groups of 32 threads are gathered into a Warp for their lifetime. These Warps wait in memory until they are scheduled by the Warp Schedulers, of which there are 4 per SMX. Once scheduled for an instruction, the Warp's 32 threads are assigned to 32 of the 192 CUDA cores, the instruction executes, and then the Warp waits until it is scheduled again by a Warp Scheduler.

Said simply: the GPU takes your threads, groups them together into Warps, and runs your Warps when it damn well pleases.
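
As a small, hedged illustration (the kernel name here is made up, and you rarely need to compute this by hand), a thread can work out which Warp and lane it belongs to using the built-in warpSize variable:

__global__
void warp_info_kernel()
{
  //warpSize is a CUDA built-in; it is 32 on current hardware, matching the grouping above
  int lane_in_warp  = threadIdx.x % warpSize;  //this thread's position within its Warp (0-31)
  int warp_in_block = threadIdx.x / warpSize;  //which Warp of the block this thread belongs to
}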

In Fig.1 you can see the architecture of an SMX Core. Since an SMX is completely self-contained, a GPU could have as few as 1 or as many as 15, depending on the price of the GPU.

In Fig.2 you can see 15 SMX units on the die of a GPU. Notice that DRAM is located off the chip: DRAM is accessed via the memory controllers, while Host RAM is accessed via the PCI-E 3.0 interface.

Fig.2 - A Kepler Die with 15 SMX's

Fig.3 drives home how data moves from DRAM to a Thread's registers in an SMX.

Fig.3 - Connectivity Diagram of Memory

CUDA Code

Code Terminology

The NVIDIA CUDA compiler (nvcc) outputs binaries that are compatible with an NVIDIA GPU. NVCC comes with the CUDA SDK, which can usually be found under /usr/local/<cuda_version>. For Comet we will be using NVCC from CUDA 7.0 to create binaries for the K80 GPUs, which have Compute Capability 3.7.
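
For example, a typical invocation might look like the line below (my_kernel.cu and my_program are placeholder names; -arch=sm_37 tells nvcc to generate code for Compute Capability 3.7):

nvcc -arch=sm_37 -O3 -o my_program my_kernel.cu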

Code for the GPU is defined in a special C function called a kernel. A regular CPU function can be made into a GPU function using the "__global__" declaration specifier. A GPU kernel should usually be written so that it runs correctly with an arbitrary number of blocks and an arbitrary number of threads per block.

__global__
void my_first_do_nothing_kernel()
{
  //block index * threads per block + thread index within block
  int my_unique_thread_id = blockIdx.x * blockDim.x + threadIdx.x;
}

int main(){
  //execute kernel on the GPU with 2 blocks and 4 threads per block
  my_first_do_nothing_kernel<<<2,4>>>();
  //kernel launches are asynchronous; wait for the GPU to finish before exiting
  cudaDeviceSynchronize();
  return 0;
}
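
The launch above hard-codes 2 blocks of 4 threads each. In practice the configuration is usually derived from the problem size; here is a minimal sketch, where N and threadsPerBlock are illustrative values rather than anything taken from the example above:

//choose enough blocks to cover all N elements, rounding up
int N = 1000000;              //hypothetical problem size
int threadsPerBlock = 256;    //a common choice; a multiple of 32 keeps Warps full
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
my_first_do_nothing_kernel<<<blocks, threadsPerBlock>>>();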

Kernels can be called from Host (aka CPU) code using the foo<<<N, M>>>() syntax. N specifies how many Thread Blocks (aka Blocks) should be used to run the function, and M specifies how many threads per block. When a kernel is called, each block is assigned to an SMX. Within a block, threads can communicate using the SMX's Shared Memory. However, blocks cannot communicate with each other; they must simply return their results to Global Memory and die. The return of a kernel can be thought of as a barrier for blocks, allowing you to gather the data and start a new kernel. The diagram below shows an example of a kernel running. Notice how only intra-block threads communicate, and how blocks are scheduled onto SMXs. Since each block can't communicate with the others, its execution is independent, which greatly simplifies scheduling.

One of the benefits of the block and thread programming model is that you don't need to care how many SMXs the GPU has. This allows the same program to run on a GPU with 1 SMX or 15. In the above example, if we had 4 SMXs available instead of 2, we could finish the computation in one step.
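
To make the Shared Memory and barrier ideas above concrete, here is a hedged sketch of a per-block sum: threads within a block cooperate through __shared__ memory and __syncthreads(), and each block writes one partial result to Global Memory before it dies. The kernel name and parameters are invented for this sketch, and it assumes the kernel is launched with 256 threads per block (a power of two).

__global__
void block_sum_kernel(const int *in, int *block_sums)
{
  //shared memory is visible only to the threads of this block
  __shared__ int scratch[256];          //matches the assumed 256 threads per block

  int tid = threadIdx.x;
  scratch[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();                      //wait until every thread in the block has written

  //tree reduction: only intra-block communication is needed
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
      scratch[tid] += scratch[tid + stride];
    __syncthreads();
  }

  //each block returns its result to Global Memory and dies; a second kernel launch
  //(or the host) acts as the barrier that gathers these partial sums
  if (tid == 0)
    block_sums[blockIdx.x] = scratch[0];
}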

Tutorial: Vector Addition

Source Code!

This example is fairly straightforward: it takes two vectors and adds them together.
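
The linked source is the authoritative version; the following is only a rough sketch of what such a program typically looks like, with names and sizes chosen for illustration rather than copied from the tutorial code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__
void vec_add(const float *a, const float *b, float *c, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)                      //the last block may have threads past the end of the vectors
    c[i] = a[i] + b[i];
}

int main()
{
  const int n = 1 << 20;          //illustrative vector length
  size_t bytes = n * sizeof(float);

  //host vectors
  float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
  for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

  //device vectors
  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;             //round up so every element is covered
  vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

  cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  //copy back (also waits for the kernel)
  printf("c[0] = %f\n", h_c[0]);                        //expect 3.0

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  free(h_a); free(h_b); free(h_c);
  return 0;
}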

Tutorial: Matrix Multiplication

Source Code!

This example is more complex. It takes two matrices and multiplies them together: C = A*B. Along with 1-D thread indices, CUDA supports 2-D and 3-D indexing. For this example I use 2-D indexing, since each thread outputs one element of C, an N*N matrix.
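
Again, the linked source is the real version; the following only sketches the 2-D indexing idea, and it assumes d_A, d_B, and d_C have already been allocated and filled with cudaMalloc/cudaMemcpy as in the vector addition example:

__global__
void mat_mul(const float *A, const float *B, float *C, int N)
{
  //2-D thread index: one thread computes one element C[row][col]
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < N && col < N) {
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
      sum += A[row * N + k] * B[k * N + col];   //dot product of a row of A with a column of B
    C[row * N + col] = sum;
  }
}

//host-side launch (inside main() or a helper): a 2-D grid of 2-D blocks, e.g. 16x16 threads per block
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);   //round up so every element of C is covered
mat_mul<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);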

Code Optimizations

More Details on CUDA Programming