Introduction to GPU Programming with CUDA

GPUs (Graphics Processing Units) are specialized pieces of hardware originally designed to store and modify images before they are sent to a display. Driving a display requires a lot of memory to hold each frame and a lot of bandwidth to send it to the screen. Over time, GPUs have assumed more and more responsibility. Today they offer a great deal of flexibility and enormous instruction throughput, which lets them solve many problems beyond displaying video or rendering a video game. Some potential applications include graph computations, deep learning, and numerical simulations.

To highlight the difference in compute power between a CPU and a GPU, compare two parts at the same price point:

CPU: Intel i7-8700K
    Price: $380
    Single-precision: 0.82 TFLOPS
    Double-precision: 0.41 TFLOPS
GPU: NVIDIA GTX 1070
    Price (MSRP): $380
    Single-precision: 5.7 TFLOPS
    Double-precision: 0.18 TFLOPS

Fig.1 - SMX Core

Architecture vs. Programming Model

Note that the CUDA physical architecture and the CUDA programming model are two different things.

The CUDA physical architecture describes how the chip is internally connected and how it executes instructions. These concepts are very similar to the material presented in CS154.

The CUDA programming model describes the API you can use to write code for the GPU. These concepts are very similar to material presented in CS140 (and CS170).

CUDA Architecture

Architecture Terminology

As an example of how these pieces fit together, the K80 GPU has 26 SMX units, each with 4 Warp Schedulers and 192 CUDA cores. This gives the GPU a combined 4992 CUDA cores, which theoretically allows it to execute 4992 threads at the same time! To reach this maximum computational speed, it keeps many more than 4992 threads in memory and only executes threads that are ready (not blocked on load/store operations). Groups of 32 threads are gathered into a Warp for their lifetime. These Warps wait in memory until they are scheduled by the Warp Schedulers, of which there are 4 per SMX. Once scheduled for an instruction, the Warp's 32 threads are assigned to 32 of the 192 CUDA cores, the instruction executes, and then the Warp waits until it is scheduled again by a Warp Scheduler.

Said simply: the GPU takes your threads, groups them together into Warps, and runs your Warps when it damn well pleases.
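
As a small, hedged illustration (the kernel name here is made up, and you rarely need to compute this by hand), a thread can work out which Warp and lane it belongs to using the built-in warpSize variable:

__global__
void warp_info_kernel()
{
  //warpSize is a CUDA built-in; it is 32 on current hardware, matching the grouping above
  int lane_in_warp  = threadIdx.x % warpSize;  //this thread's position within its Warp (0-31)
  int warp_in_block = threadIdx.x / warpSize;  //which Warp of the block this thread belongs to
}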

In Fig.1 you can see the architecture of an SMX Core. Since an SMX is completely self-contained, a GPU could have as few as 1 or as many as 15, depending on the price of the GPU.

In Fig.2 you can see 15 SMX units on the die of a GPU. Notice that DRAM is located off the chip: DRAM is accessed via the memory controllers, while Host RAM is accessed via the PCI-E 3.0 interface.

Fig.2 - A Kepler Die with 15 SMX's

Fig.3 drives home how data moves from DRAM to a Thread's registers in an SMX.

Fig.3 - Connectivity Diagram of Memory

CUDA Code

Code Terminology

The NVIDIA CUDA compiler (nvcc) outputs binaries that are compatible with an NVIDIA GPU. NVCC comes with the CUDA SDK, which can usually be found under /usr/local/<cuda_version>. For Comet we will be using NVCC from CUDA 7.0 to create binaries for the K80 GPUs, which have Compute Capability 3.7.
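
For example, a typical invocation might look like the line below (my_kernel.cu and my_program are placeholder names; -arch=sm_37 tells nvcc to generate code for Compute Capability 3.7):

nvcc -arch=sm_37 -O3 -o my_program my_kernel.cu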

Code for the GPU is defined in a special C function called a kernel. A regular CPU function can be made into a GPU function using the "__global__" declaration specifier. A GPU kernel should usually be written so that it runs correctly with an arbitrary number of blocks and an arbitrary number of threads per block.

__global__
void my_first_do_nothing_kernel()
{
  //block index * threads per block + thread index within block
  int my_unique_thread_id = blockIdx.x * blockDim.x + threadIdx.x;
}

int main(){
  //execute kernel on the GPU with 2 blocks and 4 threads per block
  my_first_do_nothing_kernel<<<2,4>>>();
  //kernel launches are asynchronous; wait for the GPU to finish before exiting
  cudaDeviceSynchronize();
  return 0;
}
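
The launch above hard-codes 2 blocks of 4 threads each. In practice the configuration is usually derived from the problem size; here is a minimal sketch, where N and threadsPerBlock are illustrative values rather than anything taken from the example above:

//choose enough blocks to cover all N elements, rounding up
int N = 1000000;              //hypothetical problem size
int threadsPerBlock = 256;    //a common choice; a multiple of 32 keeps Warps full
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
my_first_do_nothing_kernel<<<blocks, threadsPerBlock>>>();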

Kernels can be called from Host (aka CPU) code using the foo<<<N, M>>>() syntax. N specifies how many Thread Blocks (aka Blocks) should be used to run the function, and M specifies how many threads per block. When a kernel is called, each block is assigned to an SMX. Within a block, threads can communicate using the SMX's Shared Memory. However, blocks cannot communicate with each other; they must simply return their results to Global Memory and die. The return of a kernel can be thought of as a barrier for blocks, allowing you to gather the data and start a new kernel. The diagram below shows an example of a kernel running. Notice how only intra-block threads communicate, and how blocks are scheduled onto SMXs. Since each block can't communicate with the others, its execution is independent, which greatly simplifies scheduling.

One of the benefits of the block and thread programming model is that you don't need to care how many SMXs the GPU has. This allows the same program to run on a GPU with 1 SMX or 15. In the above example, if we had 4 SMXs available instead of 2, we could finish the computation in one step.
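
To make the Shared Memory and barrier ideas above concrete, here is a hedged sketch of a per-block sum: threads within a block cooperate through __shared__ memory and __syncthreads(), and each block writes one partial result to Global Memory before it dies. The kernel name and parameters are invented for this sketch, and it assumes the kernel is launched with 256 threads per block (a power of two).

__global__
void block_sum_kernel(const int *in, int *block_sums)
{
  //shared memory is visible only to the threads of this block
  __shared__ int scratch[256];          //matches the assumed 256 threads per block

  int tid = threadIdx.x;
  scratch[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();                      //wait until every thread in the block has written

  //tree reduction: only intra-block communication is needed
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
      scratch[tid] += scratch[tid + stride];
    __syncthreads();
  }

  //each block returns its result to Global Memory and dies; a second kernel launch
  //(or the host) acts as the barrier that gathers these partial sums
  if (tid == 0)
    block_sums[blockIdx.x] = scratch[0];
}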

Tutorial: Vector Addition

Source Code!

This example is fairly straightforward: it takes two vectors and adds them together.
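
The linked source is the authoritative version; the following is only a rough sketch of what such a program typically looks like, with names and sizes chosen for illustration rather than copied from the tutorial code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__
void vec_add(const float *a, const float *b, float *c, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)                      //the last block may have threads past the end of the vectors
    c[i] = a[i] + b[i];
}

int main()
{
  const int n = 1 << 20;          //illustrative vector length
  size_t bytes = n * sizeof(float);

  //host vectors
  float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
  for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

  //device vectors
  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;             //round up so every element is covered
  vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

  cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  //copy back (also waits for the kernel)
  printf("c[0] = %f\n", h_c[0]);                        //expect 3.0

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  free(h_a); free(h_b); free(h_c);
  return 0;
}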

Tutorial: Matrix Multiplication

Source Code!

This example is more complex. It takes two matrices and multiplies them together: C = A*B. Along with 1-D thread indices, CUDA supports 2-D and 3-D indexing. For this example I use 2-D indexing, since each thread outputs one element of C, an N*N matrix.
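
Again, the linked source is the real version; the following only sketches the 2-D indexing idea, and it assumes d_A, d_B, and d_C have already been allocated and filled with cudaMalloc/cudaMemcpy as in the vector addition example:

__global__
void mat_mul(const float *A, const float *B, float *C, int N)
{
  //2-D thread index: one thread computes one element C[row][col]
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < N && col < N) {
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
      sum += A[row * N + k] * B[k * N + col];   //dot product of a row of A with a column of B
    C[row * N + col] = sum;
  }
}

//host-side launch (inside main() or a helper): a 2-D grid of 2-D blocks, e.g. 16x16 threads per block
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);   //round up so every element of C is covered
mat_mul<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);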

Code Optimizations

More Details on CUDA Programming