GPU Architecture
Warps, SMs, memory hierarchy — how GPUs achieve massive parallelism for AI workloads.
Why GPUs?
A CPU core is optimized for low-latency serial execution. A GPU trades single-thread performance for massive throughput — thousands of simple cores executing in lockstep.
| | CPU (high-end) | GPU (high-end) |
|---|---|---|
| Cores | 24–96 | 10 000–18 000 CUDA cores |
| Clock | 4–5.5 GHz | 1.5–2.5 GHz |
| Memory BW | ~100 GB/s | ~3 TB/s (HBM3e) |
| Peak FP32 | ~2 TFLOPS | ~80+ TFLOPS |
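One way to read the table is as a "balance point": a kernel must do roughly peak-FLOPS / memory-bandwidth floating-point operations per byte it moves, or it will be bandwidth-bound rather than compute-bound. A minimal sketch of that arithmetic, using the illustrative figures above:

```python
# Back-of-envelope balance point from the (illustrative) table numbers.
# A kernel needs about peak_flops / mem_bw FLOPs per byte fetched from
# main memory before compute, not bandwidth, becomes the bottleneck.

def balance_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """FLOPs per byte needed to saturate compute instead of memory."""
    return peak_flops / mem_bw_bytes_per_s

gpu = balance_point(80e12, 3e12)    # ~80 TFLOPS FP32, ~3 TB/s HBM -> ~27 FLOPs/byte
cpu = balance_point(2e12, 100e9)    # ~2 TFLOPS, ~100 GB/s         -> ~20 FLOPs/byte

print(f"GPU: ~{gpu:.0f} FLOPs/byte, CPU: ~{cpu:.0f} FLOPs/byte")
```

Despite a ~40x bandwidth advantage, the GPU's balance point is still higher than the CPU's, which is why memory-bound kernels (and fused kernels that avoid round trips to HBM) matter so much on GPUs.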
Streaming Multiprocessors (SMs)
An NVIDIA GPU is organized into SMs, each containing:
- CUDA cores (FP32/INT32 units)
- Tensor cores (matrix-multiply-accumulate)
- Shared memory / L1 cache (configurable, typically 128–228 KB)
- Warp schedulers
Warps and SIMT
Threads are grouped into warps of 32. All threads in a warp execute the same instruction — Single Instruction, Multiple Threads (SIMT). When a branch diverges within a warp, the hardware serializes the paths: each side executes in turn with the non-participating threads masked off, so a fully divergent if/else pays for both branches.
```
Warp 0: [T0  T1  T2  ... T31]  ← same instruction
Warp 1: [T32 T33 T34 ... T63]  ← same instruction
```
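The cost of divergence can be captured with a toy model (a sketch, not real hardware behavior): a warp pays for every branch path that at least one of its lanes takes.

```python
# Toy model of SIMT divergence cost. All 32 lanes share one program
# counter; if a branch splits the warp, each side runs in turn with the
# other lanes masked off, so the warp pays for both paths.

WARP_SIZE = 32

def warp_branch_cost(branch_taken, cost_if, cost_else):
    """Cycles a warp spends on an if/else, given each lane's branch outcome."""
    pays_if = any(branch_taken)          # some lane took the if-path
    pays_else = not all(branch_taken)    # some lane took the else-path
    return (cost_if if pays_if else 0) + (cost_else if pays_else else 0)

# Uniform branch: every lane agrees, only one path executes.
uniform = warp_branch_cost([True] * WARP_SIZE, cost_if=10, cost_else=50)     # 10

# Divergent branch (e.g. `if tid % 2`): the warp executes both paths.
divergent = warp_branch_cost([t % 2 == 0 for t in range(WARP_SIZE)], 10, 50) # 60
```

This is why branching on values that are uniform across a warp (or across a block) is effectively free, while per-thread data-dependent branches can multiply a kernel's runtime.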
GPU Memory Hierarchy
| Memory | Scope | Latency | Size |
|---|---|---|---|
| Registers | Per-thread | 0 cycles | ~256 KB/SM |
| Shared mem | Per-block | ~20 cycles | 64–228 KB/SM |
| L2 cache | Global | ~200 cycles | 40–96 MB |
| HBM/GDDR | Global | ~400 cycles | 24–192 GB |
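The latency column explains why GPUs keep so many warps resident: a stalled warp is simply swapped out. A rough sketch of the arithmetic (illustrative numbers from the table; real SMs also exploit instruction-level parallelism and multiple outstanding loads, so fewer warps suffice in practice):

```python
# Rough model: to hide a stall of `latency_cycles`, a scheduler that can
# issue one warp-instruction per cycle needs about latency / issue_interval
# other warps with ready work.

def warps_to_hide(latency_cycles: int, issue_interval: int = 1) -> int:
    return -(-latency_cycles // issue_interval)  # ceiling division

shared_mem = warps_to_hide(20)   # ~20 warps to cover a shared-memory access
hbm = warps_to_hide(400)         # ~400 warp-slots of work for an HBM miss
```

Note that 400 far exceeds the ~64 resident warps an SM supports, which is why real kernels must also issue several independent memory operations per warp rather than rely on occupancy alone.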
Occupancy
Occupancy = active warps / max warps per SM. Higher occupancy helps hide memory latency through warp switching. But registers and shared memory are finite — more per thread means fewer concurrent warps.
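The register trade-off can be made concrete with a simplified occupancy calculation. The SM limits below (65,536 registers and 64 resident warps per SM) are illustrative, roughly Ampere/Hopper-class, and real hardware allocates registers with some granularity this sketch ignores:

```python
# Simplified occupancy model: register pressure alone, ignoring shared
# memory, block-count limits, and allocation granularity.

REGS_PER_SM = 65_536   # assumed 32-bit registers per SM
MAX_WARPS_PER_SM = 64  # assumed resident-warp limit
WARP_SIZE = 32

def occupancy(regs_per_thread: int) -> float:
    """Fraction of the SM's warp slots usable at this register count."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    resident_warps = min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)
    return resident_warps / MAX_WARPS_PER_SM

occupancy(32)    # 1.0  -> all 64 warps fit
occupancy(64)    # 0.5  -> 32 warps fit
occupancy(128)   # 0.25 -> only 16 warps fit
```

Doubling per-thread register use from 64 to 128 halves occupancy, which is the tension kernel authors navigate: more registers per thread can mean more work per thread, but fewer warps available to hide latency.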
For ML engineers: Understanding SM-level resource limits is essential for writing efficient custom CUDA kernels (e.g., FlashAttention, fused optimizers).