GPU Architecture

Warps, SMs, memory hierarchy — how GPUs achieve massive parallelism for AI workloads.

Tags: gpu · cuda · parallelism · nvidia

Why GPUs?

A CPU core is optimized for low-latency serial execution. A GPU trades single-thread performance for massive throughput — thousands of simple cores executing in lockstep.

|           | CPU (high-end) | GPU (high-end)            |
|-----------|----------------|---------------------------|
| Cores     | 24–96          | 10,000–18,000 CUDA cores  |
| Clock     | 4–5.5 GHz      | 1.5–2.5 GHz               |
| Memory BW | ~100 GB/s      | ~3 TB/s (HBM3e)           |
| Peak FP32 | ~2 TFLOPS      | ~80+ TFLOPS               |
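A quick sanity check on the peak-FP32 row: peak throughput is roughly cores × clock × FLOPs per cycle (2, counting a fused multiply-add as two FLOPs). The core count and clock below are illustrative, not tied to a specific SKU:

```python
def peak_fp32_tflops(cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak FP32 = cores x clock x FLOPs/cycle (2 for a fused multiply-add)."""
    return cores * clock_ghz * flops_per_cycle / 1e3

# 16,384 CUDA cores at 2.5 GHz (assumed, high-end-GPU ballpark):
gpu = peak_fp32_tflops(cores=16_384, clock_ghz=2.5)
print(f"~{gpu:.1f} TFLOPS")  # ~81.9 TFLOPS, consistent with the ~80+ row above
```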

Streaming Multiprocessors (SMs)

An NVIDIA GPU is organized into SMs, each containing:

  • CUDA cores (FP32/INT32 units)
  • Tensor cores (matrix-multiply-accumulate)
  • Shared memory / L1 cache (configurable, typically 128–228 KB)
  • Warp schedulers

Warps and SIMT

Threads are grouped into warps of 32. All threads in a warp execute the same instruction — Single Instruction, Multiple Threads (SIMT). When threads in a warp take different branches, the paths are serialized: each path executes in turn while the non-participating threads are masked off, so divergence costs extra passes.

Warp 0:  [T0  T1  T2 ... T31]  ← same instruction
Warp 1:  [T32 T33 T34 ... T63] ← same instruction
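The cost of divergence can be sketched with a toy SIMT model: a warp needs one serialized pass per distinct branch path its threads take, with the other threads masked off during each pass.

```python
WARP_SIZE = 32

def passes_for_branch(path_per_thread):
    """Serialized passes a warp needs: one per distinct path taken."""
    return len(set(path_per_thread))

# All 32 threads agree -> a single pass:
uniform = ["then"] * WARP_SIZE
# Even/odd threads disagree -> both paths run, one after the other:
divergent = ["then" if t % 2 == 0 else "else" for t in range(WARP_SIZE)]

print(passes_for_branch(uniform))    # 1
print(passes_for_branch(divergent))  # 2
```

This is why branching on `threadIdx.x % 2` is costly while branching on a per-block condition (uniform across the warp) is essentially free.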

GPU Memory Hierarchy

| Memory     | Scope      | Latency     | Size        |
|------------|------------|-------------|-------------|
| Registers  | Per-thread | 0 cycles    | ~256 KB/SM  |
| Shared mem | Per-block  | ~20 cycles  | 64–228 KB/SM|
| L2 cache   | Global     | ~200 cycles | 40–96 MB    |
| HBM/GDDR   | Global     | ~400 cycles | 24–192 GB   |
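The hierarchy matters because compute outpaces memory by a wide margin. Dividing the headline numbers from the tables above gives the "ridge point": how many FLOPs a kernel must perform per byte loaded from HBM to be compute-bound rather than memory-bound (rough, illustrative figures):

```python
peak_tflops = 80.0   # ~80 TFLOPS FP32 (from the comparison table)
mem_bw_tbs = 3.0     # ~3 TB/s HBM bandwidth

# FLOPs that must be done per byte fetched to saturate the compute units:
ridge = peak_tflops / mem_bw_tbs
print(f"~{ridge:.1f} FLOPs/byte")  # ~26.7
```

Kernels below that ratio stall on memory, which is why fast kernels tile data into shared memory and registers to reuse each byte many times.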

Occupancy

Occupancy = active warps / max warps per SM. Higher occupancy helps hide memory latency through warp switching. But registers and shared memory are finite — more per thread means fewer concurrent warps.
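A minimal occupancy estimate, assuming typical per-SM limits (64 max warps, a 64K register file; these are assumptions, not a specific SKU): active warps are capped by whichever resource runs out first.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64       # assumed cap for recent NVIDIA GPUs
REGS_PER_SM = 64 * 1024     # assumed 64K 32-bit registers per SM

def occupancy(regs_per_thread: int) -> float:
    """Fraction of max warps that fit, considering only the register limit."""
    # Warps the register file can hold: each warp needs 32 x regs_per_thread.
    reg_limited_warps = REGS_PER_SM // (regs_per_thread * WARP_SIZE)
    active = min(MAX_WARPS_PER_SM, reg_limited_warps)
    return active / MAX_WARPS_PER_SM

print(occupancy(regs_per_thread=32))  # 1.0  -> full occupancy
print(occupancy(regs_per_thread=64))  # 0.5  -> registers cap the SM at 32 warps
```

Shared memory imposes an analogous per-block cap; the compiler's `-Xptxas -v` output reports the per-thread register count that feeds this calculation.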

For ML engineers: Understanding SM-level resource limits is essential for writing efficient custom CUDA kernels (e.g., FlashAttention, fused optimizers).