GPU Architecture
Warps, SMs, memory hierarchy — how GPUs achieve massive parallelism for AI workloads.
Why GPUs?
A CPU core is optimized for low-latency serial execution. A GPU trades single-thread performance for massive throughput — thousands of simple cores executing in lockstep.
| | CPU (high-end) | GPU (high-end) |
|---|---|---|
| Cores | 24–96 | 10 000–18 000 CUDA cores |
| Clock | 4–5.5 GHz | 1.5–2.5 GHz |
| Memory BW | ~100 GB/s | ~3 TB/s (HBM3e) |
| Peak FP32 | ~2 TFLOPS | ~80+ TFLOPS |
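One way to read the table is as a "balance point": a kernel must do roughly peak-FLOPS / memory-bandwidth floating-point operations per byte it moves, or it will be bandwidth-bound rather than compute-bound. A minimal sketch of that arithmetic, using the illustrative figures above:

```python
# Back-of-envelope balance point from the (illustrative) table numbers.
# A kernel needs about peak_flops / mem_bw FLOPs per byte fetched from
# main memory before compute, not bandwidth, becomes the bottleneck.

def balance_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """FLOPs per byte needed to saturate compute instead of memory."""
    return peak_flops / mem_bw_bytes_per_s

gpu = balance_point(80e12, 3e12)    # ~80 TFLOPS FP32, ~3 TB/s HBM -> ~27 FLOPs/byte
cpu = balance_point(2e12, 100e9)    # ~2 TFLOPS, ~100 GB/s         -> ~20 FLOPs/byte

print(f"GPU: ~{gpu:.0f} FLOPs/byte, CPU: ~{cpu:.0f} FLOPs/byte")
```

Despite a ~40x bandwidth advantage, the GPU's balance point is still higher than the CPU's, which is why memory-bound kernels (and fused kernels that avoid round trips to HBM) matter so much on GPUs.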
Streaming Multiprocessors (SMs)
An NVIDIA GPU is organized into SMs, each containing:
- CUDA cores (FP32/INT32 units)
- Tensor cores (matrix-multiply-accumulate)
- Shared memory / L1 cache (configurable, typically 128–228 KB)
- Warp schedulers
Warps and SIMT
Threads are grouped into warps of 32. All threads in a warp execute the same instruction — Single Instruction, Multiple Threads (SIMT). When a branch diverges within a warp, the hardware serializes the paths: each side executes in turn with the non-participating threads masked off, so a fully divergent if/else pays for both branches.
```
Warp 0: [T0  T1  T2  ... T31]  ← same instruction
Warp 1: [T32 T33 T34 ... T63]  ← same instruction
```
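The cost of divergence can be captured with a toy model (a sketch, not real hardware behavior): a warp pays for every branch path that at least one of its lanes takes.

```python
# Toy model of SIMT divergence cost. All 32 lanes share one program
# counter; if a branch splits the warp, each side runs in turn with the
# other lanes masked off, so the warp pays for both paths.

WARP_SIZE = 32

def warp_branch_cost(branch_taken, cost_if, cost_else):
    """Cycles a warp spends on an if/else, given each lane's branch outcome."""
    pays_if = any(branch_taken)          # some lane took the if-path
    pays_else = not all(branch_taken)    # some lane took the else-path
    return (cost_if if pays_if else 0) + (cost_else if pays_else else 0)

# Uniform branch: every lane agrees, only one path executes.
uniform = warp_branch_cost([True] * WARP_SIZE, cost_if=10, cost_else=50)     # 10

# Divergent branch (e.g. `if tid % 2`): the warp executes both paths.
divergent = warp_branch_cost([t % 2 == 0 for t in range(WARP_SIZE)], 10, 50) # 60
```

This is why branching on values that are uniform across a warp (or across a block) is effectively free, while per-thread data-dependent branches can multiply a kernel's runtime.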
GPU Memory Hierarchy
| Memory | Scope | Latency | Size |
|---|---|---|---|
| Registers | Per-thread | 0 cycles | ~256 KB/SM |
| Shared mem | Per-block | ~20 cycles | 64–228 KB/SM |
| L2 cache | Global | ~200 cycles | 40–96 MB |
| HBM/GDDR | Global | ~400 cycles | 24–192 GB |
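The latency column explains why GPUs keep so many warps resident: a stalled warp is simply swapped out. A rough sketch of the arithmetic (illustrative numbers from the table; real SMs also exploit instruction-level parallelism and multiple outstanding loads, so fewer warps suffice in practice):

```python
# Rough model: to hide a stall of `latency_cycles`, a scheduler that can
# issue one warp-instruction per cycle needs about latency / issue_interval
# other warps with ready work.

def warps_to_hide(latency_cycles: int, issue_interval: int = 1) -> int:
    return -(-latency_cycles // issue_interval)  # ceiling division

shared_mem = warps_to_hide(20)   # ~20 warps to cover a shared-memory access
hbm = warps_to_hide(400)         # ~400 warp-slots of work for an HBM miss
```

Note that 400 far exceeds the ~64 resident warps an SM supports, which is why real kernels must also issue several independent memory operations per warp rather than rely on occupancy alone.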
Occupancy
Occupancy = active warps / max warps per SM. Higher occupancy helps hide memory latency through warp switching. But registers and shared memory are finite — more per thread means fewer concurrent warps.
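The register trade-off can be made concrete with a simplified occupancy calculation. The SM limits below (65,536 registers and 64 resident warps per SM) are illustrative, roughly Ampere/Hopper-class, and real hardware allocates registers with some granularity this sketch ignores:

```python
# Simplified occupancy model: register pressure alone, ignoring shared
# memory, block-count limits, and allocation granularity.

REGS_PER_SM = 65_536   # assumed 32-bit registers per SM
MAX_WARPS_PER_SM = 64  # assumed resident-warp limit
WARP_SIZE = 32

def occupancy(regs_per_thread: int) -> float:
    """Fraction of the SM's warp slots usable at this register count."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    resident_warps = min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)
    return resident_warps / MAX_WARPS_PER_SM

occupancy(32)    # 1.0  -> all 64 warps fit
occupancy(64)    # 0.5  -> 32 warps fit
occupancy(128)   # 0.25 -> only 16 warps fit
```

Doubling per-thread register use from 64 to 128 halves occupancy, which is the tension kernel authors navigate: more registers per thread can mean more work per thread, but fewer warps available to hide latency.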
For ML engineers: Understanding SM-level resource limits is essential for writing efficient custom CUDA kernels (e.g., FlashAttention, fused optimizers).