Memory Hierarchy

From registers to DRAM — understanding latency, bandwidth, and why cache-friendly code wins.


The Pyramid

| Level     | Typical size        | Latency (cycles) |
|-----------|---------------------|------------------|
| Registers | a few hundred bytes | 0–1              |
| L1 cache  | 32–64 KB            | ~4               |
| L2 cache  | 256 KB – 1 MB       | ~12              |
| L3 cache  | 8–64 MB             | ~40              |
| DRAM      | 16–512 GB           | ~200             |
| NVMe SSD  | TB-scale            | ~10,000+         |

Every level trades capacity for speed. A program that fits its working set in L1 will feel instant; one that thrashes DRAM will crawl.

Cache Lines and Spatial Locality

Caches operate on lines, typically 64 bytes on modern x86 and ARM CPUs. Accessing a single byte pulls in the whole line, so after the first miss the next fifteen 4-byte elements of a contiguous array are essentially free: this is spatial locality at work.

// cache-friendly: sequential access
for (int i = 0; i < N; i++)
    sum += a[i];

// cache-hostile: strided access
// (with STRIDE >= 16 ints, every iteration touches a new 64-byte line)
for (int i = 0; i < N; i += STRIDE)
    sum += a[i];

Prefetching

Hardware prefetchers detect sequential and strided patterns automatically. Software prefetch intrinsics (__builtin_prefetch, _mm_prefetch) give you explicit control when access patterns are irregular.
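For example, with an index array (a gather) the hardware prefetcher sees no pattern, but software can request the element it will need a few iterations ahead. A sketch for GCC/Clang; `PF_DIST` and the locality hint are assumptions to tune per machine:

```c
// Software prefetch for an indirect access pattern the hardware
// prefetcher cannot predict. Requires GCC or Clang (__builtin_prefetch).
#include <stddef.h>

#define PF_DIST 16   // how far ahead to prefetch, in elements (tunable)

long gather_sum(const long *data, const size_t *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            // args: address, rw (0 = read), temporal locality hint (0-3)
            __builtin_prefetch(&data[idx[i + PF_DIST]], 0, 1);
        sum += data[idx[i]];
    }
    return sum;
}
```

Prefetching too far ahead evicts the line before it is used; too close, and the load has not completed in time, so `PF_DIST` is worth measuring rather than guessing.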

Rule of thumb: measure with perf stat; if your L1 miss rate is above ~5%, there is likely a data-layout opportunity.
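As a sketch, on Linux the L1 counters can be read like this (these are common perf event aliases; availability varies by CPU, see `perf list`):

```shell
# Count L1 data-cache loads and misses for a whole run;
# perf prints the miss percentage alongside the raw counts.
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./a.out
```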