# Memory Hierarchy

From registers to DRAM — understanding latency, bandwidth, and why cache-friendly code wins.

## The Pyramid
| Level | Typical size | Latency (cycles) |
|---|---|---|
| Registers | ~few hundred B | 0–1 |
| L1 cache | 32–64 KB | ~4 |
| L2 cache | 256 KB – 1 MB | ~12 |
| L3 cache | 8–64 MB | ~40 |
| DRAM | 16–512 GB | ~200 |
| NVMe SSD | TB-scale | ~10 000+ |
Every level trades capacity for speed. A program that fits its working set in L1 will feel instant; one that thrashes DRAM will crawl.
## Cache Lines and Spatial Locality
Caches on most modern CPUs operate on 64-byte lines. Accessing one byte pulls in the entire line, so iterating a contiguous array of 4-byte ints misses only once per 16 elements — this is spatial locality at work.
```c
// cache-friendly: sequential access — one miss per 64-byte line
for (int i = 0; i < N; i++)
    sum += a[i];

// cache-hostile: strided access — with STRIDE >= 16 (4-byte ints),
// every access touches a fresh line
for (int i = 0; i < N; i += STRIDE)
    sum += a[i];
```
## Prefetching
Hardware prefetchers detect sequential and strided patterns automatically. Software prefetch intrinsics (__builtin_prefetch, _mm_prefetch) give you explicit control when access patterns are irregular.
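As a sketch of explicit prefetching, consider summing through an index array — a gather pattern the hardware prefetcher cannot predict. This uses the GCC/Clang builtin from the text; the look-ahead distance of 8 is an illustrative value you would tune by measurement:

```c
#include <stddef.h>

// Sum a[idx[0..n-1]] — an irregular access pattern. We issue a
// prefetch for the element AHEAD iterations ahead so it is (ideally)
// resident by the time the loop reaches it.
long gather_sum(const long *a, const size_t *idx, size_t n) {
    enum { AHEAD = 8 };  // illustrative look-ahead; tune per workload
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[idx[i + AHEAD]], /*rw=*/0, /*locality=*/1);
        sum += a[idx[i]];
    }
    return sum;
}
```

The third argument (temporal locality, 0–3) hints how long the line should stay cached; prefetching too far ahead evicts useful data, too near hides no latency.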
Rule of thumb: measure with `perf stat` — if your L1 miss rate is above ~5 %, there is likely a data-layout opportunity.
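A typical invocation looks like the following (`./prog` is a placeholder; the generic cache event names vary by CPU and kernel, so check `perf list` on your machine):

```shell
# Count L1 data-cache loads and misses for a run of ./prog;
# divide misses by loads to get the miss rate.
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./prog
```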