Caches and Memory Hierarchy

From registers to DRAM — understanding latency, bandwidth, and why cache-friendly code wins.

Updated March 18, 2026


The Pyramid

Level       Typical size      Latency (cycles)
Registers   ~few hundred B    0–1
L1 cache    32–64 KB          ~4
L2 cache    256 KB – 1 MB     ~12
L3 cache    8–64 MB           ~40
DRAM        16–512 GB         ~200
NVMe SSD    TB-scale          ~10 000+

Every level trades capacity for speed. A program that fits its working set in L1 will feel instant; one that thrashes DRAM will crawl.

Registers

Registers are the fastest level of the hierarchy. They live on the CPU core itself and hold the operands the core is actively computing on.

Cache

Cache Lines and Spatial Locality

Caches operate on 64-byte lines. Accessing one byte pulls in the entire line, so iterating a contiguous array is essentially free after the first miss — this is spatial locality at work.

int32_t a[N];      // assume N elements, initialized elsewhere
int64_t sum = 0;

// cache-friendly: sequential access
for (int i = 0; i < N; i++)
    sum += a[i];

// cache-hostile: strided access
for (int i = 0; i < N; i += STRIDE)
    sum += a[i];
Questions
  1. What is the STRIDE in the example code?
Answers
  1. A cache line is 64 bytes and each int32_t is 4 bytes, so STRIDE = 64 / 4 = 16. At this stride every access lands in a different cache line, eliminating all spatial locality benefit.

Cache Types

  • Inclusive cache: every line held in an inner level (e.g. L1) is also present in the outer level (e.g. L2), which simplifies coherence lookups.
  • Exclusive cache: a line lives in at most one level at a time, so the levels' capacities effectively add up.

Size vs. Speed

There is a fundamental trade-off between cache capacity and access latency: the more lines a cache can hold, the longer it takes to locate one. That is why L1 is small and fast while L3 is large and slow.

Prefetching

Hardware prefetchers detect sequential and strided patterns automatically. Software prefetch intrinsics (__builtin_prefetch, _mm_prefetch) give you explicit control when access patterns are irregular.
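As a sketch of explicit prefetching for an index-driven gather, where the hardware prefetcher cannot predict the pattern. `__builtin_prefetch` is a GCC/Clang extension; `PREFETCH_DIST` and the function name are illustrative choices, and the right distance must be tuned per workload:

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DIST 8   /* how far ahead to hint; tune per workload */

/* Sum a[idx[0..n-1]]. The indices are arbitrary, so we prefetch the
   element we will need PREFETCH_DIST iterations from now. */
int64_t gather_sum(const int32_t *a, const size_t *idx, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[idx[i + PREFETCH_DIST]],
                               0 /* read */, 1 /* low temporal locality */);
        sum += a[idx[i]];
    }
    return sum;
}
```

The prefetch is purely a hint: the result is identical with or without it, only the miss latency changes.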

Rule of thumb: Measure with perf stat — if your L1 miss rate is above ~5 %, there is likely a data layout opportunity.