SIMD & Vectorization
SSE, AVX-512, NEON — processing multiple data elements per instruction for ML and HPC workloads.
Single Instruction, Multiple Data
SIMD instructions operate on wide registers (128–512 bits) to process 4–16 FP32 values with a single instruction. This is the backbone of fast BLAS kernels, inference runtimes, and media codecs.
| ISA Extension | Register Width | Floats per Op (FP32) |
|---|---|---|
| SSE | 128 bit | 4 |
| AVX2 | 256 bit | 8 |
| AVX-512 | 512 bit | 16 |
| ARM NEON | 128 bit | 4 |
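The "Floats per Op" column is just register width divided by the 32-bit FP32 lane size. As a sanity check (fp32_lanes is an illustrative helper, not a real API):

```c
// Illustrative helper: FP32 lanes per operation
// = register width in bits / 32 bits per float.
int fp32_lanes(int register_bits) {
    return register_bits / 32;
}
```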
Auto-Vectorization
Modern compilers (GCC, Clang) can auto-vectorize simple loops:
void add(float *a, const float *b, int n) {
for (int i = 0; i < n; i++)
a[i] += b[i];
}
Compile with -O2 -march=native (older GCC needs -O3 to vectorize; -O2 suffices from GCC 12) and check with -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang).
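A common blocker is possible pointer aliasing: the compiler can't prove a and b don't overlap, so it either emits a runtime overlap check or gives up. A sketch using C99 restrict to hand the vectorizer that guarantee (add_restrict is an illustrative name, not from the text above):

```c
// 'restrict' promises the compiler that a and b never overlap,
// so it can vectorize without emitting a runtime aliasing check.
void add_restrict(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```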
Intrinsics
When the compiler can’t figure it out, hand-written intrinsics give direct control:
#include <immintrin.h>
void add_avx(float *a, const float *b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {        // 8 floats per 256-bit op
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                  // scalar tail when n % 8 != 0
        a[i] += b[i];
}
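Intrinsics are easy to get subtly wrong (the tail handling above is a classic source of bugs), so it pays to check against the plain scalar loop. A self-contained sketch, assuming an AVX-capable x86-64 CPU; the target attribute is a GCC/Clang extension that lets this one function compile without passing -mavx globally:

```c
#include <immintrin.h>

// Vector body plus scalar tail; correctness is verified against
// plain scalar addition in the usage below.
__attribute__((target("avx")))
void add_avx_checked(float *a, const float *b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)  // scalar tail when n is not a multiple of 8
        a[i] += b[i];
}
```

Checking with n = 11 deliberately exercises both the 8-wide body and the 3-element tail.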
Practical Tips
- Align buffers to 32/64 bytes for best throughput.
- Avoid mixing SSE and AVX without _mm256_zeroupper() — transition penalties are real.
- Profile with perf stat -e fp_arith_inst_retired.256b_packed_single to confirm vectorized execution.
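For the alignment tip, C11 aligned_alloc is a portable way to get 32/64-byte buffers (alloc_aligned_floats is an illustrative helper, not a standard API; note that aligned_alloc requires the size to be a multiple of the alignment):

```c
#include <stdlib.h>

// Allocate n floats on a 64-byte boundary (cache-line / AVX-512 width).
// aligned_alloc (C11) requires size to be a multiple of the alignment,
// so round the byte count up to the next multiple of 64.
float *alloc_aligned_floats(size_t n) {
    size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}
```

The caller frees the buffer with plain free(), as with any aligned_alloc allocation.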
For ML: Libraries like oneDNN and XNNPACK already contain hand-tuned SIMD kernels — don’t reinvent unless you have very specific needs.