SIMD & Vectorization

SSE, AVX-512, NEON — processing multiple data elements per instruction for ML and HPC workloads.


Single Instruction, Multiple Data

SIMD instructions operate on wide registers (128–512 bits) to process 4–16 single-precision floats with a single instruction. This is the backbone of fast BLAS kernels, inference runtimes, and media codecs.

ISA Extension | Register Width | Floats per Op (FP32)
SSE           | 128-bit        | 4
AVX2          | 256-bit        | 8
AVX-512       | 512-bit        | 16
ARM NEON      | 128-bit        | 4

Auto-Vectorization

Modern compilers (GCC, Clang) can auto-vectorize simple loops:

void add(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

Compile with -O3 -march=native (GCC enables the vectorizer at -O2 only from GCC 12 onward) and check the vectorization report with -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang).
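One common obstacle to auto-vectorization is pointer aliasing: if a and b might overlap, the compiler must prove they don't or emit runtime overlap checks. A hypothetical restrict-qualified variant (the name add_restrict is illustrative, not from the original) sidesteps this:

```c
/* Sketch: restrict (C99) promises the compiler that a and b never
   overlap, so the loop can be vectorized without runtime alias checks.
   add_restrict is an illustrative name, not part of the original code. */
void add_restrict(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```

The generated code is often identical to the intrinsics version below, without the maintenance burden of ISA-specific source.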

Intrinsics

When the compiler can’t figure it out, hand-written intrinsics give direct control:

#include <immintrin.h>

void add_avx(float *a, const float *b, int n) {
    int i = 0;
    for (; i <= n - 8; i += 8) {            /* 8 FP32 lanes per 256-bit op */
        __m256 va = _mm256_loadu_ps(a + i); /* unaligned loads */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                      /* scalar tail: leftover n % 8 */
        a[i] += b[i];
}
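If you want SIMD without tying the source to one ISA, GCC and Clang offer vector extensions as a middle ground between auto-vectorization and intrinsics. A sketch under that assumption (the type v8sf and function add_vext are illustrative names):

```c
/* Sketch using the GCC/Clang vector_size extension: v8sf is eight
   packed FP32 lanes; the compiler lowers the lane-wise add to whatever
   SIMD the target supports (AVX on x86, NEON on ARM, else scalar). */
typedef float v8sf __attribute__((vector_size(32)));

void add_vext(float *a, const float *b, int n) {
    int i = 0;
    for (; i <= n - 8; i += 8) {
        v8sf va, vb;
        __builtin_memcpy(&va, a + i, sizeof va); /* unaligned load */
        __builtin_memcpy(&vb, b + i, sizeof vb);
        va += vb;                                /* lane-wise add */
        __builtin_memcpy(a + i, &va, sizeof va); /* store back */
    }
    for (; i < n; i++)                           /* scalar tail */
        a[i] += b[i];
}
```

This is not portable ISO C, but it compiles on both x86 and ARM targets with GCC or Clang and needs no ISA-specific flags.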

Practical Tips

  • Align buffers to 32/64 bytes for best throughput.
  • Call _mm256_zeroupper() before executing legacy SSE code after AVX code; the AVX-to-SSE transition penalties are real.
  • Profile with perf stat -e fp_arith_inst_retired.256b_packed_single (Intel-specific event) to confirm 256-bit vectorized execution.
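On the alignment tip: a minimal sketch of allocating a 64-byte-aligned buffer with C11 aligned_alloc, which requires the total size to be a multiple of the alignment (alloc_aligned_floats is an illustrative helper name):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: allocate a 64-byte-aligned FP32 buffer. C11 aligned_alloc
   requires the byte count to be a multiple of the alignment, so the
   size is rounded up to the next multiple of 64. */
float *alloc_aligned_floats(size_t n) {
    size_t bytes  = n * sizeof(float);
    size_t padded = (bytes + 63) & ~(size_t)63; /* round up to 64 */
    return aligned_alloc(64, padded);
}
```

With an aligned pointer you can use _mm256_load_ps / _mm256_store_ps instead of the unaligned variants, and cache-line-crossing accesses disappear.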

For ML: libraries like oneDNN and XNNPACK already contain hand-tuned SIMD kernels; don't reinvent them unless you have very specific needs.