Adds aarch64-specific SIMD paths, with routing logic in mod.rs that selects the best available instruction set at runtime. NEON is always available on aarch64; SVE is gated at compile time on a nightly toolchain and a non-Apple target.
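For reviewers, a minimal sketch of what such routing could look like. This is not the code in mod.rs: `SimdPath`, `select_path`, `sve_detected`, `sum`, `sum_neon`, and `sum_scalar` are hypothetical names chosen for illustration. The nightly gate is omitted, so the SVE branch only records the decision and reuses the NEON kernel.

```rust
/// Hypothetical label for the code path the router picks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdPath {
    Scalar,
    Neon,
    Sve,
}

// SVE is only probed on non-Apple aarch64 targets (assumption: a real build
// would additionally gate this behind a nightly-only cfg).
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
fn sve_detected() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

#[cfg(all(target_arch = "aarch64", target_vendor = "apple"))]
fn sve_detected() -> bool {
    false
}

/// On aarch64, NEON is the baseline, so it is the fallback when SVE is absent.
#[cfg(target_arch = "aarch64")]
pub fn select_path() -> SimdPath {
    if sve_detected() {
        SimdPath::Sve
    } else {
        SimdPath::Neon
    }
}

/// Every other architecture takes the scalar path in this sketch.
#[cfg(not(target_arch = "aarch64"))]
pub fn select_path() -> SimdPath {
    SimdPath::Scalar
}

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

#[cfg(target_arch = "aarch64")]
fn sum_neon(xs: &[f32]) -> f32 {
    use std::arch::aarch64::{vaddq_f32, vaddvq_f32, vdupq_n_f32, vld1q_f32};
    let chunks = xs.chunks_exact(4);
    // Elements that do not fill a whole 4-lane vector are summed separately.
    let tail: f32 = chunks.remainder().iter().sum();
    // SAFETY: NEON is part of the aarch64 baseline, and every chunk holds
    // exactly four f32 values, so each 128-bit load stays in bounds.
    let lanes = unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for c in chunks {
            acc = vaddq_f32(acc, vld1q_f32(c.as_ptr()));
        }
        vaddvq_f32(acc) // horizontal add of the four lanes
    };
    lanes + tail
}

/// Route a call to the kernel matching the selected path.
pub fn sum(xs: &[f32]) -> f32 {
    match select_path() {
        // Sketch only: the SVE path reuses the NEON kernel here.
        #[cfg(target_arch = "aarch64")]
        SimdPath::Neon | SimdPath::Sve => sum_neon(xs),
        _ => sum_scalar(xs),
    }
}
```

Callers would use only `sum(&data)`; the `cfg` attributes decide which kernels are compiled in, and the runtime probe decides between them on hardware where more than one is present.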