OK, I've written a simple benchmark that implements an elementwise multiply (A = B*C) in three different ways: standard C, intrinsics, and hand-coded inline assembly. On the face of it, the results suggest that vectorization works best on medium-sized inputs.

If people could post the results of running the benchmark on their machines (it takes ~1 min), along with the output of gcc --version and their chip model, that would be very useful. It should be compiled with:

    gcc -msse -O2 vec_bench.c -o vec_bench

Here are two sets of results:

CPU: Core Duo T2500 @ 2GHz
gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)

Problem size  Simple              Intrin              Inline
         100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
        1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
       10000   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
      100000   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
     1000000   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
    10000000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)

CPU: Intel Xeon E5345 @ 2.33GHz
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)

Problem size  Simple              Intrin              Inline
         100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
        1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
       10000   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
      100000   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
     1000000   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
    10000000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)

James
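
P.S. For anyone reading without vec_bench.c to hand, here is a rough sketch of what the "Simple" and "Intrin" variants of an elementwise multiply might look like. It is an illustration only, not the benchmark source: the function names mul_simple and mul_intrin, the float element type (assumed because plain -msse only covers single precision), and the choice of unaligned loads and stores are all assumptions.

```c
/* Illustrative sketch only; NOT the contents of vec_bench.c.
 * Names, element type and load/store choice are assumptions. */
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics; build with gcc -msse */

/* Plain C version: a[i] = b[i] * c[i] for every element. */
void mul_simple(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* SSE version: multiplies four floats per iteration.
 * Unaligned loads/stores are used so that any pointers are accepted. */
void mul_intrin(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}
```

If the arrays were guaranteed to be 16-byte aligned, _mm_load_ps and _mm_store_ps could be used in place of the unaligned variants. Whether that makes a measurable difference is exactly the sort of thing a benchmark like this one would have to show.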