[New-bugs-announce] [issue31834] BLAKE2: the (pure) SSE2 impl forced on x86_64 is slower than reference

Michał Górny report at bugs.python.org
Sat Oct 21 03:57:11 EDT 2017

New submission from Michał Górny <mgorny at gentoo.org>:

The setup.py file for Python states:

        if (not cross_compiling and
                os.uname().machine == "x86_64" and
                sys.maxsize >  2**32):
            # Every x86_64 machine has at least SSE2.  Check for sys.maxsize
            # in case that kernel is 64-bit but userspace is 32-bit.
            blake2_macros.append(('BLAKE2_USE_SSE', '1'))

While the assertion about having SSE2 is true, it doesn't mean that it's worthwhile to use. I've tested the pure SSE2 implementation (i.e. without SSSE3 and so on) on three different machines, getting the following results (times in seconds, lower is better):

Athlon64 X2 (SSE2 is the best supported variant), 540 MiB of data:

SSE2: [5.189988004000043, 5.070812243997352]
ref:  [2.0161159170020255, 2.0475422790041193]

Core i3, same data file:

SSE2: [1.924425926999902, 1.92461746999993, 1.9298037500000191]
ref:  [1.7940209749999667, 1.7900855569999976, 1.7835538760000418]

Xeon E5630 server, 230 MiB data file:

SSE2: [0.7671358410007088, 0.7797677099879365, 0.7648976119962754]
ref:  [0.5784736709902063, 0.5717909929953748, 0.5717219939979259]

So in all the tested cases, the pure SSE2 implementation is *slower* than the reference implementation. SSSE3 and other variants are faster and AFAIU they are enabled automatically based on CFLAGS, so it doesn't matter for most systems.
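
To make the comparison concrete, here is a small sketch that turns the timings quoted above into slowdown ratios. The numbers are copied from this report; the dictionary layout and the `slowdown` helper are just illustrative names, not part of any benchmark script I used.

```python
from statistics import mean

# Timings in seconds, copied verbatim from the results in this report.
timings = {
    "Athlon64 X2": {
        "sse2": [5.189988004000043, 5.070812243997352],
        "ref": [2.0161159170020255, 2.0475422790041193],
    },
    "Core i3": {
        "sse2": [1.924425926999902, 1.92461746999993, 1.9298037500000191],
        "ref": [1.7940209749999667, 1.7900855569999976, 1.7835538760000418],
    },
    "Xeon E5630": {
        "sse2": [0.7671358410007088, 0.7797677099879365, 0.7648976119962754],
        "ref": [0.5784736709902063, 0.5717909929953748, 0.5717219939979259],
    },
}


def slowdown(sse2, ref):
    """Mean SSE2 time divided by mean reference time (>1 means SSE2 is slower)."""
    return mean(sse2) / mean(ref)


ratios = {cpu: slowdown(t["sse2"], t["ref"]) for cpu, t in timings.items()}
for cpu, ratio in ratios.items():
    print(f"{cpu}: SSE2 takes {ratio:.2f}x the reference time")
```

Every ratio comes out above 1, with the Athlon64 X2 being by far the worst case.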

However, for old CPUs that do not support SSSE3, the choice of SSE2 makes the algorithm prohibitively slow -- on the Athlon64 X2 it's about 2.5 times slower than the reference implementation!
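
For anyone who wants to check their own hardware, a minimal timing sketch along these lines should do. It is not the exact script behind the numbers above: the `time_blake2b` helper and the 16 MiB zero-filled payload are assumptions made for illustration (the report used 230-540 MiB data files). Comparing SSE2 against the reference code still requires two separate builds of Python, one with and one without the BLAKE2_USE_SSE macro, and running the same script under each.

```python
import hashlib
import timeit


def time_blake2b(data, repeat=3):
    """Return a list of wall-clock timings (seconds), one full hash of `data` per run."""
    return timeit.repeat(
        lambda: hashlib.blake2b(data).digest(),
        number=1,       # hash the buffer once per measurement
        repeat=repeat,  # take several measurements to see the spread
    )


# Assumed payload: 16 MiB of zero bytes, small enough to run quickly anywhere.
payload = bytes(16 * 1024 * 1024)
print(time_blake2b(payload))
```

Run under both builds, the two printed lists can be compared directly, in the same way as the SSE2/ref pairs above.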

components: Extension Modules
messages: 304696
nosy: mgorny
priority: normal
severity: normal
status: open
title: BLAKE2: the (pure) SSE2 impl forced on x86_64 is slower than reference
type: performance
versions: Python 3.6

Python tracker <report at bugs.python.org>
