[Python-Dev] cpython (2.7): Fix comment blocks. Adjust blocksize to a power-of-two for better divmod

Victor Stinner victor.stinner at gmail.com
Mon Jun 24 12:50:01 CEST 2013

2013/6/24 Raymond Hettinger <raymond.hettinger at gmail.com>:
> Lastly, there was a change I just put in to Py 3.4 replacing
> the memcpy() with a simple loop and replacing the
> "deque->" references with local variables.  Besides
> giving a small speed-up, it made the code more clear
> and less at the mercy of various implementations
> of memcpy().
> Ideally, I would like 2.7 and 3.3 to replace their use of
> memcpy() as well, but the flavor of this thread suggests
> that is right out.

memcpy() is usually highly optimized, with hand-written assembler for
each architecture. The GNU libc goes further: it can choose the
fastest version at runtime depending on the CPU's features (MMX, SSE,
etc.). If I understood correctly, glibc ships several versions of
memcpy(), and the dynamic linker (ld.so) selects one depending on the
CPU.

GCC is also able to inline memcpy() when the size is known at compile
time. I have also seen it emit two code paths when the size is only
known at runtime: an inline version for small sizes, and a function
call for larger copies. Python has a Py_MEMCPY() macro which
implements exactly that, but only for Visual Studio; I suppose Visual
Studio does not implement this optimization itself. By the way,
Py_MEMCPY() is only used in a few places.

So it is surprising to read that a plain loop is faster than
memcpy()... even though I have already seen this in my own
micro-benchmarks :-) Do you have an idea of how we can decide between
the plain loop and memcpy()? Using a benchmark? Or can it be decided
just by reading the C code?

What is the policy for using Py_MEMCPY() vs memcpy()?


More information about the Python-Dev mailing list