[Python-Dev] cpython (2.7): Fix comment blocks. Adjust blocksize to a power-of-two for better divmod

Mon Jun 24 02:16:32 CEST 2013

On Jun 18, 2013, at 12:00 AM, Gregory P. Smith <greg at krypto.org> wrote:

> Why did you do this in the 2.7 branch?
> 
> It hasn't been done in 3.3 or default

I worked on the 2.7 branch first because that was the one I had loaded
and the one where I did timings and code disassembly.  I intended to
post it to 3.3 and 3.4 as well over the weekend.   

Ideally, it makes maintenance simpler for me if I keep the branches
the same as possible.  I viewed the one-line struct transposition,
comment correction, and one-line blocklen change as being somewhat 
innocuous non-algorithmic changes.  The struct change fixed an unintended
cache miss for left links and the blocksize change causes the
deque_index code to compile more efficiently (using a right-shift
and bitwise-and and rather than a measurably more expensive
division and modulo calculation).

I truly wasn't expecting the Spanish Inquisition :-)

On Jun 22, 2013, at 1:43 PM, Scott Dial <scott+python-dev at scottdial.com> wrote:

> changeset 84248:f1dc30a1be72 2.7
> Arrange structure to match the common access patterns.
> 
>     1.1 --- a/Modules/_collectionsmodule.c
>     1.2 +++ b/Modules/_collectionsmodule.c
>     1.3 @@ -48,8 +48,8 @@
>     1.4
>     1.5  typedef struct BLOCK {
>     1.6      struct BLOCK *leftlink;
>     1.7 +    PyObject *data[BLOCKLEN];
>     1.8      struct BLOCK *rightlink;
>     1.9 -    PyObject *data[BLOCKLEN];
>    1.10  } block;
>    1.11
>    1.12  #define MAXFREEBLOCKS 10
> 
> , which seems like a strange micro-optimization, at best.

Yes, this is a micro-optimization.  In working on implementing
deque slicing for 3.4, I restudied the block access patterns.   
On an appendleft(), popleft() or extendleft() operation, the left link is
accessed immediately before or after the leftmost entry in the data block. 
The old struct arrangement can cause an unnecessary cache miss
when jumping leftward.  This was something I overlooked when I
originally wrote the code almost a decade ago.

On Jun 23, 2013, at 11:38 AM, Benjamin Peterson <benjamin at python.org> wrote:

> I've backed this one out, too.

Really, you reverted my one-line change within 24 hours of it being posted?
I can't be on-line every day and sometimes it takes a little while to catch up
with python email.

On Jun 22, 2013, at 2:55 PM, Guido van Rossum <guido at python.org> wrote:

> Actually the data buffer is an array of pointers too, so with the
> original BLOCKLEN value of 62, sizeof(block) would be 64 times
> sizeof(PyObject *). In the Py3 version of the source there's even a
> comment explaining this right before the #define BLOCKLEN. Raymond,
> can you explain?

I also tried to fix that comment so it would stop emphasizing the blocklen
being a multiple of the cache line.   Also long as there is a reasonably long
data block, it matters less whether the data block size is an exact multiple
of the cache line length (62 or 64 words of data versus a typical 8 byte
cache line).

The data block size does matter elsewhere though.
The benefit of the having the data block being a power-of-two
is that the deque_index computation can use bits shifts
rather division and modulo calculations.  The benefit
of switching data block size from 62 to 64 was measurable
(a few percent) and observable in the disassembly of the code.

I experimented with one other ordering as well
(putting the data block first and the links afterwards).
That saved the scaled-index byte in the generated code
but produced no measureable speed-up.

In short, I believe the following should go in:

* The comment fix. (Incorrectly suggesting that a 64-bit 
   Py_ssize_t would overflow).  The revised comment
   is simpler, more general, and correct.

* Putting the leftlink before the data block in the structure.

The follow is up for debate:

Changing the BLOCKLEN from 62 to 64 is debatable.
It measureably improved deque_index without an
observable negative effect on the other operations.

Lastly, there was a change I just put in to Py 3.4 replacing
the memcpy() with a simple loop and replacing the
"deque->" references with local variables.  Besides
giving a small speed-up, it made the code more clear
and less at the mercy of various implementations
of memcpy().

Ideally, I would like 2.7 and 3.3 to replace their use of
memcpy() as well, but the flavor of this thread suggests
that is right out.

Raymond

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20130623/9d70fa47/attachment-0001.html>