[Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion

Wed Mar 20 19:24:24 CET 2013

OK, here goes, the issue is the following...

The optimization is question optimizes to row-by-row or
column-by-column copying for 2-d arrays when possible, namely when:

1. the input array (where the array in question is Block.values) is
c-contiguous for takes along axis0 or f-contiguous for takes along
axis1 of the array, and
2. the contiguity of the output array matches the contiguity of the input

Almost all the time, Block.values is stored c-contiguously, such that
each row of the Block corresponds to a column of the DataFrame. So the
optimization only really kicks in, effectively, when reindexing along
the column axis of the DataFrame (i.e. axis 0 of the Block); it
basically means we call memmove once per DataFrame column rather than
iterating in a loop and copying elements.  This is good because most
sane DataFrame objects are have more rows than columns, so we call
memmove few times (i.e. once per column) for a large block of values
(i.e. all rows for that column at a time), so any overhead from
calling memmove will be outweighed by the benefit of a hand optimized
copy (which probably involves vectorization, alignment/cache
optimization, loop unrolling, etc.)

C-contiguous blocks result from basically every Pandas operation that
operates on blocks, with the only exceptions of (as far as I can tell)
creating a DataFrame directly from a 2-d ndarray or creating the
transpose of a homogenous DataFrame (but not a heterogenous one)
without copying; this is basically an optimization to avoid creating
the c-contigous version of an array when the f-contiguous one is
already available, but it's the exception rather than the rule and
pretty any modification of the DataFrame will immediately require
reallocation and copying to a new c-contiguous block.

Unfortunately many of the DataFrame tests, including the two in
question here, are (for simplicity) only testing the case where a
homogenous 2-d data is passed to the DataFrame, which results in
(non-representative) f-contiguous blocks. An additional issue with
this test is that it's creating a very long but thin array (10,000
long, 4 wide) and reindexing along the index dimension, so row-by-row
(from the DataFrame perspective) copying is done over and over using
memmove on 4 element arrays. Furthermore, the alignment and width in
bytes of each 4 element array happens to be a convenient multiple of
128bits, which is the multiple required for vectorized SIMD
instructions, so it turns out the element-by-element copying is fairly
efficient when such operations are available (as is guaranteed on
x86-64, but not necessarily x86-32), and the call to memmove has more
overhead than element-by-element copying.

So the issue is basically only happening because all the following are true:

1. The DataFrame is constructed directly by a 2-d homogenous ndarray
(which has the default c-contiguous continuity, so the block becomes
f-contiguous).
2. There has been no operation after construction of the DataFrame
requiring reallocation of any sort (otherwise the block would become
c-contiguous).
3. The reindexing is done on the index axis (otherwise no optimization
would be triggered, since it requires the right axis/contiguity
combination).
4. The DataFrame is long but thin (otherwise memmove would not be
called repeatedly to do small copies).
5. The C compiler is not inlining memmove properly, for whatever reason, and
6. (possibly) The alignment/width of the data happens to be such that
SIMD operations can be used directly, so the overhead of the eliding
the loop is not very great and exceeded by the overhead of the
memmove.

To be honest, it's common C practice to call memmove/memcpy (the
performance of the two don't really differ from my testing in this
case) even for very small arrays and assuming that the implementation
is sane enough to inline it and do the right thing either way, so I'm
really surprised about #5: I would not have thought it to be an issue
with a modern compiler, since calling memcpy can't do anything but
provide the compiler more, not less, information about your intentions
(and the overhead of the memmove aliasing check is not significant
here).

Anyway, so it's a corner case, and I didn't catch it originally
because I tested independently the effect of 1) allocates the output
array to be f-contiguous instead of c-contiguous by default when the
input array is f-contiguous and 2) converting loops into memmove when
possible, both of which have a positive performance effect
independently but combine to adversely affect these two tests.

I can revert the change that "allocates the output array to be
f-contiguous instead of c-contiguous by default when the input array
is f-contiguous", meaning that this optimization will almost never be
triggered for an f-contiguous input array (unless the caller
explicitly provides an output array as f-contiguous), but I'd rather
not because the optimization is actually kind of useful in less
degenerate cases when you want to quickly produce a reindexed version
of a f-contiguous array, for whatever reason, even though the cases
are rarer.

So I think what I'm going to do instead, to avoid the degenerate case
above, is to trigger the optimization only when the take operation is
done along the shorter of the two dimensions (i.e. so the copied
dimension is the longer of the two): that will definitely fix this
test (since it'll avoid this optimization completely) but I suppose
there might be other degenerate cases I haven't thought about it. I'll
submit a PR later today for this, if no one finds any objection to the
idea.

However, I think it might be skewed our performance results to be
testing DataFrame objects constructed by 2-d ndarrays, since they're
not representative; in addition to the issue above, it means that many
tests are actually incorporating the cost of converting an
f-contiguous array into a c-contiguous array on top of what they're
actually trying to test. Two possible solutions are:

1. Change DataFrame constructor (and possibly DataFrame.T) to
normalize all blocks as c-contiguous.
2. Leave DataFrame constructor as-is but either change existing tests
to exercise the more common use case (c-contiguous blocks) or add them
in addition to the current ones.

I think #2 is probably best, since #1 will have a performance impact
for the use cases (however rare) where an entire workflow can avoid
triggering conversion from f-contiguous blocks to c-contiguous blocks.

Let me know what you all think,
Stephen

On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin <swlin at post.harvard.edu> wrote:
> Ahh! I figured it out...the platform issue is part of it, but mostly
> it's that two (independently tested) commits had a weird effect when
> merged.
>
> And the reason they did so is because this particular test turns out
> all of our reindexing tests are testing something very
> non-representative, because of the way they're constructed, so we're
> not really getting representative performance data unfortunately (it
> has to do with the DataFrame constructor and c-contiguity vs
> f-contiguity). We should probably write new tests to fix this issue.
>
> I'll write up a fuller explanation when I get a chance. Anyway, sorry
> for sending you on a git bisect goose chase, Jeff.
>
> Stephen
>
> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin <swlin at post.harvard.edu> wrote:
>> As per the "we're getting too chatty on GitHub" comment, should we be
>> moving extended issue discussion about bugs to this list whenever
>> possible?
>>
>> I posted a few comments on #3089 just now but realized maybe starting
>> an e-mail chain would be better..
>>
>> Anyway, I'm looking into the issue, I suspect it's a corner case due
>> to an array that's very large in one dimension but small in another,
>> and possibly that there's compiler and architecture differences
>> causing different results as well....Jeff, do you mind sending me your
>> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine
>> you ran vb_suite on?
>>
>> I'll set up a 64-bit dev machine going forward so I can test on both platforms.
>>
>> Thanks,
>> Stephen