[Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion

Jeff Reback jeffreback at gmail.com
Wed Mar 20 19:56:17 CET 2013


awesome explanation, Stephen!

I'd vote for #2

essentially create a testing constructor (kind of like y-p's mkdf),
but one that creates only a numpy random array, c-contiguous by default
(with an option for f-contiguous), and then use that everywhere
(EVERYWHERE!) we currently call np.random.randn...
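
e.g. something along these lines (just a sketch; the name and signature
are made up):

    import numpy as np

    def randn_array(shape, order='C'):
        # hypothetical test helper: random ndarray, c-contiguous by
        # default, with an opt-in f-contiguous layout
        arr = np.random.randn(*shape)
        return np.asfortranarray(arr) if order == 'F' else arr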

and second, I guess if it helps, look at the c/f-contiguousness
of the ops where appropriate...

my 2c




On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin <swlin at post.harvard.edu> wrote:

> OK, here goes, the issue is the following...
>
> The optimization in question switches to row-by-row or
> column-by-column copying for 2-d arrays when possible, namely when:
>
> 1. the input array (where the array in question is Block.values) is
> c-contiguous for takes along axis 0, or f-contiguous for takes along
> axis 1 of the array, and
> 2. the contiguity of the output array matches the contiguity of the input.
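>
> In code, the guard amounts to roughly the following (a minimal sketch
> of the two conditions above, not the actual pandas internals):
>
>     def fast_take_possible(values, out, axis):
>         # condition 1: whole rows (axis 0) are contiguous chunks only
>         # in a c-contiguous array; whole columns (axis 1) only in an
>         # f-contiguous one
>         flag = 'C_CONTIGUOUS' if axis == 0 else 'F_CONTIGUOUS'
>         # condition 2: the output layout must match the input layout
>         return values.flags[flag] and out.flags[flag]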
>
> Almost all the time, Block.values is stored c-contiguously, such that
> each row of the Block corresponds to a column of the DataFrame. So the
> optimization only really kicks in, effectively, when reindexing along
> the column axis of the DataFrame (i.e. axis 0 of the Block); it
> basically means we call memmove once per DataFrame column rather than
> iterating in a loop and copying elements. This is good because most
> sane DataFrame objects have more rows than columns, so we call
> memmove few times (i.e. once per column) on a large block of values
> (i.e. all rows for that column at a time), and any overhead from
> calling memmove is outweighed by the benefit of a hand-optimized
> copy (which probably involves vectorization, alignment/cache
> optimization, loop unrolling, etc.).
>
> C-contiguous blocks result from basically every Pandas operation that
> operates on blocks, the only exceptions (as far as I can tell) being
> creating a DataFrame directly from a 2-d ndarray or creating the
> transpose of a homogeneous DataFrame (but not a heterogeneous one)
> without copying; this is basically an optimization to avoid creating
> the c-contiguous version of an array when the f-contiguous one is
> already available, but it's the exception rather than the rule, and
> pretty much any modification of the DataFrame will immediately require
> reallocation and copying to a new c-contiguous block.
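>
> You can see the layout directly with numpy (a quick illustration;
> Block.values for a homogeneous frame is, as far as I can tell,
> effectively the transposed view of the input array):
>
>     import numpy as np
>
>     arr = np.random.randn(10000, 4)
>     arr.flags['C_CONTIGUOUS']    # True: numpy's default layout
>     arr.T.flags['F_CONTIGUOUS']  # True: the transpose is just a view,
>                                  # hence the f-contiguous block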
>
> Unfortunately many of the DataFrame tests, including the two in
> question here, are (for simplicity) only testing the case where
> homogeneous 2-d data is passed to the DataFrame, which results in
> (non-representative) f-contiguous blocks. An additional issue with
> this test is that it's creating a very long but thin array (10,000
> long, 4 wide) and reindexing along the index dimension, so row-by-row
> (from the DataFrame perspective) copying is done over and over using
> memmove on 4-element arrays. Furthermore, the alignment and width in
> bytes of each 4-element array happens to be a convenient multiple of
> 128 bits, which is the multiple required for vectorized SIMD
> instructions, so it turns out the element-by-element copying is fairly
> efficient when such operations are available (as is guaranteed on
> x86-64, but not necessarily x86-32), and the call to memmove has more
> overhead than element-by-element copying.
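>
> For reference, the degenerate shape is easy to reproduce (an
> approximation of the affected test, not the exact benchmark code):
>
>     import numpy as np
>     import pandas as pd
>
>     df = pd.DataFrame(np.random.randn(10000, 4))  # f-contiguous block
>     df2 = df.reindex(np.random.permutation(df.index))  # one memmove
>                                                        # per 4-element row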
>
> So the issue is basically only happening because all the following are
> true:
>
> 1. The DataFrame is constructed directly from a 2-d homogeneous ndarray
> (which has the default c-contiguous layout, so the block becomes
> f-contiguous).
> 2. There has been no operation after construction of the DataFrame
> requiring reallocation of any sort (otherwise the block would become
> c-contiguous).
> 3. The reindexing is done on the index axis (otherwise no optimization
> would be triggered, since it requires the right axis/contiguity
> combination).
> 4. The DataFrame is long but thin (otherwise memmove would not be
> called repeatedly to do small copies).
> 5. The C compiler is not inlining memmove properly, for whatever reason,
> and
> 6. (possibly) The alignment/width of the data happens to be such that
> SIMD operations can be used directly, so the benefit of eliding
> the loop is not very great and is exceeded by the overhead of the
> memmove call.
>
> To be honest, it's common C practice to call memmove/memcpy (the
> performance of the two doesn't really differ, from my testing, in this
> case) even for very small arrays, assuming that the implementation
> is sane enough to inline it and do the right thing either way, so I'm
> really surprised about #5: I would not have thought it to be an issue
> with a modern compiler, since calling memcpy can't do anything but
> provide the compiler more, not less, information about your intentions
> (and the overhead of the memmove aliasing check is not significant
> here).
>
> Anyway, so it's a corner case, and I didn't catch it originally
> because I independently tested the effects of 1) allocating the output
> array to be f-contiguous instead of c-contiguous by default when the
> input array is f-contiguous and 2) converting loops into memmove calls
> when possible, both of which have a positive performance effect
> independently but combine to adversely affect these two tests.
>
> I can revert the change that "allocates the output array to be
> f-contiguous instead of c-contiguous by default when the input array
> is f-contiguous", meaning that this optimization will almost never be
> triggered for an f-contiguous input array (unless the caller
> explicitly provides an output array as f-contiguous), but I'd rather
> not because the optimization is actually kind of useful in less
> degenerate cases when you want to quickly produce a reindexed version
> of an f-contiguous array, for whatever reason, even though such cases
> are rarer.
>
> So I think what I'm going to do instead, to avoid the degenerate case
> above, is to trigger the optimization only when the take operation is
> done along the shorter of the two dimensions (i.e. so the copied
> dimension is the longer of the two): that will definitely fix this
> test (since it'll avoid this optimization completely) but I suppose
> there might be other degenerate cases I haven't thought about. I'll
> submit a PR later today for this, if no one finds any objection to the
> idea.
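>
> Concretely, the extra guard would be something like this (a
> hypothetical sketch of the condition, not the final code):
>
>     def take_along_shorter_dim(values, axis):
>         # only use the memmove fast path when taking along the
>         # shorter dimension, so each memmove copies the longer one
>         return values.shape[axis] <= values.shape[1 - axis]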
>
> However, I think it might be skewing our performance results to be
> testing DataFrame objects constructed from 2-d ndarrays, since they're
> not representative; in addition to the issue above, it means that many
> tests are actually incorporating the cost of converting an
> f-contiguous array into a c-contiguous array on top of what they're
> actually trying to test. Two possible solutions are:
>
> 1. Change DataFrame constructor (and possibly DataFrame.T) to
> normalize all blocks as c-contiguous.
> 2. Leave the DataFrame constructor as-is but either change existing
> tests to exercise the more common use case (c-contiguous blocks) or
> add such tests in addition to the current ones (see the sketch below).
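>
> For #2, one cheap way to construct a frame with representative
> c-contiguous blocks, assuming the constructor just stores the
> transposed view as described above, would be to hand it an
> f-contiguous array:
>
>     import numpy as np
>     import pandas as pd
>
>     arr = np.asfortranarray(np.random.randn(10000, 4))
>     df = pd.DataFrame(arr)  # the (transposed) block is c-contiguous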
>
> I think #2 is probably best, since #1 will have a performance impact
> for the use cases (however rare) where an entire workflow can avoid
> triggering conversion from f-contiguous blocks to c-contiguous blocks.
>
> Let me know what you all think,
> Stephen
>
> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin <swlin at post.harvard.edu>
> wrote:
> > Ahh! I figured it out...the platform issue is part of it, but mostly
> > it's that two (independently tested) commits had a weird effect when
> > merged.
> >
> > And the reason they did so is that, as this particular test reveals,
> > all of our reindexing tests are testing something very
> > non-representative, because of the way they're constructed, so we're
> > not really getting representative performance data unfortunately (it
> > has to do with the DataFrame constructor and c-contiguity vs
> > f-contiguity). We should probably write new tests to fix this issue.
> >
> > I'll write up a fuller explanation when I get a chance. Anyway, sorry
> > for sending you on a git bisect goose chase, Jeff.
> >
> > Stephen
> >
> > On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin <swlin at post.harvard.edu>
> > wrote:
> >> As per the "we're getting too chatty on GitHub" comment, should we be
> >> moving extended issue discussion about bugs to this list whenever
> >> possible?
> >>
> >> I posted a few comments on #3089 just now but realized maybe starting
> >> an e-mail chain would be better..
> >>
> >> Anyway, I'm looking into the issue, I suspect it's a corner case due
> >> to an array that's very large in one dimension but small in another,
> >> and possibly that there are compiler and architecture differences
> >> causing different results as well... Jeff, do you mind sending me
> >> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine
> >> you ran vb_suite on?
> >>
> >> I'll set up a 64-bit dev machine going forward so I can test on both
> >> platforms.
> >>
> >> Thanks,
> >> Stephen
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> http://mail.python.org/mailman/listinfo/pandas-dev
>

