awesome explanation Stephen!<div><br></div><div>I'd vote for #2</div><div><br></div><div>essentially create a testing constructor (kind of like y-p's mkdf),</div><div>but creates only a numpy random array, that by default is c-continguous</div>

<div>(with option for f ), and then use that where we have (EVERYWHERE)!</div><div>np.random.randn.......</div><div><br></div><div>and second I guess if it helps, look at the c/f contiguous ness </div><div>of the ops where appropriate...</div>

<div><br></div><div>my 2c</div><div><br></div><div><br></div><div><br></div><div><br></div><div><div class="gmail_quote">On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin <span dir="ltr"><<a href="mailto:swlin@post.harvard.edu" target="_blank">swlin@post.harvard.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">OK, here goes, the issue is the following...<br>

<br>

The optimization is question optimizes to row-by-row or<br>

column-by-column copying for 2-d arrays when possible, namely when:<br>

<br>

1. the input array (where the array in question is Block.values) is<br>

c-contiguous for takes along axis0 or f-contiguous for takes along<br>

axis1 of the array, and<br>

2. the contiguity of the output array matches the contiguity of the input<br>

<br>

Almost all the time, Block.values is stored c-contiguously, such that<br>

each row of the Block corresponds to a column of the DataFrame. So the<br>

optimization only really kicks in, effectively, when reindexing along<br>

the column axis of the DataFrame (i.e. axis 0 of the Block); it<br>

basically means we call memmove once per DataFrame column rather than<br>

iterating in a loop and copying elements.  This is good because most<br>

sane DataFrame objects are have more rows than columns, so we call<br>

memmove few times (i.e. once per column) for a large block of values<br>

(i.e. all rows for that column at a time), so any overhead from<br>

calling memmove will be outweighed by the benefit of a hand optimized<br>

copy (which probably involves vectorization, alignment/cache<br>

optimization, loop unrolling, etc.)<br>

<br>

C-contiguous blocks result from basically every Pandas operation that<br>

operates on blocks, with the only exceptions of (as far as I can tell)<br>

creating a DataFrame directly from a 2-d ndarray or creating the<br>

transpose of a homogenous DataFrame (but not a heterogenous one)<br>

without copying; this is basically an optimization to avoid creating<br>

the c-contigous version of an array when the f-contiguous one is<br>

already available, but it's the exception rather than the rule and<br>

pretty any modification of the DataFrame will immediately require<br>

reallocation and copying to a new c-contiguous block.<br>

<br>

Unfortunately many of the DataFrame tests, including the two in<br>

question here, are (for simplicity) only testing the case where a<br>

homogenous 2-d data is passed to the DataFrame, which results in<br>

(non-representative) f-contiguous blocks. An additional issue with<br>

this test is that it's creating a very long but thin array (10,000<br>

long, 4 wide) and reindexing along the index dimension, so row-by-row<br>

(from the DataFrame perspective) copying is done over and over using<br>

memmove on 4 element arrays. Furthermore, the alignment and width in<br>

bytes of each 4 element array happens to be a convenient multiple of<br>

128bits, which is the multiple required for vectorized SIMD<br>

instructions, so it turns out the element-by-element copying is fairly<br>

efficient when such operations are available (as is guaranteed on<br>

x86-64, but not necessarily x86-32), and the call to memmove has more<br>

overhead than element-by-element copying.<br>

<br>

So the issue is basically only happening because all the following are true:<br>

<br>

1. The DataFrame is constructed directly by a 2-d homogenous ndarray<br>

(which has the default c-contiguous continuity, so the block becomes<br>

f-contiguous).<br>

2. There has been no operation after construction of the DataFrame<br>

requiring reallocation of any sort (otherwise the block would become<br>

c-contiguous).<br>

3. The reindexing is done on the index axis (otherwise no optimization<br>

would be triggered, since it requires the right axis/contiguity<br>

combination).<br>

4. The DataFrame is long but thin (otherwise memmove would not be<br>

called repeatedly to do small copies).<br>

5. The C compiler is not inlining memmove properly, for whatever reason, and<br>

6. (possibly) The alignment/width of the data happens to be such that<br>

SIMD operations can be used directly, so the overhead of the eliding<br>

the loop is not very great and exceeded by the overhead of the<br>

memmove.<br>

<br>

To be honest, it's common C practice to call memmove/memcpy (the<br>

performance of the two don't really differ from my testing in this<br>

case) even for very small arrays and assuming that the implementation<br>

is sane enough to inline it and do the right thing either way, so I'm<br>

really surprised about #5: I would not have thought it to be an issue<br>

with a modern compiler, since calling memcpy can't do anything but<br>

provide the compiler more, not less, information about your intentions<br>

(and the overhead of the memmove aliasing check is not significant<br>

here).<br>

<br>

Anyway, so it's a corner case, and I didn't catch it originally<br>

because I tested independently the effect of 1) allocates the output<br>

array to be f-contiguous instead of c-contiguous by default when the<br>

input array is f-contiguous and 2) converting loops into memmove when<br>

possible, both of which have a positive performance effect<br>

independently but combine to adversely affect these two tests.<br>

<br>

I can revert the change that "allocates the output array to be<br>

f-contiguous instead of c-contiguous by default when the input array<br>

is f-contiguous", meaning that this optimization will almost never be<br>

triggered for an f-contiguous input array (unless the caller<br>

explicitly provides an output array as f-contiguous), but I'd rather<br>

not because the optimization is actually kind of useful in less<br>

degenerate cases when you want to quickly produce a reindexed version<br>

of a f-contiguous array, for whatever reason, even though the cases<br>

are rarer.<br>

<br>

So I think what I'm going to do instead, to avoid the degenerate case<br>

above, is to trigger the optimization only when the take operation is<br>

done along the shorter of the two dimensions (i.e. so the copied<br>

dimension is the longer of the two): that will definitely fix this<br>

test (since it'll avoid this optimization completely) but I suppose<br>

there might be other degenerate cases I haven't thought about it. I'll<br>

submit a PR later today for this, if no one finds any objection to the<br>

idea.<br>

<br>

However, I think it might be skewed our performance results to be<br>

testing DataFrame objects constructed by 2-d ndarrays, since they're<br>

not representative; in addition to the issue above, it means that many<br>

tests are actually incorporating the cost of converting an<br>

f-contiguous array into a c-contiguous array on top of what they're<br>

actually trying to test. Two possible solutions are:<br>

<br>

1. Change DataFrame constructor (and possibly DataFrame.T) to<br>

normalize all blocks as c-contiguous.<br>

2. Leave DataFrame constructor as-is but either change existing tests<br>

to exercise the more common use case (c-contiguous blocks) or add them<br>

in addition to the current ones.<br>

<br>

I think #2 is probably best, since #1 will have a performance impact<br>

for the use cases (however rare) where an entire workflow can avoid<br>

triggering conversion from f-contiguous blocks to c-contiguous blocks.<br>

<br>

Let me know what you all think,<br>

Stephen<br>

<div class="HOEnZb"><div class="h5"><br>

On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin <<a href="mailto:swlin@post.harvard.edu">swlin@post.harvard.edu</a>> wrote:<br>

> Ahh! I figured it out...the platform issue is part of it, but mostly<br>

> it's that two (independently tested) commits had a weird effect when<br>

> merged.<br>

><br>

> And the reason they did so is because this particular test turns out<br>

> all of our reindexing tests are testing something very<br>

> non-representative, because of the way they're constructed, so we're<br>

> not really getting representative performance data unfortunately (it<br>

> has to do with the DataFrame constructor and c-contiguity vs<br>

> f-contiguity). We should probably write new tests to fix this issue.<br>

><br>

> I'll write up a fuller explanation when I get a chance. Anyway, sorry<br>

> for sending you on a git bisect goose chase, Jeff.<br>

><br>

> Stephen<br>

><br>

> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin <<a href="mailto:swlin@post.harvard.edu">swlin@post.harvard.edu</a>> wrote:<br>

>> As per the "we're getting too chatty on GitHub" comment, should we be<br>

>> moving extended issue discussion about bugs to this list whenever<br>

>> possible?<br>

>><br>

>> I posted a few comments on #3089 just now but realized maybe starting<br>

>> an e-mail chain would be better..<br>

>><br>

>> Anyway, I'm looking into the issue, I suspect it's a corner case due<br>

>> to an array that's very large in one dimension but small in another,<br>

>> and possibly that there's compiler and architecture differences<br>

>> causing different results as well....Jeff, do you mind sending me your<br>

>> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine<br>

>> you ran vb_suite on?<br>

>><br>

>> I'll set up a 64-bit dev machine going forward so I can test on both platforms.<br>

>><br>

>> Thanks,<br>

>> Stephen<br>

_______________________________________________<br>

Pandas-dev mailing list<br>

<a href="mailto:Pandas-dev@python.org">Pandas-dev@python.org</a><br>

<a href="http://mail.python.org/mailman/listinfo/pandas-dev" target="_blank">http://mail.python.org/mailman/listinfo/pandas-dev</a><br>

</div></div></blockquote></div><br></div>