Hi Robert,<br><br>

<div class="gmail_quote">On Thu, Feb 10, 2011 at 10:58 PM, Robert Kern <span dir="ltr"><<a href="mailto:robert.kern@gmail.com">robert.kern@gmail.com</a>></span> wrote:<br>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">

<div class="im">On Thu, Feb 10, 2011 at 14:29, eat <<a href="mailto:e.antero.tammi@gmail.com">e.antero.tammi@gmail.com</a>> wrote:<br>> Hi Robert,<br>><br>> On Thu, Feb 10, 2011 at 8:16 PM, Robert Kern <<a href="mailto:robert.kern@gmail.com">robert.kern@gmail.com</a>> wrote:<br>

>><br>>> On Thu, Feb 10, 2011 at 11:53, eat <<a href="mailto:e.antero.tammi@gmail.com">e.antero.tammi@gmail.com</a>> wrote:<br>>> > Thanks Chuck,<br>>> ><br>>> > for replying. But don't you still feel very odd that dot outperforms sum<br>

>> > in<br>>> > your machine? Just to get it simply; why sum can't outperform dot?<br>>> > Whatever<br>>> > architecture (computer, cache) you have, it don't make any sense at all<br>

>> > that<br>>> > when performing significantly less instructions, you'll reach to spend<br>>> > more<br>>> > time ;-).<br>>><br>>> These days, the determining factor is less often instruction count<br>

>> than memory latency, and the optimized BLAS implementations of dot()<br>>> heavily optimize the memory access patterns.<br>><br>> Can't we have this as well with simple sum?<br><br></div>It's technically feasible to accomplish, but as I mention later, it<br>

entails quite a large cost. Those optimized BLASes represent many<br>man-years of effort</blockquote>

<div>Yes, I acknowledge this. But didn't they then ignore something simpler, like sum (which could actually benefit from exactly the same kind of optimizations)?</div>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">and cause substantial headaches for people<br>building and installing numpy.</blockquote>

<div>I appreciate this. No doubt at all.</div>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">However, they are frequently worth it<br>because those operations are often bottlenecks in whole applications.<br>

sum(), even in its stupidest implementation, rarely is. In the places<br>where it is a significant bottleneck, an ad hoc implementation in C or<br>Cython or even FORTRAN for just that application is pretty easy to<br>write.</blockquote>


<div>But here I have to disagree; I think that I, at least (if not the majority of numpy users), don't want to dwell on such details, nor do I have the capability, time, or resources to do so. I'm sorry, but I have to restate that it's quite reasonable to expect sum to outperform dot in any case. Let's now start making the changes that enable sum to outperform dot.</div>
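<div>For concreteness, the kind of comparison I have in mind can be sketched roughly as follows. This is only a sketch: the array shape, the repeat count, and the helper names (row_sums_with_sum, row_sums_with_dot) are my own assumptions for illustration, and which of the two comes out faster will depend on the machine and on the BLAS that numpy was built against.</div>

<pre>
```python
import timeit

import numpy as np

# Assumed test data: a C-contiguous 2D float64 array (shape chosen arbitrarily).
a = np.random.rand(500, 500)
ones = np.ones(a.shape[1])


def row_sums_with_sum(x):
    # Generic path: reduction through the ufunc machinery.
    return x.sum(axis=1)


def row_sums_with_dot(x, v):
    # Same result as a matrix-vector product with a vector of ones;
    # numpy may hand this to an optimized BLAS if one is available.
    return np.dot(x, v)


# Both approaches compute the same row sums, up to floating-point rounding.
assert np.allclose(row_sums_with_sum(a), row_sums_with_dot(a, ones))

# Time each approach over the same number of repetitions.
t_sum = timeit.timeit(lambda: row_sums_with_sum(a), number=10)
t_dot = timeit.timeit(lambda: row_sums_with_dot(a, ones), number=10)
print("sum: %.4f s, dot: %.4f s" % (t_sum, t_dot))
```
</pre>

<div>No timings are quoted here on purpose; the numbers only mean something for the machine they were measured on.</div>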


<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">You can gain speed by specializing to just your use case, e.g.<br>contiguous data, summing down to one number, or summing along one axis<br>

of only 2D data, etc. There's usually no reason to try to generalize<br>that implementation to put it back into numpy.</blockquote>

<div>Yes, I would really like to specialize for my case, but without leaving the Python realm.</div>
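<div>Something along these lines is what I mean by staying in Python: a small helper, specialized to 2D floating-point data, that expresses the sum as a dot() with a vector of ones. The name sum_2d is hypothetical (it is not a numpy function), and this is just a sketch under the assumption that the data fits the float64 case; it is not meant as a general replacement for sum.</div>

<pre>
```python
import numpy as np


def sum_2d(a, axis=None):
    """Sum a 2D array via dot() with ones (hypothetical helper, not a numpy API).

    The input is converted to float64, so integer input loses its dtype.
    """
    a = np.asarray(a, dtype=float)
    if a.ndim != 2:
        raise ValueError("sum_2d only handles 2D arrays")
    if axis is None:
        # Sum down to one number: flatten, then take a dot product with ones.
        return np.dot(a.ravel(), np.ones(a.size))
    if axis == 0:
        # Column sums: ones (length m) times the (m, n) matrix.
        return np.dot(np.ones(a.shape[0]), a)
    if axis == 1:
        # Row sums: the (m, n) matrix times ones (length n).
        return np.dot(a, np.ones(a.shape[1]))
    raise ValueError("axis must be None, 0 or 1")


a = np.arange(12.0).reshape(3, 4)
total = sum_2d(a)
col_sums = sum_2d(a, axis=0)
row_sums = sum_2d(a, axis=1)

# The specialized helper agrees with the generic sum.
assert np.allclose(col_sums, a.sum(axis=0))
assert np.allclose(row_sums, a.sum(axis=1))
```
</pre>

<div>Whether this is actually faster than a.sum() would still have to be measured for each case.</div>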

<div> </div>

<div> </div>

<div>Thanks,</div>

<div>eat</div>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">

<div class="im"><br>>> Additionally, the number<br>>> of instructions in your dot() probably isn't that many more than the<br>>> sum(). The sum() is pretty dumb<br>><br>> But does it need to be?<br>

<br></div>As I also allude to later in my email, no, but there are still costs involved.<br>

<div class="im"><br>>> and just does a linear accumulation<br>>> using the ufunc reduce mechanism, so (m*n-1) ADDs plus quite a few<br>>> instructions for traversing the array in a generic manner. With fused<br>

>> multiply-adds, being able to assume contiguous data and ignore the<br>>> numpy iterator overhead, and applying divide-and-conquer kernels to<br>>> arrange sums, the optimized dot() implementations could have a<br>

>> comparable instruction count.<br>><br>> Couldn't sum benefit with similar logic?<br><br></div>Etc. I'm not going to keep repeating myself.<br><font color="#888888"><br>--<br></font>

<div>

<div></div>

<div class="h5">Robert Kern<br><br>"I have come to believe that the whole world is an enigma, a harmless<br>enigma that is made terrible by our own mad attempt to interpret it as<br>though it had an underlying truth."<br>

  -- Umberto Eco<br>_______________________________________________<br>NumPy-Discussion mailing list<br><a href="mailto:NumPy-Discussion@scipy.org">NumPy-Discussion@scipy.org</a><br><a href="http://mail.scipy.org/mailman/listinfo/numpy-discussion" target="_blank">http://mail.scipy.org/mailman/listinfo/numpy-discussion</a><br>

</div></div></blockquote></div><br>