[Numpy-discussion] [OT] Starving CPUs article featured in IEEE's ComputingNow portal
peridot.faceted at gmail.com
Sat Mar 20 17:36:24 EDT 2010
On 20 March 2010 16:18, Sebastian Haase <seb.haase at gmail.com> wrote:
> On Sat, Mar 20, 2010 at 8:22 PM, Anne Archibald
> <peridot.faceted at gmail.com> wrote:
>> On 20 March 2010 14:56, Dag Sverre Seljebotn
>> <dagss at student.matnat.uio.no> wrote:
>>> Pauli Virtanen wrote:
>>>> Anne Archibald wrote:
>>>>> I'm not knocking numpy; it does (almost) the best it can. (I'm not
>>>>> sure of the optimality of the order in which ufuncs are executed; I
>>>>> think some optimizations there are possible.)
>>>> Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC
>>>> dimensions are always traversed in order from left to right. Large
>>>> speedups are possible in some cases, but in a quick try I didn't manage to
>>>> come up with an algorithm that would always improve the speed (there was a
>>>> thread about this last year or so, and there's a ticket). Things varied
>>>> between computers, so this probably depends a lot on the actual cache
>>>> But perhaps numexpr has such heuristics, and we could steal them?
>>> At least in MultiIter (and I always assumed ufuncs too, but perhaps not)
>>> there's functionality to remove the largest dimension so that it can be
>>> put innermost in a loop. In many situations, removing the dimension with
>>> the smallest stride from the iterator would probably work much better.
>>> It's all about balancing iterator overhead and memory overhead. Something
>>> simple like "select the dimension with length > 200 which has smallest
>>> stride, or the dimension with largest length if none are above 200" would
>>> perhaps work well?
>> There's more to it than that: there's no point (I think) going with
>> the smallest stride if that stride is more than a cache line (usually
>> 64 bytes). There are further optimizations - often
>> stride[i]*shape[i]==shape[j] for some (i,j), and in that case these
>> two dimensions can be treated as one with stride stride[i] and length
>> shape[i]*shape[j]. Applying this would "unroll" all C- or
>> Fortran-contiguous arrays to a single non-nested loop.
>> Of course, reordering ufuncs is potentially hazardous in terms of
>> breaking user code: the effect of
>> depends on whether you run through A in index order or reverse index
>> order (which might be the order it's stored in memory, potentially
>> faster from a cache point of view).
> I think Travis O suggested at some point to make this kind of
> "overlapping inplace" operations invalid.
> (maybe it was someone else ...)
> The idea was to protect new ("uninitiated") users from unexpected results.
> Anyway, maybe one could use (something like) np.seterror to determine
> how to handle these cases...
I was in on that discussion. My recollection of the conclusion was
that on the one hand they're useful, carefully applied, while on the
other hand they're very difficult to reliably detect (since you don't
want to forbid operations on non-overlapping slices of the same
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
More information about the NumPy-Discussion