[Numpy-discussion] Sorting refactor

Julian Taylor jtaylor.debian at googlemail.com
Fri Jan 16 10:11:25 EST 2015


On 01/16/2015 03:14 PM, Lars Buitinck wrote:
> 2015-01-16 13:29 GMT+01:00  <numpy-discussion-request at scipy.org>:
>> Date: Fri, 16 Jan 2015 12:43:43 +0100
>> From: Julian Taylor <jtaylor.debian at googlemail.com>
>> Subject: Re: [Numpy-discussion] Sorting refactor
>> To: Discussion of Numerical Python <numpy-discussion at scipy.org>
>>
>> On 16.01.2015 12:33, Lars Buitinck wrote:
>>> * Quicksort has a quadratic time worst case. I think it should be
>>> turned into an introsort [1] for O(n log n) worst case; we have the
>>> heapsort needed to do that.
>>
>> This probably rarely happens in numeric data, and we do have guaranteed
>> O(n log n) algorithms available.
> 
> It's no more likely or unlikely than in any other type of data
> (AFAIK), but it's possible for an adversary to DOS any (web server)
> code that uses np.sort.

If you are using numpy in a non-isolated environment where an arbitrary
user is allowed to control the data, you have a problem anyway.
numpy is far from secure software; there are likely hundreds of ways to
produce a DOS and dozens of ways to achieve code execution in any
nontrivial numpy-using application.
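For reference, the introsort idea mentioned above can be sketched in a few
lines of Python (an illustration only, not NumPy's C implementation; the
depth budget of 2*log2(n) is the conventional choice):

```python
# Introsort sketch: quicksort until the recursion depth budget is
# exhausted, then fall back to heapsort so the worst case stays
# O(n log n) even on adversarial input.
import heapq
import math


def introsort(a):
    """Sort the list `a` in place."""
    if a:
        _intro(a, 0, len(a) - 1, 2 * int(math.log2(len(a))))


def _intro(a, lo, hi, depth):
    if hi - lo < 1:
        return
    if depth == 0:
        # Depth budget exhausted: heapsort this slice instead.
        a[lo:hi + 1] = _heapsorted(a[lo:hi + 1])
        return
    p = _partition(a, lo, hi)
    _intro(a, lo, p - 1, depth - 1)
    _intro(a, p + 1, hi, depth - 1)


def _partition(a, lo, hi):
    # Lomuto partition with the last element as pivot.
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i


def _heapsorted(xs):
    # Heapsort via the heapq module; O(n log n) guaranteed.
    heapq.heapify(xs)
    return [heapq.heappop(xs) for _ in range(len(xs))]
```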

> 
>> I was also thinking about this; an advantage of a sorting network is
>> that it can be vectorized to be significantly faster than an insertion
>> sort. Handling NaNs should also be directly possible.
> 
> Tried that, and it didn't give any speedup, at least not without
> explicit vector instructions.
> 
> Just moving the NaNs aside didn't cost anything in my preliminary
> benchmarks (without sorting nets), the cost of the operation was
> almost exactly compensated by simpler comparisons.
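The "move the NaNs aside" approach can be sketched as follows (a
Python/NumPy illustration of the idea, not the actual C code;
`sort_nans_last` is a hypothetical helper name):

```python
# Partition NaNs to the tail first, then sort only the NaN-free prefix
# with plain comparisons. np.sort already places NaNs last, so the result
# matches its ordering.
import numpy as np


def sort_nans_last(values):
    a = np.asarray(values, dtype=float)
    mask = np.isnan(a)
    nnan = int(mask.sum())
    # Non-NaN values first, NaNs at the end.
    a = np.concatenate([a[~mask], a[mask]])
    # Sort the prefix in place; no NaN checks needed in the comparison.
    a[:a.size - nnan].sort()
    return a
```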

An SSE2 implementation of a 16-entry bitonic sort is available here:
https://github.com/mischasan/sse2/blob/master/ssesort.c
There is also a benchmark; on my machine it's 6 times faster than
insertion sort.
But again, this would only gain us a 5-10% improvement at best, as the
partitioning step of quicksort is still the major time-consuming part.

> 
>> The issue is that it's probably too much complicated code for only a
>> very small gain.
> 
> Maybe. The thing is that the selection algorithms are optimized for
> NaNs and seem to depend on the current comparison code. We'd need
> distinct <TYPE>_LT and <TYPE>_LT_NONAN for each <TYPE>.
> 
> The sorting nets themselves aren't complicated, just lengthy. My
> branch has the length-optimal (not optimally parallel) ones for n <=
> 16.
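As an illustration of how short such nets are, here is the length-optimal
5-comparator network for n = 4 in pure Python (the speedups discussed
above require SIMD min/max on the C side; this only shows the structure):

```python
# Length-optimal sorting network for 4 elements: 5 compare-exchange
# operations, applied in a fixed, data-independent order.
NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(values):
    a = list(values)
    for i, j in NET4:
        if a[j] < a[i]:  # compare-exchange: swap if out of order
            a[i], a[j] = a[j], a[i]
    return a
```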
> 
>> But I assume you will run into the same trouble as with OpenMP; that
>> needs testing. Also, Cilk Plus in gcc is not really production-ready
>> yet; I got lots of crashes when I last tried it (it will be in 5.0,
>> though).
> 
> The data parallel constructs tend to crash the compiler, but task
> spawning seems to be stable in 4.9.2. I've still to see how it handles
> multiprocessing/fork.
> 
> What do you mean by will be in 5.0, did they do a big push?

The gcc 5.0 changelog reports "full support for Cilk Plus".
Also, all the bugs I have filed have been fixed in 5.0.

> 
> 
>> Date: Fri, 16 Jan 2015 13:28:56 +0100
>> From: Daπid <davidmenhur at gmail.com>
>> Subject: Re: [Numpy-discussion] Sorting refactor
>> To: Discussion of Numerical Python <numpy-discussion at scipy.org>
>>
>> On 16 January 2015 at 13:15, Eelco Hoogendoorn
>> <hoogendoorn.eelco at gmail.com> wrote:
>>> Perhaps an algorithm can be made faster, but often these multicore
>>> algorithms are also less efficient, and a less data-dependent way of
>>> putting my cores to good use would have been preferable. Potentially,
>>> other code could slow down due to cache thrashing if too many tasks
>>> run in parallel. I'd rather be in charge of such matters myself; but I
>>> imagine adding a keyword arg for this would not be much trouble?
>>
>> As I understand it, that is where the strength of Cilk Plus lies. It
>> does not force parallelisation, just suggests it. The decision to
>> actually spawn in parallel is made at runtime, depending on the load of
>> the other cores.
> 
> cilk+ guarantees that the amount of space used by a pool of P threads
> is at most P times the stack space used by the sequential version (+ a
> constant). The idea is that you can say
> 
> for (i = 0; i < 1000000; i++) {
>     cilk_spawn f(a[i]);
> }
> 
> and it will never create more than P work items in memory, rather than
> 1e6, even if each f() spawns a bunch itself. Of course, it won't
> guarantee that OpenMP will not also spawn P threads and/or check that
> you're one of P processes cooperating on a task using multiprocessing.
> 
> Personally I'd rather have an np.setnumthreads() to turn this on or
> off for a process and have the runtime distribute work for me instead
> of having to do it myself.
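For what such a process-wide switch could look like, here is a hedged
sketch (np.setnumthreads does not exist in NumPy; `setnumthreads`,
`parallel_map`, and the module-level `_NUM_THREADS` are hypothetical
illustrations of the idea, not a proposed API):

```python
# A module-level thread-count setting that parallel routines consult at
# call time: 1 means run sequentially, anything higher enables a pool.
import os
from concurrent.futures import ThreadPoolExecutor

_NUM_THREADS = 1  # default: sequential


def setnumthreads(n):
    """Set the process-wide worker count; None means one per CPU."""
    global _NUM_THREADS
    _NUM_THREADS = max(1, n if n is not None else os.cpu_count() or 1)


def parallel_map(fn, chunks):
    """Apply fn to each chunk, honoring the global thread setting."""
    if _NUM_THREADS == 1:
        return [fn(c) for c in chunks]
    with ThreadPoolExecutor(max_workers=_NUM_THREADS) as ex:
        return list(ex.map(fn, chunks))
```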
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> 



