Re: [Numpy-discussion] NumPy-Discussion Digest, Vol 100, Issue 28
2015-01-16 13:29 GMT+01:00 <numpy-discussion-request@scipy.org>:
Date: Fri, 16 Jan 2015 12:43:43 +0100 From: Julian Taylor <jtaylor.debian@googlemail.com> Subject: Re: [Numpy-discussion] Sorting refactor To: Discussion of Numerical Python <numpy-discussion@scipy.org> Message-ID: <54B8F96F.7090903@googlemail.com> Content-Type: text/plain; charset=windows-1252
On 16.01.2015 12:33, Lars Buitinck wrote:
* Quicksort has a quadratic time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
This probably rarely happens in numeric data, and we do have guaranteed O(n log n) runtime algorithms available.
It's no more or less likely than in any other type of data (AFAIK), but it's possible for an adversary to DoS any (web server) code that uses np.sort.
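A minimal sketch of the introsort idea being proposed, in illustrative Python rather than NumPy's actual C code: plain quicksort with a recursion budget of roughly 2·log2(n); if the budget runs out (as it would on adversarial input), the slice is handed to heapsort, bounding the worst case at O(n log n).

```python
import heapq

def introsort(a, small=16):
    """In-place introsort sketch: quicksort with a depth limit;
    past the limit the slice falls back to heapsort."""
    def heapsort(lo, hi):
        h = a[lo:hi]
        heapq.heapify(h)
        a[lo:hi] = [heapq.heappop(h) for _ in range(hi - lo)]

    def insertion(lo, hi):
        # insertion sort for small runs, as quicksorts usually do
        for i in range(lo + 1, hi):
            x, j = a[i], i
            while j > lo and a[j - 1] > x:
                a[j] = a[j - 1]
                j -= 1
            a[j] = x

    def sort(lo, hi, depth):
        while hi - lo > small:
            if depth == 0:           # too many bad pivots: bail out
                heapsort(lo, hi)
                return
            depth -= 1
            # median-of-three pivot, Hoare-style partition
            mid = (lo + hi) // 2
            pivot = sorted((a[lo], a[mid], a[hi - 1]))[1]
            i, j = lo, hi - 1
            while i <= j:
                while a[i] < pivot:
                    i += 1
                while a[j] > pivot:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            sort(lo, j + 1, depth)   # recurse on the left part
            lo = i                   # loop on the right part
        insertion(lo, hi)

    n = len(a)
    sort(0, n, 2 * max(n, 2).bit_length())
```

The only change relative to a stock quicksort is the depth counter and the heapsort fallback, which is why the thread suggests it: NumPy already has the heapsort needed.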
I was also thinking about this; an advantage of a sorting network is that it can be vectorized to be significantly faster than an insertion sort. Handling NaNs should also be directly possible.
Tried that, and it didn't give any speedup, at least not without explicit vector instructions. Just moving the NaNs aside didn't cost anything in my preliminary benchmarks (without sorting nets); the cost of the operation was almost exactly compensated by the simpler comparisons.
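The move-NaNs-aside trick can be sketched in a few lines (plain Python for clarity; the real work happens at the C level inside NumPy's sort kernels): one pass partitions NaNs to the tail, and the NaN-free prefix is then sorted with plain comparisons that never have to test for NaN.

```python
import math

def sort_nans_last(values):
    """Two-phase sort sketch: NaNs are set aside first, so the
    actual sort uses simple NaN-free comparisons."""
    # Phase 1: stable partition, finite values first, NaNs at the tail.
    ok = [x for x in values if not math.isnan(x)]
    nans = [x for x in values if math.isnan(x)]
    # Phase 2: sort only the NaN-free prefix.
    ok.sort()
    return ok + nans
```

The benchmark claim in the message is exactly this trade: the partition pass costs one scan, but every comparison inside the sort gets cheaper because it no longer needs NaN-aware semantics.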
The issue is that it's probably too much complicated code for only a very small gain.
Maybe. The thing is that the selection algorithms are optimized for NaNs and seem to depend on the current comparison code. We'd need distinct <TYPE>_LT and <TYPE>_LT_NONAN for each <TYPE>. The sorting nets themselves aren't complicated, just lengthy. My branch has the length-optimal (not optimally parallel) ones for n <= 16.
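For reference, a sorting net is just a fixed, data-independent list of compare-exchange steps, which is what "lengthy but not complicated" means here. A sketch of the length-optimal network for n = 4 (five comparators; the networks on the branch mentioned above are longer tables of the same shape, up to n = 16):

```python
# Length-optimal sorting network for n = 4: five compare-exchange
# steps in a fixed order, independent of the input data.
NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4(seq):
    a = list(seq)
    for i, j in NET4:
        if a[j] < a[i]:              # compare-exchange
            a[i], a[j] = a[j], a[i]
    return a
```

Because the comparator sequence never depends on the data, the same code path runs for every input, which is also why such networks vectorize well.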
But I assume you will get the same trouble as with OpenMP; that needs testing. Also, Cilk+ in gcc is not really production-ready yet; I got lots of crashes when I last tried it (it will be in 5.0, though).
The data-parallel constructs tend to crash the compiler, but task spawning seems to be stable in 4.9.2. I've yet to see how it handles multiprocessing/fork. What do you mean by "will be in 5.0", did they do a big push?
Date: Fri, 16 Jan 2015 13:28:56 +0100 From: Daπid <davidmenhur@gmail.com> Subject: Re: [Numpy-discussion] Sorting refactor To: Discussion of Numerical Python <numpy-discussion@scipy.org> Message-ID: <CAJhcF=1O5Own_5ydzu+To8HHbm3e66k=iuNQrEiaSDY23DNW6w@mail.gmail.com> Content-Type: text/plain; charset=UTF-8
On 16 January 2015 at 13:15, Eelco Hoogendoorn <hoogendoorn.eelco@gmail.com> wrote:
Perhaps an algorithm can be made faster, but often these multicore algorithms are also less efficient, and a less data-dependent way of putting my cores to good use would have been preferable. Potentially, other code could slow down due to cache thrashing if too many tasks run in parallel. I'd rather be in charge of such matters myself; but I imagine adding a keyword arg for these matters would not be much trouble?
As I understand it, that is where the strength of Cilk+ lies. It does not force parallelisation, just suggests it. Whether to actually spawn in parallel is decided at runtime depending on the load of the other cores.
Cilk+ guarantees that the amount of space used by a pool of P threads is at most P times the stack space used by the sequential version (+ a constant). The idea is that you can say

for (i = 0; i < 1000000; i++) { cilk_spawn f(a[i]); }

and it will never create more than P work items in memory, rather than 1e6, even if each f() spawns a bunch itself. Of course, it won't guarantee that OpenMP will not also spawn P threads, and/or check that you're one of P processes cooperating on a task using multiprocessing. Personally I'd rather have an np.setnumthreads() to turn this on or off for a process and have the runtime distribute work for me instead of having to do it myself.
On 01/16/2015 03:14 PM, Lars Buitinck wrote:
2015-01-16 13:29 GMT+01:00 <numpy-discussion-request@scipy.org>:
Date: Fri, 16 Jan 2015 12:43:43 +0100 From: Julian Taylor <jtaylor.debian@googlemail.com> Subject: Re: [Numpy-discussion] Sorting refactor To: Discussion of Numerical Python <numpy-discussion@scipy.org> Message-ID: <54B8F96F.7090903@googlemail.com> Content-Type: text/plain; charset=windows-1252
On 16.01.2015 12:33, Lars Buitinck wrote:
* Quicksort has a quadratic time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
This probably rarely happens in numeric data, and we do have guaranteed O(n log n) runtime algorithms available.
It's no more or less likely than in any other type of data (AFAIK), but it's possible for an adversary to DoS any (web server) code that uses np.sort.
If you are using numpy in a non-isolated environment where an arbitrary user is allowed to control the data, you have a problem anyway. numpy is far from secure software; there are likely hundreds of ways to produce a DoS and dozens of ways to get code execution in any nontrivial numpy-using application.
I was also thinking about this; an advantage of a sorting network is that it can be vectorized to be significantly faster than an insertion sort. Handling NaNs should also be directly possible.
Tried that, and it didn't give any speedup, at least not without explicit vector instructions.
Just moving the NaNs aside didn't cost anything in my preliminary benchmarks (without sorting nets); the cost of the operation was almost exactly compensated by the simpler comparisons.
An SSE2 implementation of a 16-entry bitonic sort is available here: https://github.com/mischasan/sse2/blob/master/ssesort.c There is also a benchmark; on my machine it's 6 times faster than insertion sort. But again, this would only gain us a 5-10% improvement at best, as the partition part of quicksort is still the major time-consuming part.
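The vectorization win can be illustrated at the NumPy level (a sketch of the principle, not the SSE2 code linked above): because a network applies the same comparator sequence to every input, many small sorts run in lockstep, with each compare-exchange becoming a whole-column min/max, the same way the SSE2 bitonic sort turns comparators into SIMD instructions.

```python
import numpy as np

# Length-optimal 4-element network, as a fixed comparator table.
NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4_rows(m):
    """Sort every length-4 row of a 2-D array at once by running the
    fixed comparator sequence over whole columns."""
    m = np.array(m, dtype=float)
    for i, j in NET4:
        lo = np.minimum(m[:, i], m[:, j])   # vectorized compare-
        hi = np.maximum(m[:, i], m[:, j])   # exchange on all rows
        m[:, i], m[:, j] = lo, hi
    return m
```

As the message notes, though, speeding up the small-sort leaves of quicksort only helps so much while partitioning dominates the runtime.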
The issue is that it's probably too much complicated code for only a very small gain.
Maybe. The thing is that the selection algorithms are optimized for NaNs and seem to depend on the current comparison code. We'd need distinct <TYPE>_LT and <TYPE>_LT_NONAN for each <TYPE>.
The sorting nets themselves aren't complicated, just lengthy. My branch has the length-optimal (not optimally parallel) ones for n <= 16.
But I assume you will get the same trouble as with OpenMP; that needs testing. Also, Cilk+ in gcc is not really production-ready yet; I got lots of crashes when I last tried it (it will be in 5.0, though).
The data-parallel constructs tend to crash the compiler, but task spawning seems to be stable in 4.9.2. I've yet to see how it handles multiprocessing/fork.
What do you mean by will be in 5.0, did they do a big push?
The gcc 5.0 changelog reports "full support for Cilk Plus". Also, all the bugs I have filed have been fixed in 5.0.
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
I agree; an np.setnumthreads to manage a numpy-global threadpool makes sense to me. Of course, there are a great many cases where just spawning as many threads as cores is a sensible default, but if this kind of behavior could not be overridden, I could see that greatly reducing performance for some of my more complex projects.
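A hypothetical np.setnumthreads-style knob could be as simple as a module-level setting that parallel kernels consult; the sketch below is illustrative only (the names set_num_threads and parallel_map are invented here, not an actual NumPy interface):

```python
import concurrent.futures
import os

# Default: one worker per core, overridable per process.
_num_threads = os.cpu_count() or 1

def set_num_threads(n):
    """Set the global worker count consulted by parallel kernels."""
    global _num_threads
    _num_threads = max(1, int(n))

def parallel_map(fn, chunks):
    """Apply fn to each chunk, honoring the global thread setting;
    with one thread, no pool is created at all."""
    if _num_threads == 1:
        return [fn(c) for c in chunks]
    with concurrent.futures.ThreadPoolExecutor(_num_threads) as pool:
        return list(pool.map(fn, chunks))
```

This captures both sides of the discussion: parallelism on by default, but a single switch lets an application that manages its own concurrency turn it off.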
participants (3)
- Eelco Hoogendoorn
- Julian Taylor
- Lars Buitinck