
===== Nanny =====
Nanny uses the magic of Cython to give you a faster, drop-in replacement for the NaN functions in NumPy and SciPy.
For example::
    >> import nanny as ny
    >> import numpy as np
    >> arr = np.random.rand(100, 100)
    >> timeit np.nansum(arr)
    10000 loops, best of 3: 67.5 us per loop
    >> timeit ny.nansum(arr)
    100000 loops, best of 3: 18.2 us per loop
Let's not forget to add some NaNs::
    >> arr[arr > 0.5] = np.nan
    >> timeit np.nansum(arr)
    1000 loops, best of 3: 411 us per loop
    >> timeit ny.nansum(arr)
    10000 loops, best of 3: 65 us per loop
Nanny uses a separate Cython function for each combination of ndim, dtype, and axis. You can get rid of a lot of overhead (useful in an inner loop, e.g.) by directly importing the function that matches your problem::
    >> arr = np.random.rand(10, 10)
    >> from nansum import nansum_2d_float64_axis1
    >> timeit np.nansum(arr, axis=1)
    10000 loops, best of 3: 25.5 us per loop
    >> timeit ny.nansum(arr, axis=1)
    100000 loops, best of 3: 5.15 us per loop
    >> timeit nansum_2d_float64_axis1(arr)
    1000000 loops, best of 3: 1.75 us per loop
I put together Nanny as a way to learn Cython. It currently only supports:
- functions: nansum
- Operating systems: 64-bit (accumulator for int32 is hard coded to int64)
- dtype: int32, int64, float64
- ndim: 1, 2, and 3
If there is interest in the project, I could continue adding the remaining NaN functions from NumPy and SciPy: nanmin, nanmax, nanmean, nanmedian (using a partial sort), nanstd. But why stop there? How about nancumsum or nanprod? Or anynan, which could short-circuit once a NaN is found?
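As a rough illustration of the short-circuit idea, a pure-Python sketch of a hypothetical anynan could look like this (the real thing would be a typed Cython loop)::

    import numpy as np

    def anynan(arr):
        "Return True as soon as the first NaN is found, False otherwise."
        for x in arr.flat:
            if x != x:  # NaN is the only value that is not equal to itself
                return True
        return False

    # anynan(np.array([1.0, np.nan, 2.0])) returns True after two elements,
    # without scanning the rest of the array.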
Feedback on the code or the direction of the project is welcome. So is coding help---without that I doubt the package will ever be completed. Once nansum is complete, many of the remaining functions will be copy, paste, and touch-up operations.
Remember, Nanny quickly protects your precious data from the corrupting influence of Mr. Nan.
License
=======
Nanny is distributed under a Simplified BSD license. Parts of NumPy, which has a BSD license, are included in Nanny. See the LICENSE file, which is distributed with Nanny, for details.
Installation
============
You can grab Nanny at http://github.com/kwgoodman/nanny.
nansum of ints is only supported by 64-bit operating systems at the moment.
**GNU/Linux, Mac OS X, et al.**
To install Nanny::
    $ python setup.py build
    $ sudo python setup.py install
Or, if you wish to specify where Nanny is installed, for example inside ``/usr/local``::
    $ python setup.py build
    $ sudo python setup.py install --prefix=/usr/local
**Windows**
In order to compile the C code in Nanny you need a Windows version of the gcc compiler. MinGW (Minimalist GNU for Windows) contains gcc and has been used to successfully compile Nanny on Windows.
Install MinGW and add it to your system path. Then install Nanny with the commands::
    python setup.py build --compiler=mingw32
    python setup.py install
**Post install**
After you have installed Nanny, run the suite of unit tests::
    >>> import nanny
    >>> nanny.test()
    <snip>
    Ran 1 tests in 0.008s
    OK
    <nose.result.TextTestResult run=1 errors=0 failures=0>

On Fri, Nov 19, 2010 at 10:33 AM, Keith Goodman kwgoodman@gmail.com wrote:
Nanny uses the magic of Cython to give you a faster, drop-in replacement for the NaN functions in NumPy and SciPy.
Neat!
Why not make this a patch to numpy/scipy instead?
Nanny uses a separate Cython function for each combination of ndim, dtype, and axis. You can get rid of a lot of overhead (useful in an inner loop, e.g.) by directly importing the function that matches your problem::
>> arr = np.random.rand(10, 10)
>> from nansum import nansum_2d_float64_axis1
If this is really useful, then better to provide a function that finds the correct function for you?
best_nansum = ny.get_best_nansum(ary[0, :, :], axis=1)
for i in xrange(ary.shape[0]):
    best_nansum(ary[i, :, :], axis=1)
- functions: nansum
- Operating systems: 64-bit (accumulator for int32 is hard coded to int64)
- dtype: int32, int64, float64
- ndim: 1, 2, and 3
What does it even mean to do NaN operations on integers? (I'd sometimes find it *really convenient* if there were a NaN value for standard computer integers... but there isn't?)
-- Nathaniel

On Fri, Nov 19, 2010 at 12:55 PM, Nathaniel Smith njs@pobox.com wrote:
What does it even mean to do NaN operations on integers? (I'd sometimes find it *really convenient* if there were a NaN value for standard computer integers... but there isn't?)
-- Nathaniel
That's why I use masked arrays. It is dtype agnostic.
I am curious if there are any lessons that were learned in making Nanny that could be applied to the masked array functions?
Ben Root

On Fri, Nov 19, 2010 at 11:12 AM, Benjamin Root ben.root@ou.edu wrote:
That's why I use masked arrays. It is dtype agnostic.
I am curious if there are any lessons that were learned in making Nanny that could be applied to the masked array functions?
I suppose you could write a cython function that operates on masked arrays. But other than that, I can't think of any lessons. All I can think about is speed:
x = np.ma.array([[1, 2], [3, 4]], mask=[[0, 1], [1, 0]])
timeit np.sum(x)
10000 loops, best of 3: 25.1 us per loop
a = np.array([[1, np.nan], [np.nan, 4]])
timeit ny.nansum(a)
100000 loops, best of 3: 3.11 us per loop
from nansum import nansum_2d_float64_axisNone
timeit nansum_2d_float64_axisNone(a)
1000000 loops, best of 3: 395 ns per loop

On Fri, Nov 19, 2010 at 2:35 PM, Keith Goodman kwgoodman@gmail.com wrote:
I suppose you could write a cython function that operates on masked arrays. But other than that, I can't think of any lessons. All I can think about is speed:
What's the speed advantage of nanny compared to np.nansum that you have if the arrays are larger, say (1000,10) or (10000,100) axis=0 ?
Josef

On Fri, Nov 19, 2010 at 12:10 PM, josef.pktd@gmail.com wrote:
What's the speed advantage of nanny compared to np.nansum that you have if the arrays are larger, say (1000,10) or (10000,100) axis=0 ?
Good point. In the small examples I showed so far maybe the speed up was all in overhead. Fortunately, that's not the case:
arr = np.random.rand(1000, 1000)
timeit np.nansum(arr)
100 loops, best of 3: 4.79 ms per loop
timeit ny.nansum(arr)
1000 loops, best of 3: 1.53 ms per loop
arr[arr > 0.5] = np.nan
timeit np.nansum(arr)
10 loops, best of 3: 44.5 ms per loop
timeit ny.nansum(arr)
100 loops, best of 3: 6.18 ms per loop
timeit np.nansum(arr, axis=0)
10 loops, best of 3: 52.3 ms per loop
timeit ny.nansum(arr, axis=0)
100 loops, best of 3: 12.2 ms per loop
np.nansum makes a copy of the input array and makes a mask (another copy) and then uses the mask to set the NaNs to zero in the copy. So not only is nanny faster, but it uses less memory.
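For reference, the copy-and-mask strategy described above amounts to something like this sketch (an illustration of the idea, not NumPy's actual source):

import numpy as np

def nansum_via_copy(a, axis=None):
    y = np.array(a, copy=True)  # first copy
    mask = np.isnan(y)          # second temporary array
    y[mask] = 0                 # zero the NaNs in the copy
    return y.sum(axis=axis)

A single-pass Cython loop never allocates those temporaries, which is where the speed and memory savings come from.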

On Fri, Nov 19, 2010 at 10:55 AM, Nathaniel Smith njs@pobox.com wrote:
On Fri, Nov 19, 2010 at 10:33 AM, Keith Goodman kwgoodman@gmail.com wrote:
Nanny uses the magic of Cython to give you a faster, drop-in replacement for the NaN functions in NumPy and SciPy.
Neat!
Why not make this a patch to numpy/scipy instead?
My guess is that having separate underlying functions for each dtype, ndim, and axis would be a nightmare for a large project like Numpy. But manageable for a focused project like nanny.
Nanny uses a separate Cython function for each combination of ndim, dtype, and axis. You can get rid of a lot of overhead (useful in an inner loop, e.g.) by directly importing the function that matches your problem::
>> arr = np.random.rand(10, 10)
>> from nansum import nansum_2d_float64_axis1
If this is really useful, then better to provide a function that finds the correct function for you?
best_nansum = ny.get_best_nansum(ary[0, :, :], axis=1)
for i in xrange(ary.shape[0]):
    best_nansum(ary[i, :, :], axis=1)
That would be useful. It is what nanny.nansum does but it returns the sum instead of the function.
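A minimal sketch of that kind of lookup, assuming the specialized functions follow the nansum_{ndim}d_{dtype}_axis{axis} naming used above (get_best_nansum is only a proposed name, not an existing Nanny function):

import nansum  # the compiled Cython module shown above

def get_best_nansum(arr, axis=None):
    "Return the specialized function matching arr's ndim, dtype, and axis."
    name = 'nansum_%dd_%s_axis%s' % (arr.ndim, arr.dtype.name, axis)
    return getattr(nansum, name)  # e.g. nansum_2d_float64_axis1

# func = get_best_nansum(ary[0, :, :], axis=1)
# for i in xrange(ary.shape[0]):
#     func(ary[i, :, :])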
- functions: nansum
- Operating systems: 64-bit (accumulator for int32 is hard coded to int64)
- dtype: int32, int64, float64
- ndim: 1, 2, and 3
What does it even mean to do NaN operations on integers? (I'd sometimes find it *really convenient* if there were a NaN value for standard computer integers... but there isn't?)
Well, sometimes you write functions without knowing the dtype of the input.

Fri, 19 Nov 2010 11:19:57 -0800, Keith Goodman wrote: [clip]
My guess is that having separate underlying functions for each dtype, ndim, and axis would be a nightmare for a large project like Numpy. But manageable for a focused project like nanny.
Might be easier to migrate the nan* functions to using Ufuncs.
Unless I'm missing something,
np.nanmax -> np.fmax.reduce
np.nanmin -> np.fmin.reduce
For `nansum`, we'd need to add a ufunc `nanadd`, and for `nanargmax/min`, we'd need `argfmin/fmax`.
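For example, fmax already has the NaN-skipping behavior that nanmax needs:

>> import numpy as np
>> a = np.array([1.0, np.nan, 3.0])
>> np.fmax.reduce(a)
3.0
>> np.nanmax(a)
3.0

fmax returns the non-NaN operand when only one argument is NaN, so the reduction ignores NaNs unless every element is NaN.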

On Fri, Nov 19, 2010 at 12:19 PM, Pauli Virtanen pav@iki.fi wrote:
Fri, 19 Nov 2010 11:19:57 -0800, Keith Goodman wrote: [clip]
My guess is that having separate underlying functions for each dtype, ndim, and axis would be a nightmare for a large project like Numpy. But manageable for a focused project like nanny.
Might be easier to migrate the nan* functions to using Ufuncs.
Unless I'm missing something,
np.nanmax -> np.fmax.reduce
np.nanmin -> np.fmin.reduce
For `nansum`, we'd need to add a ufunc `nanadd`, and for `nanargmax/min`, we'd need `argfmin/fmax`.
How about that! I wasn't aware of fmax/fmin. Yes, I'd like a nanadd, please.
arr = np.random.rand(1000, 1000)
arr[arr > 0.5] = np.nan
np.nanmax(arr)
0.49999625409581072
np.fmax.reduce(arr, axis=None)
<snip> TypeError: an integer is required
np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
0.49999625409581072
timeit np.fmax.reduce(np.fmax.reduce(arr, axis=0), axis=0)
100 loops, best of 3: 12.7 ms per loop
timeit np.nanmax(arr)
10 loops, best of 3: 39.6 ms per loop
timeit np.nanmax(arr, axis=0)
10 loops, best of 3: 46.5 ms per loop
timeit np.fmax.reduce(arr, axis=0)
100 loops, best of 3: 12.7 ms per loop
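Flattening first is a one-call equivalent of the nested reduce over both axes, and is essentially what Chuck times with arr.flat further down:

>> np.fmax.reduce(arr.ravel())
0.49999625409581072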

On Fri, Nov 19, 2010 at 12:29 PM, Keith Goodman kwgoodman@gmail.com wrote:
timeit np.fmax.reduce(arr, axis=0)
100 loops, best of 3: 12.7 ms per loop
Cython is faster than np.fmax.reduce.
I wrote a cython version of np.nanmax, called nanmax below. (It only handles the 2d, float64, axis=None case, but since the array is large I don't think that explains the time difference).
Note that fmax.reduce is slower than np.nanmax when there are no NaNs:
arr = np.random.rand(1000, 1000)
timeit np.nanmax(arr)
100 loops, best of 3: 5.82 ms per loop
timeit np.fmax.reduce(np.fmax.reduce(arr))
100 loops, best of 3: 9.14 ms per loop
timeit nanmax(arr)
1000 loops, best of 3: 1.17 ms per loop
arr[arr > 0.5] = np.nan
timeit np.nanmax(arr)
10 loops, best of 3: 45.5 ms per loop
timeit np.fmax.reduce(np.fmax.reduce(arr))
100 loops, best of 3: 12.7 ms per loop
timeit nanmax(arr)
1000 loops, best of 3: 1.17 ms per loop

On Fri, Nov 19, 2010 at 1:50 PM, Keith Goodman kwgoodman@gmail.com wrote:
Cython is faster than np.fmax.reduce.
There seem to be some odd hardware/compiler dependencies. I get quite a different pattern of times:
In [1]: arr = np.random.rand(1000, 1000)
In [2]: timeit np.nanmax(arr)
100 loops, best of 3: 10.4 ms per loop
In [3]: timeit np.fmax.reduce(arr.flat)
100 loops, best of 3: 2.09 ms per loop
In [4]: arr[arr > 0.5] = np.nan
In [5]: timeit np.nanmax(arr)
100 loops, best of 3: 12.9 ms per loop
In [6]: timeit np.fmax.reduce(arr.flat)
100 loops, best of 3: 7.09 ms per loop
I've tweaked fmax with the reduce loop option but the nanmax times don't look like yours at all. I'm also a bit surprised that you don't see any difference in times when the array contains a lot of nans. I'm running on AMD Phenom, gcc 4.4.5.
Chuck

On Fri, Nov 19, 2010 at 8:19 PM, Charles R Harris charlesr.harris@gmail.com wrote:
I've tweaked fmax with the reduce loop option but the nanmax times don't look like yours at all. I'm also a bit surprised that you don't see any difference in times when the array contains a lot of nans. I'm running on AMD Phenom, gcc 4.4.5.
However, I noticed that the build wants to be -O1 by default. I have my own CFLAGS that make it -O2, but It looks like ubuntu's python might be built with -O1. Hmm. That could certainly cause some odd timings.
Chuck

On Fri, Nov 19, 2010 at 7:19 PM, Charles R Harris charlesr.harris@gmail.com wrote:
I've tweaked fmax with the reduce loop option but the nanmax times don't look like yours at all. I'm also a bit surprised that you don't see any difference in times when the array contains a lot of nans. I'm running on AMD Phenom, gcc 4.4.5.
Ubuntu 10.04 64 bit, numpy 1.4.1.
Difference in which times? nanny.nanmax with and without NaNs? The code doesn't explicitly check for NaNs (it does check for all NaNs). It basically loops through the data and does:
allnan = 1
ai = a[i,k]
if ai > amax:
    amax = ai
    allnan = 0
I should make a benchmark suite.

On Fri, Nov 19, 2010 at 10:42 PM, Keith Goodman kwgoodman@gmail.com wrote:
allnan = 1
ai = a[i,k]
if ai > amax:
    amax = ai
    allnan = 0
does this give you the correct answer?
1>np.nan
False
What's the starting value for amax? -inf?
Josef

On Fri, Nov 19, 2010 at 7:51 PM, josef.pktd@gmail.com wrote:
On Fri, Nov 19, 2010 at 10:42 PM, Keith Goodman kwgoodman@gmail.com wrote:
It basically loops through the data and does:
allnan = 1
ai = a[i,k]
if ai > amax:
    amax = ai
    allnan = 0
does this give you the correct answer?
1>np.nan
False
Yes -- notice he does the comparison the other way, and
1 < np.nan
False
(All comparisons involving NaN return false, including, famously, NaN == NaN, which is why we need np.isnan.)
-- Nathaniel

On Fri, Nov 19, 2010 at 7:51 PM, josef.pktd@gmail.com wrote:
does this give you the correct answer?
1>np.nan
False
What's the starting value for amax? -inf?
Because "1 > np.nan" is False, the current running max does not get updated, which is what we want.
import nanny as ny
np.nanmax([1, np.nan])
1.0
np.nanmax([np.nan, 1])
1.0
np.nanmax([np.nan, 1, np.nan])
1.0
Starting value is -np.inf for floats and stuff like this for ints:
cdef np.int32_t MININT32 = np.iinfo(np.int32).min
cdef np.int64_t MININT64 = np.iinfo(np.int64).min
Numpy does this:
np.nanmax([])
<snip> ValueError: zero-size array to ufunc.reduce without identity
Nanny does this:
ny.nanmax([])
nan
So I haven't taken care of that corner case yet. I'll commit nanmax to github in case anyone wants to give it a try.

On Fri, Nov 19, 2010 at 10:59 PM, Keith Goodman kwgoodman@gmail.com wrote:
Starting value is -np.inf for floats and stuff like this for ints:
cdef np.int32_t MININT32 = np.iinfo(np.int32).min
cdef np.int64_t MININT64 = np.iinfo(np.int64).min
That's what I thought halfway through typing the question.
-np.inf>-np.inf
False
If the only value is -np.inf, you will return nan, I guess.
np.nanmax([-np.inf, np.nan])
-inf
Josef (being picky)

On Fri, Nov 19, 2010 at 8:33 PM, josef.pktd@gmail.com wrote:
-np.inf>-np.inf
False
If the only value is -np.inf, you will return nan, I guess.
np.nanmax([-np.inf, np.nan])
-inf
That's a great corner case. Thanks, Josef. This looks like it would fix it:
change

if ai > amax:
    amax = ai

to

if ai >= amax:
    amax = ai
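Here is a quick pure-Python sketch of the loop with the proposed >= change (not the Cython code itself), checking Josef's corner case:

import numpy as np

def nanmax_1d(a):
    amax, allnan = -np.inf, 1
    for ai in a:
        if ai >= amax:  # >= lets a lone -inf clear the allnan flag
            amax = ai
            allnan = 0
    return np.nan if allnan else amax

# nanmax_1d(np.array([-np.inf, np.nan]))  ->  -inf, matching np.nanmax
# nanmax_1d(np.array([np.nan, np.nan]))   ->  nan (all-NaN input)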

On Fri, Nov 19, 2010 at 8:42 PM, Keith Goodman kwgoodman@gmail.com wrote:
Difference in which times? nanny.nanmax with and without NaNs? The code doesn't explicitly check for NaNs (it does check for all NaNs). It basically loops through the data and does:
allnan = 1
ai = a[i,k]
if ai > amax:
    amax = ai
    allnan = 0
This doesn't look right:
@cython.boundscheck(False)
@cython.wraparound(False)
def nanmax_2d_float64_axisNone(np.ndarray[np.float64_t, ndim=2] a):
    "nanmax of 2d numpy array with dtype=np.float64 along axis=None."
    cdef Py_ssize_t i, j
    cdef int arow = a.shape[0], acol = a.shape[1], allnan = 1
    cdef np.float64_t amax = 0, aij
    for i in range(arow):
        for j in range(acol):
            aij = a[i,j]
            if aij == aij:
                amax += aij
                allnan = 0
    if allnan == 0:
        return np.float64(amax)
    else:
        return NAN
It's doing a sum, not a comparison.
Chuck

On Fri, Nov 19, 2010 at 8:05 PM, Charles R Harris charlesr.harris@gmail.com wrote:
This doesn't look right:
It's doing a sum, not a comparison.
That was a placeholder. Look at the latest commit. Sorry for the confusion.

On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgoodman@gmail.com wrote:
I should make a benchmark suite.
ny.benchit(verbose=False)
Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                  Shape        dtype     NaN?
   6.6770  nansum(a, axis=-1)    (500,500)    int64
   4.6612  nansum(a, axis=-1)    (10000,)     float64
   9.0351  nansum(a, axis=-1)    (500,500)    int32
   3.0746  nansum(a, axis=-1)    (500,500)    float64
  11.5740  nansum(a, axis=-1)    (10000,)     int32
   6.4484  nansum(a, axis=-1)    (10000,)     int64
  51.3917  nansum(a, axis=-1)    (500,500)    float64   NaN
  13.8692  nansum(a, axis=-1)    (10000,)     float64   NaN
   6.5327  nanmax(a, axis=-1)    (500,500)    int64
   8.8222  nanmax(a, axis=-1)    (10000,)     float64
   0.2059  nanmax(a, axis=-1)    (500,500)    int32
   6.9262  nanmax(a, axis=-1)    (500,500)    float64
   5.0688  nanmax(a, axis=-1)    (10000,)     int32
   6.5605  nanmax(a, axis=-1)    (10000,)     int64
  48.4850  nanmax(a, axis=-1)    (500,500)    float64   NaN
  14.6289  nanmax(a, axis=-1)    (10000,)     float64   NaN
You can also use the makefile to run the benchmark: make bench
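The Speed column is just the numpy time divided by the nanny time for the same call; roughly this, for one row of the table (the repeat/number choices here are illustrative):

import timeit

setup = "import numpy as np; import nanny as ny; a = np.random.rand(500, 500)"
t_np = min(timeit.repeat("np.nansum(a, axis=-1)", setup, repeat=3, number=100))
t_ny = min(timeit.repeat("ny.nansum(a, axis=-1)", setup, repeat=3, number=100))
print(round(t_np / t_ny, 4))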

On Sat, Nov 20, 2010 at 6:39 PM, Keith Goodman kwgoodman@gmail.com wrote:
I should make a benchmark suite.
Keith (and others),
What would you think about creating a library of mostly Cython-based "domain specific functions"? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase.
- Wes

On Sat, Nov 20, 2010 at 6:54 PM, Wes McKinney wesmckinn@gmail.com wrote:
By the way I wouldn't mind pushing all of my datetime-related code (date range generation, date offsets, etc.) into this new library, too.

On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wesmckinn@gmail.com wrote:
Keith (and others),
What would you think about creating a library of mostly Cython-based "domain specific functions"? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase.
I've started working on moving window statistics cython functions. I plan to make it into a package called Roly (for rolling). The signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr, window, axis=-1), etc.
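As a rough illustration of what a mov_nansum with that signature might compute, here is a pure-NumPy sketch of the 1d case (treating NaN as zero and NaN-padding the first window-1 slots are my assumptions; the real functions would be typed Cython loops):

import numpy as np

def mov_nansum_1d(arr, window):
    "Moving sum of the trailing `window` elements, treating NaN as 0."
    a = np.where(np.isnan(arr), 0.0, arr)
    csum = np.cumsum(a)
    out = np.empty(len(a))
    out[:window - 1] = np.nan  # not enough points yet
    out[window - 1:] = csum[window - 1:] - np.concatenate(([0.0], csum[:-window]))
    return out

# mov_nansum_1d(np.array([1.0, np.nan, 3.0, 4.0]), 2) -> [nan, 1.0, 3.0, 7.0]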
I think of Nanny and Roly as two separate packages. A narrow focus is good for a new package. But maybe each package could be a subpackage in a super package?
Would the function signatures in Nanny (exact duplicates of the corresponding functions in Numpy and Scipy) work for pandas? I plan to use Nanny in larry. I'll try to get the structure of the Nanny package in place. But if it doesn't attract any interest after that then I may fold it into larry.

On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman kwgoodman@gmail.com wrote:
I've started working on moving window statistics cython functions. I plan to make it into a package called Roly (for rolling). The signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr, window, axis=-1), etc.
I think of Nanny and Roly as two separate packages. A narrow focus is good for a new package. But maybe each package could be a subpackage in a super package?
Would the function signatures in Nanny (exact duplicates of the corresponding functions in Numpy and Scipy) work for pandas? I plan to use Nanny in larry. I'll try to get the structure of the Nanny package in place. But if it doesn't attract any interest after that then I may fold it into larry.
Why make multiple packages? It seems like all these functions are somewhat related: practical tools for real-world data analysis (where observations are often missing). I suspect having everything under one hood would create more interest than chopping things up-- would be very useful to folks in many different disciplines (finance, economics, statistics, etc.). In R, for example, NA-handling is just a part of every day life. Of course in R there is a special NA value which is distinct from NaN-- many folks object to the use of NaN for missing values. The alternative is masked arrays, but in my case I wasn't willing to sacrifice so much performance for purity's sake.
I could certainly use the nan* functions to replace code in pandas where I've handled things in a somewhat ad hoc way.

On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wesmckinn@gmail.com wrote:
I could certainly use the nan* functions to replace code in pandas where I've handled things in a somewhat ad hoc way.
A package focused on NaN-aware functions sounds like a good idea. I think a good plan would be to start by making faster, drop-in replacements for the NaN functions that are already in numpy and scipy. That is already a lot of work. After that, one possibility is to add stuff like nancumsum, nanprod, etc. After that moving window stuff?

On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman kwgoodman@gmail.com wrote:
A package focused on NaN-aware functions sounds like a good idea. I think a good plan would be to start by making faster, drop-in replacements for the NaN functions that are already in numpy and scipy. That is already a lot of work. After that, one possibility is to add stuff like nancumsum, nanprod, etc. After that moving window stuff?
and maybe group functions after that?
If there is a lot of repetition, you could use templating. Even simple string substitution, if it is only replacing the dtype, works pretty well. It would at least reduce some copy-paste.
Josef
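As an illustration of the simple string-substitution idea, a template like the following could generate one specialized .pyx function per dtype (the template body is only a sketch and glosses over details such as the int32 accumulator):

from string import Template

func = Template('''
@cython.boundscheck(False)
@cython.wraparound(False)
def nansum_1d_${dt}_axisNone(np.ndarray[np.${dt}_t, ndim=1] a):
    "nansum of 1d numpy array with dtype=np.${dt} along axis=None."
    cdef Py_ssize_t i
    cdef np.${dt}_t asum = 0
    for i in range(a.shape[0]):
        if a[i] == a[i]:  # skip NaN (always true for ints)
            asum += a[i]
    return np.${dt}(asum)
''')

with open('nansum_auto.pyx', 'w') as f:
    for dt in ('float64', 'int64', 'int32'):
        f.write(func.substitute(dt=dt))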

On Sun, Nov 21, 2010 at 12:30 PM, josef.pktd@gmail.com wrote:
and maybe group functions after that?
Yes, group functions are on my list.
If there is a lot of repetition, you could use templating. Even simple string substitution, if it is only replacing the dtype, works pretty well. It would at least reduce some copy-paste.
Unit test coverage should be good enough to mess around with trying templating. What's a good way to go? Write my own script that creates the .pyx file and call it from the make file? Or are there packages for doing the templating?
I added nanmean (the first scipy function to enter nanny) and nanmin.

On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman kwgoodman@gmail.com wrote:
Unit test coverage should be good enough to mess around with trying templating. What's a good way to go? Write my own script that creates the .pyx file and call it from the make file? Or are there packages for doing the templating?
Depends on the scale, I tried once with simple string templates http://codespeak.net/pipermail/cython-dev/2009-August/006614.html
here is a pastebin of another version by ....(?), http://pastebin.com/f1a49143d discussed on the cython-dev mailing list.
The cython list has the discussion every once in a while but I haven't seen any conclusion yet. For heavier duty templating a proper templating package (Jinja?) might be better.
I'm not an expert.
Josef

On Sun, Nov 21, 2010 at 6:02 PM, josef.pktd@gmail.com wrote:
Depends on the scale, I tried once with simple string templates http://codespeak.net/pipermail/cython-dev/2009-August/006614.html
here is a pastbin of another version by ....(?), http://pastebin.com/f1a49143d discussed on the cython-dev mailing list.
The cython list has the discussion every once in a while but I haven't seen any conclusion yet. For heavier duty templating a proper templating package (Jinja?) might be better.
I'm not an expert.
Josef
I added nanmean (the first scipy function to enter nanny) and nanmin. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
What would you say to a single package that contains:
- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
- moving window functions (moving_{count, sum, mean, var, std, etc.})
- core subroutines for labeled data
- group-by functions
- other things to add to this list?
In other words, basic computational building blocks for making libraries like larry, pandas, etc. and for doing time series / statistical / other manipulations on real-world (messy) data sets. The focus isn't so much "NaN-awareness" per se but more practical "data wrangling". I would be happy to work on such a package and to move all the Cython code I've written into it. There's a little bit of datarray overlap potentially but I think that's OK.

On Sun, Nov 21, 2010 at 3:16 PM, Wes McKinney wesmckinn@gmail.com wrote:
What would you say to a single package that contains:
- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
I'd say yes.
- moving window functions (moving_{count, sum, mean, var, std, etc.})
Yes.
BTW, we both do arr=arr.astype(float), I think, before doing the moving statistics. So I sped things up by running the moving window backwards and writing the result in place.
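One way to read that backward-in-place trick, sketched in plain Python for a 1-D array (this is not the actual larry or pandas code)::

    import numpy as np

    def mov_sum_inplace(arr, window):
        # arr would already have been cast to float for NaN support, so that
        # same buffer can hold the output: walking backwards, a[i] is only
        # overwritten once every window that reads it has been computed.
        a = arr.astype(float)
        n = a.shape[0]
        for i in range(n - 1, window - 2, -1):
            a[i] = a[i - window + 1:i + 1].sum()
        a[:window - 1] = np.nan
        return a

In Cython the inner sum would of course be an explicit loop rather than a slice-and-sum.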
- core subroutines for labeled data
Not sure what this would be. Let's discuss.
- group-by functions
Yes. I have some ideas on function signatures.
- other things to add to this list?
A no-op function with a really long doc string!
In other words, basic building computational tools for making libraries like larry, pandas, etc. and doing time series / statistical / other manipulations on real world (messy) data sets. The focus isn't so much "NaN-awareness" per se but more practical "data wrangling". I would be happy to work on such a package and to move all the Cython code I've written into it. There's a little bit of datarray overlap potentially but I think that's OK
Maybe we should make a list of function signatures along with brief doc strings to get a feel for what we (and hopefully others) have in mind?
Where should we continue the discussion? The pystatsmodels mailing list? By now the numpy list probably thinks of NaN as "Not ANother" email from this guy.

On Sun, Nov 21, 2010 at 6:37 PM, Keith Goodman kwgoodman@gmail.com wrote:
On Sun, Nov 21, 2010 at 3:16 PM, Wes McKinney wesmckinn@gmail.com wrote:
What would you say to a single package that contains:
- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
I'd say yes.
- moving window functions (moving_{count, sum, mean, var, std, etc.})
Yes.
BTW, we both do arr=arr.astype(float), I think, before doing the moving statistics. So I sped things up by running the moving window backwards and writing the result in place.
- core subroutines for labeled data
Not sure what this would be. Let's discuss.
Basically want to produce an indexing vector based on rules-- something to pass to ndarray.take later on. And maybe your generic binary-op function from a while back?
- group-by functions
Yes. I have some ideas on function signatures.
- other things to add to this list?
A no-op function with a really long doc string!
In other words, basic building computational tools for making libraries like larry, pandas, etc. and doing time series / statistical / other manipulations on real world (messy) data sets. The focus isn't so much "NaN-awareness" per se but more practical "data wrangling". I would be happy to work on such a package and to move all the Cython code I've written into it. There's a little bit of datarray overlap potentially but I think that's OK
Maybe we should make a list of function signatures along with brief doc strings to get a feel for what we (and hopefully others) have in mind?
I've personally never been much for writing specs, but could be useful. We probably aren't going to get it all right on the first try, so we'll just do our best and refactor the code later if necessary. We might be well-served by collecting exemplary data sets and making a list of things we would like to be able to do easily with that data.
But writing stuff like:
moving_{funcname}(ndarray data, int window, int axis=0, int min_periods=window) -> ndarray
group_aggregate(ndarray data, ndarray labels, int axis=0, function agg_function) -> ndarray
group_transform(...)
...
etc. makes sense
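A sketch of what a group_aggregate with that signature could boil down to, using present-day NumPy for brevity (argument names follow the email; the implementation and the example labels are made up)::

    import numpy as np

    def group_aggregate(data, labels, axis=0, agg_function=np.nansum):
        data = np.asarray(data)
        labels = np.asarray(labels)
        keys = np.unique(labels)
        # Aggregate each label's slice along `axis`, then stack the results.
        pieces = [agg_function(np.take(data, np.flatnonzero(labels == k), axis=axis), axis=axis)
                  for k in keys]
        return keys, np.stack(pieces, axis=axis)

    data = np.arange(12, dtype=float).reshape(4, 3)
    data[0, 0] = np.nan
    keys, agg = group_aggregate(data, np.array(['a', 'b', 'a', 'b']))

A Cython version would presumably do a single pass over the data with a label lookup instead of one fancy-indexing pass per group.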
Where should we continue the discussion? The pystatsmodels mailing list? By now the numpy list probably thinks of NaN as "Not ANother" email from this guy.
Maybe let's have the next thread on SciPy-user-- I think what we're talking about is general enough to be discussed there.

On Sun, Nov 21, 2010 at 09:03:22PM -0500, Wes McKinney wrote:
Maybe let's have the next thread on SciPy-user-- I think what we're talking about is general enough to be discussed there.
Yes, a lot of this is of general interest.
I'd be particularly interested in having the NaN work land in scipy. There are many places where it would be useful (I understand that it would require more work).
G

Hi list,
does anybody have, or know where I can find, some N dimensional dichotomy optimization code in Python (BSD licensed, or equivalent)?
Worst case, it does not look too bad to code, but I am interested in any advice. I haven't done my reading yet, and I don't know how ill-posed a problem it is. I had in mind starting from a set of points and iterating the computation of the objective function's value at the barycenters of these points, and updating this list of points. This does raise a few questions on what the best possible updates are.
Cheers,
Gael

2010/11/22 Gael Varoquaux gael.varoquaux@normalesup.org:
Hi list,
Hi ;)
does anybody have, or knows where I can find some N dimensional dichotomy optimization code in Python (BSD licensed, or equivalent)?
I don't know any code, but it shouldn't be too difficult by going through a KdTree.
Worst case, it does not look too bad to code, but I am interested by any advice. I haven't done my reading yet, and I don't know how ill-posed a problem it is. I had in mind starting from a set of points and iterating the computation of the objective function's value at the barycenters of these points, and updating this list of points. This does raise a few questions on what are the best possible updates.
In this case, you may want to check the Nelder-Mead algorithm (also known as downhill simplex or polytope), which is available in scikits.optimization, but there are other implementations out there.
Cheers ;)
Matthieu

On Mon, Nov 22, 2010 at 09:12:45PM +0100, Matthieu Brucher wrote:
Hi ;)
Hi bro
does anybody have, or knows where I can find some N dimensional dichotomy optimization code in Python (BSD licensed, or equivalent)?
I don't know any code, but it shouldn't be too difficult by going through a KdTree.
I am not in terribly high-dimensional spaces, so I don't really need to use a KdTree (but we do happen to have a nice BallTree available in the scikit-learn, so I could use it just to play :>).
Worst case, it does not look too bad to code, but I am interested by any advice. I haven't done my reading yet, and I don't know how ill-posed a problem it is. I had in mind starting from a set of points and iterating the computation of the objective function's value at the barycenters of these points, and updating this list of points. This does raise a few questions on what are the best possible updates.
In this case, you may want to check the Nelder-Mead algorithm (also known as downhill simplex or polytope), which is available in scikits.optimization, but there are other implementations out there.
Interesting reference. I had never looked at the Nelder-Mead algorithm. I am wondering if it does what I want, though.
The reason I am looking at dichotomy optimization is that the objective function that I want to optimize has local roughness, but is globally pretty much a bell-shaped curve. Dichotomy looks like it will get quite close to the top of the curve (and I have been told that it works well on such problems). One thing that is nice with dichotomy for my needs is that it is not based on a gradient, and it works in a convex region of the parameter space.
Will the Nelder-Mead display such properties? It seems so to me, but I don't trust my quick read through of Wikipedia.
I realize that maybe I should rephrase my question to try and draw more out of the common wealth of knowledge on this mailing list: what do people suggest to tackle this problem? Guided by Matthieu's suggestion, I have started looking at Powell's algorithm, and given the introduction of www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering whether I should not investigate it. Can people provide any insights on these problems?
Many thanks,
Gael
PS: The reason I am looking at this optimization problem is that I got tired of looking at grid searches optimize the cross-validation score on my 3-parameter estimator (not in the scikit-learn, because it is way too specific to my problems).

2010/11/22 Gael Varoquaux gael.varoquaux@normalesup.org:
On Mon, Nov 22, 2010 at 09:12:45PM +0100, Matthieu Brucher wrote:
Hi ;)
Hi bro
does anybody have, or knows where I can find some N dimensional dichotomy optimization code in Python (BSD licensed, or equivalent)?
I don't know any code, but it shouldn't be too difficult by going through a KdTree.
I am not in terribly high-dimensional spaces, so I don't really need to use a KdTree (but we do happen to have a nice BallTree available in the scikit-learn, so I could use it just to play :>).
:D
In this case, you may want to check the Nelder-Mead algorithm (also known as downhill simplex or polytope), which is available in scikits.optimization, but there are other implementations out there.
Interesting reference. I had never looked at the Nelder-Mead algorithm. I am wondering if it does what I want, though.
The reason I am looking at dichotomy optimization is that the objective function that I want to optimize has local roughness, but is globally pretty much a bell-shaped curve. Dichotomy looks like it will get quite close to the top of the curve (and I have been told that it works well on such problems). One thing that is nice with dichotomy for my needs is that it is not based on a gradient, and it works in a convex of the parameter space.
It seems that a simplex is what you need. It uses the barycenter (more or less) to find a new point in the simplex. And it works well only in convex functions (but in fact almost all functions have an issue with this :D)
Will the Nelder-Mead display such properties? It seems so to me, but I don't trust my quick read through of Wikipedia.
Yes, it doesn't need a gradient, and if the function is convex, it works in a convex region of the parameter space.
I realize that maybe I should rephrase my question to try and draw more out of the common wealth of knowledge on this mailing list: what do people suggest to tackle this problem? Guided by Matthieu's suggestion, I have started looking at Powell's algorithm, and given the introduction of www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering whether I should not investigate it. Can people provide any insights on these problems.
Indeed, Powell may also be a solution. A simplex is just what is closest to what you hinted at as an optimization algorithm.
Many thanks,
You're welcome ;)
Gael
PS: The reason I am looking at this optimization problem is that I got tired of looking at grid searches optimize the cross-validation score on my 3-parameter estimator (not in the scikit-learn, because it is way too specific to my problems).
Perhaps you may want to combine it with genetic algorithms. We also kind of combine grid search with simplex-based optimizer with simulated annealing in some of our global optimization problems, and I think I'll try at one point to introduce genetic algorithms instead of the grid search. Your problem is simpler though if it displays some convexity.
Matthieu

On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
It seems that a simplex is what you need.
Ha! I am learning new fancy words. Now I can start looking clever.
I realize that maybe I should rephrase my question to try and draw more out of the common wealth of knowledge on this mailing list: what do people suggest to tackle this problem? Guided by Matthieu's suggestion, I have started looking at Powell's algorithm, and given the introduction of www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf I am wondering whether I should not investigate it. Can people provide any insights on these problems.
Indeed, Powell may also be a solution. A simplex is just what is closest to what you hinted at as an optimization algorithm.
I'll do a bit more reading.
PS: The reason I am looking at this optimization problem is that I got tired of looking at grid searches optimize the cross-validation score on my 3-parameter estimator (not in the scikit-learn, because it is way too specific to my problems).
Perhaps you may want to combine it with genetic algorithms. We also kind of combine grid search with simplex-based optimizer with simulated annealing in some of our global optimization problems, and I think I'll try at one point to introduce genetic algorithms instead of the grid search.
Well, in the scikit, in the long run (it will take a little while) I'd like to expose other optimization methods then the GridSearchCV, so if you have code or advice to give us, we'd certainly be interested.
Gael

2010/11/22 Gael Varoquaux gael.varoquaux@normalesup.org:
There is scikits.optimization partly in the externals :D But I don't think they should be in scikits.learn directly. Of course, the scikit may need access to some global optimization methods, but the most used one is already there (the grid search). Then for genetic algorithms, pyevolve is pretty much all you want (I still have to check the multiprocessing part)
Matthieu

On Mon, Nov 22, 2010 at 5:27 PM, Matthieu Brucher matthieu.brucher@gmail.com wrote:
Is that license http://pyevolve.sourceforge.net/license.html BSD compatible?
Josef

On Mon, Nov 22, 2010 at 11:12:26PM +0100, Matthieu Brucher wrote:
It seems that a simplex is what you need. It uses the barycenter (more or less) to find a new point in the simplex. And it works well only in convex functions (but in fact almost all functions have an issue with this :D)
One last question, now that I know that what I am looking for is a simplex algorithm (it indeed corresponds to what I was after), is there a reason not to use optimize.fmin? It implements a Nelder-Mead. I must admit that I don't see how I can use it to specify the convex hull of the parameters in which it operates, or restrict it to work only on integers, which are two things that I may want to do.
Gaël

2010/11/22 Gael Varoquaux gael.varoquaux@normalesup.org:
optimize.fmin can be enough, I don't know it well enough. Nelder-Mead is not a constrained optimization algorithm, so you can't specify an outer hull. As for the integer part, I don't know if optimize.fmin is type consistent, I don't know if scikits.optimization is either, but I can check it. It should, as there is nothing impeding it.
Matthieu

On Tue, Nov 23, 2010 at 08:21:55AM +0100, Matthieu Brucher wrote:
optimize.fmin can be enough, I don't know it well enough. Nelder-Mead is not a constrained optimization algorithm, so you can't specify an outer hull.
I saw that, after a bit more reading.
As for the integer part, I don't know if optimize.fmin is type consistent,
That's not a problem: I wrap my function in a small object to ensure memoization and input argument casting.
The problem is that I can't tell the Nelder-Mead that the smallest jump it should attempt is .5. I can set xtol to .5, but it still attempts jumps of .001 in its initial jumps. Of course optimization on integers is fairly ill-posed, so I am asking for trouble.
Gael
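A sketch of the kind of wrapper described above (names are hypothetical; the real objective is a cross-validation score)::

    import numpy as np
    from scipy import optimize

    class IntegerObjective(object):
        # Cast the incoming point to integers and memoize, so the many nearby
        # evaluations Nelder-Mead makes on a flat region cost nothing extra.
        def __init__(self, func):
            self.func = func
            self.cache = {}

        def __call__(self, x):
            key = tuple(int(round(xi)) for xi in np.atleast_1d(x))
            if key not in self.cache:
                self.cache[key] = self.func(np.array(key))
            return self.cache[key]

    # wrapped = IntegerObjective(cross_validation_score)  # hypothetical objective
    # optimize.fmin(wrapped, x0=[10], xtol=0.5, ftol=0.1)

The memoization helps, but it does not by itself change the sub-integer steps fmin takes early on, which is the problem discussed next.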

The problem is that I can't tell the Nelder-Mead that the smallest jump it should attempt is .5. I can set xtol to .5, but it still attempts jumps of .001 in its initial jumps.
This is strange. It should not if the initial points are set adequately. You may want to check if the initial conditions make the optimization start at correct locations.
Of course optimization on integers is fairly ill-posed, so I am asking for trouble.
Indeed :D That's why GA can be a good solution as well.
Matthieu

On Tue, Nov 23, 2010 at 10:18:50AM +0100, Matthieu Brucher wrote:
The problem is that I can't tell the Nelder-Mead that the smallest jump it should attempt is .5. I can set xtol to .5, but it still attempts jumps of .001 in its initial jumps.
This is strange. It should not if the initial points are set adequately. You may want to check if the initial conditions make the optimization start at correct locations.
Yes, that's exactly the problem. And it is easy to see why: in scipy.optimize.fmin, around line 186, the initial points are chosen with a relative distance of 0.00025 to the initial guess that is given. That's not what I want in the case of integers :).
Of course optimization on integers is fairly ill-posed, so I am asking for trouble.
Indeed :D That's why GA can be a good solution as well.
It's suboptimal if I know that my function is bell-shaped.
Gael
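For what it's worth, later SciPy releases (well after this thread) let you hand Nelder-Mead an explicit starting simplex, which sidesteps that hard-coded relative step; a sketch with a toy objective standing in for the cross-validation score::

    import numpy as np
    from scipy import optimize

    def objective(x):                      # placeholder objective
        return (x[0] - 7) ** 2 + (x[1] - 3) ** 2

    x0 = np.array([10.0, 10.0])
    # Vertices differ from x0 by whole unit steps instead of a tiny fraction.
    sim = np.vstack([x0, x0 + [1.0, 0.0], x0 + [0.0, 1.0]])
    res = optimize.minimize(objective, x0, method="Nelder-Mead",
                            options={"initial_simplex": sim, "xatol": 0.5})

With the 2010-era scipy.optimize.fmin, the practical workarounds were editing that constant locally or rescaling the parameters.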

Hello Gael,
On Tue, Nov 23, 2010 at 10:27 AM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
I'm not familiar with dichotomy optimization. Several techniques have been proposed to solve the problem: genetic algorithms, simulated annealing, Nelder-Mead and Powell. To be honest, I find it quite confusing that these algorithms are named in the same breath. Do you have a continuous or a discrete problem?
Is your problem of the following form?
min_x  f(x)
s.t.   lo <= Ax + b <= up
       0 = g(x)
       0 <= h(x)
An if yes, in which space does x live?
cheers, Sebastian

On Tue, Nov 23, 2010 at 11:13:23AM +0100, Sebastian Walter wrote:
I'm not familiar with dichotomy optimization. Several techniques have been proposed to solve the problem: genetic algorithms, simulated annealing, Nelder-Mead and Powell. To be honest, I find it quite confusing that these algorithms are named in the same breath.
I am confused too. But that stems from my lack of knowledge in optimization.
Do you have a continuous or a discrete problem?
Both.
Is your problem of the following form?
min_x  f(x)
s.t.   lo <= Ax + b <= up
       0 = g(x)
       0 <= h(x)
No constraints.
An if yes, in which space does x live?
Either in R^n, in the set of integers (unidimensional), or in the set of positive integers.
Gaël

On Tue, Nov 23, 2010 at 11:17 AM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
On Tue, Nov 23, 2010 at 11:13:23AM +0100, Sebastian Walter wrote:
I'm not familiar with dichotomy optimization. Several techniques have been proposed to solve the problem: genetic algorithms, simulated annealing, Nelder-Mead and Powell. To be honest, I find it quite confusing that these algorithms are named in the same breath.
I am confused too. But that stems from my lack of knowledge in optimization.
Do you have a continuous or a discrete problem?
Both.
Is your problem of the following form?
min_x  f(x)
s.t.   lo <= Ax + b <= up
       0 = g(x)
       0 <= h(x)
No constraints.
didn't you say that you operate only in some convex hull?
An if yes, in which space does x live?
Either in R^n, in the set of integers (unidimensional), or in the set of positive integers.
According to http://openopt.org/Problems this is a mixed integer nonlinear program http://openopt.org/MINLP . I don't have experience with the solver though, but it may take a long time to run it since it uses branch-and-bound. In my field of work we typically relax the integers to real numbers, perform the optimization and then round to the next integer. This is often sufficiently close to a good solution.
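A bare-bones version of that relax-then-round recipe (toy objective, nothing domain-specific)::

    import numpy as np
    from scipy import optimize

    def objective(x):                      # stand-in for the real score
        return (x[0] - 12.3) ** 2 + (x[1] - 4.7) ** 2

    relaxed = optimize.fmin(objective, x0=[1.0, 1.0], disp=False)
    candidate = np.round(relaxed).astype(int)
    # Rounding each coordinate independently is not always optimal, so it can
    # pay to also evaluate the integer neighbours of `candidate`.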

On Tue, Nov 23, 2010 at 11:37:02AM +0100, Sebastian Walter wrote:
min_x  f(x)
s.t.   lo <= Ax + b <= up
       0 = g(x)
       0 <= h(x)
No constraints.
didn't you say that you operate only in some convex hull?
No. I have an initial guess that allows me to specify a convex hull in which the minimum should probably lie, but it's not a constraint: nothing bad happens if I leave that convex hull.
Either in R^n, in the set of integers (unidimensional), or in the set of positive integers.
According to http://openopt.org/Problems this is a mixed integer nonlinear program http://openopt.org/MINLP .
It is indeed the name I know for it; however, I have an additional hypothesis (namely that f is roughly convex) which makes it much easier.
I don't have experience with the solver though, but it may take a long time to run it since it uses branch-and-bound.
Yes, this is too brutal: this is for non-convex optimization. Dichotomy seems well-suited for finding an optimum on the set of integers.
In my field of work we typically relax the integers to real numbers, perform the optimization and then round to the next integer. This is often sufficiently close a good solution.
This is pretty much what I am doing, but you have to be careful: if the algorithm does jumps that are smaller than 1, it gets a zero difference between those jumps. If you are not careful, this might confuse the algorithm a lot and trick it into not converging.
Thanks for your advice,
Gaël

On Tue, Nov 23, 2010 at 11:43 AM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
ah, that clears things up a lot.
Well, I don't know what the best method is to solve your problem, so take the following with a grain of salt: wouldn't it be better to change the model than to modify the optimization algorithm? It sounds as if the resulting objective function is piecewise constant. AFAIK most optimization algorithms for continuous problems require at least Lipschitz continuous functions to work ''acceptably well''. Not sure if this is also true for Nelder-Mead.

On Tue, Nov 23, 2010 at 02:47:10PM +0100, Sebastian Walter wrote:
Well, I don't know what the best method is to solve your problem, so take the following with a grain of salt: Wouldn't it be better to change the model than modifying the optimization algorithm?
In this case, that's not possible. You can think of this parameter as the number of components in a PCA (it's actually a more complex dictionary learning framework), so it's a parameter that is discrete, and I can't do anything about it :).
It sounds as if the resulting objective function is piecewise constant.
AFAIK most optimization algorithms for continuous problems require at least Lipschitz continuous functions to work ''acceptable well''. Not sure if this is also true for Nelder-Mead.
Yes correct. We do have a problem.
I have a Nelder-Mead that seems to be working quite well on a few toy problems.
Gaël

On Tue, Nov 23, 2010 at 8:50 AM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
Assuming your function is well behaved, one possible idea is to try replacing the integer objective function with a continuous interpolation. Or maybe fit a bell-shaped curve to a few grid points. It might get you faster into the right neighborhood to do an exact search.
(There are some similar methods of using a surrogate objective function, when it is very expensive or impossible to calculate an objective function, but I never looked closely at these cases.)
Josef
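A sketch of that grid-plus-surrogate idea; cv_score is a noisy stand-in for the real cross-validation score::

    import numpy as np

    def cv_score(k):                        # placeholder, roughly bell-shaped
        return -(k - 37) ** 2 + 5 * np.random.randn()

    grid = np.arange(5, 100, 10)
    scores = np.array([cv_score(k) for k in grid])
    a, b, c = np.polyfit(grid, scores, 2)   # quadratic surrogate
    k_hat = int(round(-b / (2 * a)))        # vertex of the fitted parabola
    # Exact (noisy) search only in a small integer neighbourhood of the vertex.
    best = max(range(max(1, k_hat - 5), k_hat + 6), key=cv_score)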

On Tue, Nov 23, 2010 at 10:19:06AM -0500, josef.pktd@gmail.com wrote:
I have a Nelder-Mead that seems to be working quite well on a few toy problems.
Assuming your function is well behaved, one possible idea is to try replacing the integer objective function with a continuous interpolation. Or maybe fit a bellshaped curve to a few gridpoints. It might get you faster into the right neighborhood to do an exact search.
I've actually been wondering if Gaussian process regression (aka kriging) would not be useful here :). It's fairly good at fitting processes for which there is very irregular information.
Now I am perfectly aware that any move in this direction would require significant work, so this is more day dreaming than project planning.
Gael
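Sketched with current scikit-learn class names (not the scikit of 2010), and with made-up data points, the surrogate idea could look like::

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    k_seen = np.array([[5.0], [15.0], [30.0], [60.0]])   # n_components tried so far
    score_seen = np.array([0.61, 0.72, 0.74, 0.69])      # their CV scores

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0), alpha=1e-3)
    gp.fit(k_seen, score_seen)

    k_grid = np.arange(1, 101, dtype=float).reshape(-1, 1)
    mean, std = gp.predict(k_grid, return_std=True)
    k_next = int(k_grid[np.argmax(mean + std), 0])       # optimistic pick for the next evaluation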

On Tue, Nov 23, 2010 at 2:50 PM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
On Tue, Nov 23, 2010 at 02:47:10PM +0100, Sebastian Walter wrote:
Well, I don't know what the best method is to solve your problem, so take the following with a grain of salt: Wouldn't it be better to change the model than modifying the optimization algorithm?
In this case, that's not possible. You can think of this parameter as the number of components in a PCA (it's actually a more complex dictionnary learning framework), so it's a parameter that is discrete, and I can't do anything about it :).
In optimum experimental design one encounters MINLPs where integers define the number of rows of a matrix. At first glance it looks as if a relaxation is simply not possible: either there are additional rows or not. But with some technical transformations it is possible to reformulate the problem into a form that allows the relaxation of the integer constraint in a natural way.
Maybe this is also possible in your case? Otherwise, well, let me know if you find a working solution ;)

On Tue, Nov 23, 2010 at 04:33:00PM +0100, Sebastian Walter wrote:
At first glance it looks as if a relaxation is simply not possible: either there are additional rows or not. But with some technical transformations it is possible to reformulate the problem into a form that allows the relaxation of the integer constraint in a natural way.
Maybe this is also possible in your case?
Well, given that it is a cross-validation score that I am optimizing, there is no simple algorithm giving this score, so it's not obvious at all that there is a possible relaxation. A road to follow would be to find an oracle giving empirical risk after estimation of the penalized problem, and try to relax this oracle. That's two steps further than I am. (I apologize if the above paragraph is incomprehensible; I am getting too deep into the technicalities of my problem.)
Otherwise, well, let me know if you find a working solution ;)
Nelder-Mead seems to be working fine, so far. It will take a few weeks (or more) to have a real insight on what works and what doesn't.
Thanks for your input,
Gael

On Nov 23, 2010, at 10:57 AM, Gael Varoquaux wrote:
Jumping in a little late, but it seems that simulated annealing might be a decent method here: take random steps (drawing from a distribution of integer step sizes), reject steps that fall outside the fitting range, and accept steps according to the standard annealing formula.
Something with a global optimum but spikes along the way is pretty well-suited to SA in general, and it's also an easy algorithm to make work on a lattice. If you're in high dimensions, there are also bolt-on methods for biasing the steps toward "good directions" as opposed to just taking isotropic random steps. Again, pretty easy to think of discrete implementations of this...
Zach
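A compact version of that lattice annealer, with a noisy stand-in objective (every constant below is arbitrary)::

    import numpy as np

    def objective(k):                        # placeholder: rough but bell-shaped valley
        return (k - 37) ** 2 + 10 * np.random.rand()

    rng = np.random.default_rng(0)
    k = best_k = 50
    f = best_f = objective(k)
    for t in range(500):
        temperature = max(50.0 * (1.0 - t / 500.0), 1e-3)
        step = int(rng.integers(-3, 4))      # integer step sizes
        k_new = k + step
        if step == 0 or not (1 <= k_new <= 100):
            continue                         # reject steps outside the allowed range
        f_new = objective(k_new)
        # standard Metropolis acceptance rule
        if f_new < f or rng.random() < np.exp(-(f_new - f) / temperature):
            k, f = k_new, f_new
            if f < best_f:
                best_k, best_f = k, f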

2010/11/23 Zachary Pincus zachary.pincus@yale.edu:
There is also a simulated-annealing modification of Nelder-Mead that can be of use.
Matthieu

On Tue, Nov 23, 2010 at 07:14:56PM +0100, Matthieu Brucher wrote:
Jumping in a little late, but it seems that simulated annealing might be a decent method here: take random steps (drawing from a distribution of integer step sizes), reject steps that fall outside the fitting range, and accept steps according to the standard annealing formula.
There is also a simulated-annealing modification of Nelder Mead that can be of use.
Sounds interesting. Any reference?
G

2010/11/24 Gael Varoquaux gael.varoquaux@normalesup.org:
On Tue, Nov 23, 2010 at 07:14:56PM +0100, Matthieu Brucher wrote:
Jumping in a little late, but it seems that simulated annealing might be a decent method here: take random steps (drawing from a distribution of integer step sizes), reject steps that fall outside the fitting range, and accept steps according to the standard annealing formula.
There is also a simulated-annealing modification of Nelder Mead that can be of use.
Sounds interesting. Any reference?
Not right away, I have to check. The main difference is the possible acceptance of a contraction that doesn't lower the cost, and this is done with a temperature like simulated annealing.
Matthieu

On Tue, Nov 23, 2010 at 3:37 AM, Sebastian Walter < sebastian.walter@gmail.com> wrote:
According to http://openopt.org/Problems this is a mixed integer nonlinear program http://openopt.org/MINLP . I don't have experience with the solver though, but it may take a long time to run it since it uses branch-and-bound. In my field of work we typically relax the integers to real numbers, perform the optimization and then round to the next integer. This is often sufficiently close a good solution.
I've sometimes applied a genetic algorithm to the rounding, which can improve things a bit if needed.
Chuck

On Tue, Nov 23, 2010 at 3:37 AM, Sebastian Walter sebastian.walter@gmail.com wrote: On Tue, Nov 23, 2010 at 11:17 AM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
I would like to advertise a bit for genetic algorithms. In my experience, they seem to be the most powerful of the optimization techniques mentioned here. In particular, they are good at getting out of local minima, and don't really care if you are talking about integer or continuous problems. As long as you can think of a good representation and good genetic operators, you should be good! I have just a little experience with pyevolve myself, but it is really easy to test GAs with pyevolve, as you just have to make a few settings to get going.
One word of warning: GA performance is very sensitive to the actual parameters you choose! Especially, you should think about mutation rates, crossover rates, selection protocols, and number of crossover points. (This list came off the top of my head...)
If you have any GA questions, ask, and perhaps I can come up with an answer.
Paul.
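To make those knobs concrete, here is a deliberately tiny GA over integer genomes (pyevolve wraps all of this up for you; the fitness function and every constant below are placeholders)::

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(genome):                     # placeholder fitness
        return -np.sum((genome - np.array([37, 4, 12])) ** 2)

    pop = rng.integers(1, 101, size=(40, 3))           # 40 integer genomes of length 3
    mutation_rate, crossover_rate = 0.1, 0.7

    for generation in range(100):
        scores = np.array([fitness(g) for g in pop])
        # tournament selection: keep the better of two randomly drawn individuals
        a = rng.integers(0, 40, size=40)
        b = rng.integers(0, 40, size=40)
        pop = pop[np.where(scores[a] > scores[b], a, b)]
        # single-point crossover on consecutive pairs
        for i in range(0, 40, 2):
            if rng.random() < crossover_rate:
                cut = int(rng.integers(1, 3))
                pop[i, cut:], pop[i + 1, cut:] = pop[i + 1, cut:].copy(), pop[i, cut:].copy()
        # mutation: reset a random subset of genes
        mask = rng.random(pop.shape) < mutation_rate
        pop[mask] = rng.integers(1, 101, size=int(mask.sum()))

    best = pop[np.argmax([fitness(g) for g in pop])]

How well this does depends almost entirely on those rates and on the representation, which is Paul's point.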

On Sat, Nov 20, 2010 at 4:39 PM, Keith Goodman kwgoodman@gmail.com wrote:
On Fri, Nov 19, 2010 at 7:42 PM, Keith Goodman kwgoodman@gmail.com wrote:
I should make a benchmark suite.
ny.benchit(verbose=False)
Nanny performance benchmark
    Nanny 0.0.1dev
    Numpy 1.4.1
    Speed is numpy time divided by nanny time
    NaN means all NaNs
   Speed   Test                 Shape      dtype    NaN?
   6.6770  nansum(a, axis=-1)   (500,500)  int64
   4.6612  nansum(a, axis=-1)   (10000,)   float64
   9.0351  nansum(a, axis=-1)   (500,500)  int32
   3.0746  nansum(a, axis=-1)   (500,500)  float64
  11.5740  nansum(a, axis=-1)   (10000,)   int32
   6.4484  nansum(a, axis=-1)   (10000,)   int64
  51.3917  nansum(a, axis=-1)   (500,500)  float64  NaN
  13.8692  nansum(a, axis=-1)   (10000,)   float64  NaN
   6.5327  nanmax(a, axis=-1)   (500,500)  int64
   8.8222  nanmax(a, axis=-1)   (10000,)   float64
   0.2059  nanmax(a, axis=-1)   (500,500)  int32
   6.9262  nanmax(a, axis=-1)   (500,500)  float64
   5.0688  nanmax(a, axis=-1)   (10000,)   int32
   6.5605  nanmax(a, axis=-1)   (10000,)   int64
  48.4850  nanmax(a, axis=-1)   (500,500)  float64  NaN
  14.6289  nanmax(a, axis=-1)   (10000,)   float64  NaN
Here's what I get using (my current) np.fmax.reduce in place of nanmax.
Speed   Test                 Shape      dtype    NaN?
3.3717  nansum(a, axis=-1)   (500,500)  int64
5.1639  nansum(a, axis=-1)   (10000,)   float64
3.8308  nansum(a, axis=-1)   (500,500)  int32
6.0854  nansum(a, axis=-1)   (500,500)  float64
8.7821  nansum(a, axis=-1)   (10000,)   int32
1.1716  nansum(a, axis=-1)   (10000,)   int64
5.5777  nansum(a, axis=-1)   (500,500)  float64  NaN
5.8718  nansum(a, axis=-1)   (10000,)   float64  NaN
0.5419  nanmax(a, axis=-1)   (500,500)  int64
2.8732  nanmax(a, axis=-1)   (10000,)   float64
0.0301  nanmax(a, axis=-1)   (500,500)  int32
2.7437  nanmax(a, axis=-1)   (500,500)  float64
0.7868  nanmax(a, axis=-1)   (10000,)   int32
0.5535  nanmax(a, axis=-1)   (10000,)   int64
2.8715  nanmax(a, axis=-1)   (500,500)  float64  NaN
2.5937  nanmax(a, axis=-1)   (10000,)   float64  NaN
I think the really small int32 ratio is due to timing granularity. For random ints in the range 0..99 the results are not quite as good for fmax, which I find puzzling.
Speed   Test                 Shape      dtype    NaN?
3.4021  nansum(a, axis=-1)   (500,500)  int64
5.5913  nansum(a, axis=-1)   (10000,)   float64
4.4569  nansum(a, axis=-1)   (500,500)  int32
6.6202  nansum(a, axis=-1)   (500,500)  float64
7.1847  nansum(a, axis=-1)   (10000,)   int32
2.0448  nansum(a, axis=-1)   (10000,)   int64
6.0257  nansum(a, axis=-1)   (500,500)  float64  NaN
6.3172  nansum(a, axis=-1)   (10000,)   float64  NaN
0.9598  nanmax(a, axis=-1)   (500,500)  int64
3.2407  nanmax(a, axis=-1)   (10000,)   float64
0.0520  nanmax(a, axis=-1)   (500,500)  int32
3.1954  nanmax(a, axis=-1)   (500,500)  float64
1.5538  nanmax(a, axis=-1)   (10000,)   int32
0.3716  nanmax(a, axis=-1)   (10000,)   int64
3.2372  nanmax(a, axis=-1)   (500,500)  float64  NaN
2.5633  nanmax(a, axis=-1)   (10000,)   float64  NaN
Chuck
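For anyone reading along, the fmax trick works because fmax propagates the non-NaN operand, so a ufunc reduce with it behaves like nanmax along an axis (all-NaN slices being the corner case); a quick check with a recent NumPy::

    import numpy as np

    arr = np.random.rand(4, 5)
    arr[arr > 0.5] = np.nan
    np.allclose(np.fmax.reduce(arr, axis=-1), np.nanmax(arr, axis=-1), equal_nan=True)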

On 11/19/10 11:19 AM, Keith Goodman wrote:
On Fri, Nov 19, 2010 at 10:55 AM, Nathaniel Smithnjs@pobox.com wrote:
Why not make this a patch to numpy/scipy instead?
My guess is that having separate underlying functions for each dtype, ndim, and axis would be a nightmare for a large project like Numpy.
True, but:
1) Having special-cases for the most common cases is not such a bad idea.
2) could one use some sort of templating approach to get all the dtypes and such that you want?
3) as for the number of dimensions, I don't think it would be too hard to generalize that -- at least for contiguous arrays.
-Chris

On Fri, Nov 19, 2010 at 3:18 PM, Christopher Barker Chris.Barker@noaa.gov wrote:
Note that the fmax/fmin versions can be sped up in the same way as sum.reduce was. Also, you should pass the flattened array to the routine for the axis=None case.
Chuck
participants (13)
- Benjamin Root
- Charles R Harris
- Christopher Barker
- Gael Varoquaux
- josef.pktd@gmail.com
- Keith Goodman
- Matthieu Brucher
- Nathaniel Smith
- Paul Anton Letnes
- Pauli Virtanen
- Sebastian Walter
- Wes McKinney
- Zachary Pincus