Changing the return type of np.histogramdd
Numpy has three histogram functions - histogram, histogram2d, and histogramdd. histogram is by far the most widely used and, in the absence of weights and normalization, returns an np.intp count for each bin. histogramdd (for which histogram2d is a wrapper) returns np.float64 in all circumstances. As a contrived comparison:
>>> x = np.linspace(0, 1)
>>> h, e = np.histogram(x*x, bins=4); h
array([25, 10,  8,  7], dtype=int64)
>>> h, e = np.histogramdd((x*x,), bins=4); h
array([25., 10.,  8.,  7.])
https://github.com/numpy/numpy/issues/7845 tracks this inconsistency. The fix is now trivial; the question is, will changing the return type break people’s code? Either we should:

1. Just change it, and hope no one is broken by it
2. Add a dtype argument:
   - If dtype=None, behave like np.histogram
   - If dtype is not specified, emit a future warning recommending to use dtype=None or dtype=float
   - In future, change the default to None
3. Create a new better-named function histogram_nd, which can also be created without the mistake that is https://github.com/numpy/numpy/issues/10864.

Thoughts?

Eric
I like option 2.

By the way, we (@ESRF) have re-developed histogram and histogram_nd many times in various projects, in order to get better consistency on the one hand and better performance on the other (re-written in C or C++). I have noticed a noticeable gain in numpy's performance over the last years, but I did not check consistency. The issue is that every bin should be an interval open on the right-hand side, which causes stability issues, as the smallest value greater than the max depends on the input dtype. For example, the smallest value greater than 10 is 11 in int, but 10.000001 in float32 and 10.000000000000002 in float64.

Cheers,

-- Jérôme Kieffer
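[Editor's note: Jérôme's dtype-dependent bin-edge point can be checked directly with np.nextafter; this sketch is not code from the thread.]

```python
import numpy as np

# Smallest representable value strictly greater than 10, per dtype.
# The gap (one ulp at 10) shrinks as precision grows:
f32 = np.nextafter(np.float32(10), np.float32(np.inf))
f64 = np.nextafter(np.float64(10), np.float64(np.inf))
print(f32)  # 10.000001          (float32 ulp at 10 is ~9.5e-7)
print(f64)  # 10.000000000000002 (float64 ulp at 10 is ~1.8e-15)
```

This is why a right-open bin boundary behaves differently depending on the input dtype.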
On Tue, 2018-04-10 at 09:22 +0200, Jerome Kieffer wrote:
> I like the option 2.
>
> By the way, we (@ESRF) re-developed histogram and histogram_nd many times in various projects, for better consistency on the one hand and better performance on the other (re-written in C or C++).
>
> -- Jérôme Kieffer

I think this was a mistake and should be fixed, so option 1 is my preference. A dtype argument might be convenient, but what does that gain over having the user do something like result.astype(np.float64)?

Jérôme, as to performance, I have a PR that pushes the histogramming code into C here: https://github.com/numpy/numpy/pull/9910. After it was submitted, there was a rearrangement of the code which broke the merge to master. I've been meaning to update the PR to get it through, but haven't had the time.

-- John T. Goetz
On Mon, Apr 9, 2018 at 10:24 PM, Eric Wieser wrote:
(1) seems like a no-go; taking such risks isn't justified by a minor inconsistency.

(2) is still fairly intrusive: you're emitting warnings for everyone, and still forcing people to change their code (and if they don't, they may run into a backwards-compat break).

(3) is the best of these options; however, is this really worth a new function?

My vote would be "do nothing".

Ralf
> what does that gain over having the user do something like result.astype()

It means that the user can use integer weights without worrying about losing precision due to an intermediate float representation. It also means they can use higher-precision values (np.longdouble) or complex weights.

> you’re emitting warnings for everyone

When there’s a risk of precision loss, that seems like the responsible thing to do. Users passing float weights would see no warning, I suppose.

> is this really worth a new function

There ought to be a function for computing histograms with integer weights that doesn’t lose precision. Either we change the existing function to do that, or we make a new function.

A possible compromise: like 1, but only change the dtype of the result if a weights argument is passed.

#10864 (https://github.com/numpy/numpy/issues/10864) seems like a worrying design flaw too, but I suppose that can be dealt with separately.

Eric
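[Editor's note: a sketch of the weight-dtype point, not code from the thread. It relies on np.histogram accumulating in the weights' dtype (including complex), which is what histogramdd did not do at the time.]

```python
import numpy as np

x = np.array([0.1, 0.2, 0.7])
w = np.array([1 + 1j, 2 + 0j, 3j])  # complex weights

# np.histogram preserves the weights' dtype, so no lossy
# intermediate float64 representation is forced on the user:
h, edges = np.histogram(x, bins=2, weights=w)
print(h)        # [3.+1.j 0.+3.j]
print(h.dtype)  # complex128
```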
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Wed, Apr 25, 2018 at 10:07 PM, Eric Wieser wrote:

> It means that the user can use integer weights without worrying about losing precision due to an intermediate float representation. It also means they can use higher precision values (np.longdouble) or complex weights.

None of that seems particularly important, to be honest.

> When there’s a risk of precision loss, that seems like the responsible thing to do.

For precision loss of the order of float64 eps, I disagree. There will be many such places in numpy and in other core libraries.

> There ought to be a function for computing histograms with integer weights that doesn’t lose precision. Either we change the existing function to do that, or we make a new function.

It's also possible to refer users to scipy.stats.binned_statistic(_2d/dd), which provides a superset of the histogram functionality and is internally consistent because the implementations of 1d/2d call the dd one.

Ralf
> For precision loss of the order of float64 eps, I disagree.

I was thinking more about precision loss on the order of 1, for large 64-bit integers that can’t fit in a float64.

Note also that #10864 (https://github.com/numpy/numpy/issues/10864) incurs deliberate precision loss of the order 10**-6 x smallest bin, which is also much larger than eps.

> It’s also possible to refer users to scipy.stats.binned_statistic

That sounds like a good idea to do irrespective of whether histogramdd has problems - I had no idea those existed. Is there a precedent for referring to more feature-rich scipy functions from the basic numpy ones?
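[Editor's note: the "order 1" loss Eric means appears as soon as an integer exceeds float64's 53-bit significand; this sketch is not from the thread.]

```python
# float64 has a 53-bit significand, so 2**53 is the first point
# where consecutive integers stop being exactly representable:
assert int(float(2**53)) == 2**53          # exact round-trip
assert int(float(2**53 + 1)) == 2**53      # rounds: absolute error of 1
```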
On Wed, Apr 25, 2018 at 11:00 PM, Eric Wieser wrote:

> I was thinking more about precision loss on the order of 1, for large 64-bit integers that can’t fit in a float64

It's late and I'm probably missing something, but:

>>> np.iinfo(np.int64).max > np.finfo(np.float64).max
False

Either way, such weights don't really happen in real code, I think.

> Note also that #10864 (https://github.com/numpy/numpy/issues/10864) incurs deliberate precision loss of the order 10**-6 x smallest bin, which is also much larger than eps.

Yeah, that's worse.

> That sounds like a good idea to do irrespective of whether histogramdd has problems - I had no idea those existed. Is there a precedent for referring to more feature-rich scipy functions from the basic numpy ones?

Yes, there are cross-links to Python, SciPy and Matplotlib functions in the docs. This is done with intersphinx (https://github.com/numpy/numpy/blob/master/doc/source/conf.py#L215). Example cross-link for convolve: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.convolve.h...

Ralf
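[Editor's note: a minimal sketch of what an intersphinx setup looks like in a Sphinx conf.py; the exact keys and URLs used in numpy's own conf.py may differ.]

```python
# In a Sphinx conf.py: lets references like scipy.stats.binned_statistic
# resolve to SciPy's hosted documentation from within numpy's docs.
extensions = ['sphinx.ext.intersphinx']

intersphinx_mapping = {
    'python': ('https://docs.python.org/3/', None),
    'scipy': ('https://docs.scipy.org/doc/scipy/', None),
    'matplotlib': ('https://matplotlib.org/stable/', None),
}
```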
> It’s late and I’m probably missing something

The issue is not one of range, as you showed there, but of precision. Here’s the test case you’re missing:

def get_err(u64):
    """Return the absolute error incurred by storing a uint64 in a float64."""
    u64 = np.uint64(u64)
    return u64 - u64.astype(np.float64).astype(np.uint64)

The problem starts appearing with

>>> get_err(2**53 + 1)
1

and only gets worse as the size of the integers increases:

>>> get_err(2**64 - 2*10)
9223372036854775788  # a lot bigger than float64 eps (although as a relative error, it's similar)

> Either way, such weights don’t really happen in real code I think.

The counterexample I can think of is someone trying to implement fixed-precision arithmetic with large integers. The intersection of people doing both that and histogramdd is probably very small, but it’s at least plausible.

> Yes, there are cross-links to Python, SciPy and Matplotlib functions in the docs.

Great, that was what I was unsure of. I was worried that linking to upstream projects would be sort of weird, but practicality beats purity for sure here.

Eric
participants (4)

- Eric Wieser
- Jerome Kieffer
- John T. Goetz
- Ralf Gommers