[Python-ideas] Consider adding clip or clamp function to math

Thu Aug 4 09:20:28 EDT 2016

On Wed, Aug 03, 2016 at 11:23:06AM +1200, Greg Ewing wrote:
> David Mertz wrote:
> >It really doesn't make sense to me that a clamp() function would *limit 
> >to* a NaN.

That's what I thought too, at first, but on reading more about the 
IEEE-754 standard, I've changed my mind. Passing a NAN as a bound can be 
interpreted as "bound is missing", i.e. "no bound".

> Keep in mind that the NaNs involved have probably arisen
> from some other computation that went wrong, and that
> the purpose of the whole NaN system is to propagate an
> indication of that wrongness so that it's evident in the
> final result.

That's not quite right. NANs are allowed to "disappear". In fact, 
Professor Kahan has specifically written that NANs which cannot disappear 
out of a calculation are useless:

    Were there no way to get rid of NaNs, they would be as useless as
    Indefinites on CRAYs; as soon as one were encountered, computation
    would be best stopped rather than continued for an indefinite time
    to an Indefinite conclusion. That is why some operations upon NaNs
    must deliver non-NaN results. Which operations?

Page 8, https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF

He describes some of the conditions under which a NAN might drop out of 
a calculation. He also says that min(NAN, x) and max(NAN, x) should 
both return x, which implies that clamp(x, NAN, NAN) should return x too.
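
To make the min/max rule concrete, here is a rough sketch of NAN-dropping 
helpers. The names nan_min and nan_max are made up for illustration; they 
are not existing functions. Note that Python's builtin min() and max() do 
not behave this way: their result with a NAN depends on argument order.

```python
import math

def nan_min(a, b):
    # Hypothetical helper: a NAN operand is dropped in favour of the other.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a < b else b

def nan_max(a, b):
    # Hypothetical helper, mirror image of nan_min.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b

nan = float('nan')
print(nan_min(nan, 3.0), nan_max(3.0, nan))   # both give 3.0
```

If both operands are NANs there is nothing else to return, so the result 
is a NAN.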

> So here's how I see it:
> 
> clamp(NaN, y, z) is asking "Is an unknown number between
> y and z?" The answer to that is not known, so the result
> should be NaN.

I agree, and fortunately that's easily performed without any explicit 
test for NAN-ness. Given x = float('nan'), neither x < lower nor x > 
upper will ever be true, no matter what the lower and upper bounds are. 
So we'll fall through to the default and return x, which is a NAN, as 
wanted.
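
A minimal sketch of that fall-through, assuming a plain comparison-based 
clamp (the name and signature are illustrative only, not a settled API):

```python
import math

def clamp(x, lower, upper):
    # Illustrative only: plain comparisons, no explicit test for NAN.
    if x < lower:
        return lower
    if x > upper:
        return upper
    # Every ordering comparison involving a NAN is false,
    # so a NAN x always ends up here and is returned unchanged.
    return x

result = clamp(float('nan'), 0.0, 1.0)
print(math.isnan(result))   # True
```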

> clamp(x, y, NaN) is asking "Is x between y and an unknown
> number?" If x > y, the answer to that is not known, so the
> result should be NaN.

No, that's not necessarily right. That's one possible interpretation of 
setting a bound to NAN. I've seen that referred to as "NAN poisoning", 
and it is a reasonable thing to ask for. But... 

...another interpretation, and one which is closer to the current revision 
of the IEEE-754 standard, is that clamp(x, NAN, NAN) should treat the 
NANs as "missing values", i.e. that there is no lower or upper bound. 
That would be equivalent to specifying infinities as bounds.

If you want a NAN-poisoning version of clamp(), it is easy to build it 
from a NAN-as-missing-value clamp(). If you start with NAN-poisoning, 
you can't easily get NANs-as-missing-values. So if we get only one, we 
should treat NANs as missing values, and let people build the 
NAN-poisoning version as a wrapper.
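
As a rough sketch of such a wrapper: both function names below are 
hypothetical, and the inner clamp() is just one possible way to get 
NAN-as-missing-value behaviour, relying on comparisons with a NAN always 
being false.

```python
import math

def clamp(x, lower, upper):
    # NAN-as-missing-value: a NAN bound never compares true, so it is ignored.
    if x < lower:
        return lower
    if x > upper:
        return upper
    return x

def clamp_poisoning(x, lower, upper):
    # Hypothetical wrapper: any NAN bound poisons the result.
    if math.isnan(lower) or math.isnan(upper):
        return float('nan')
    return clamp(x, lower, upper)

nan = float('nan')
print(clamp(5.0, nan, 1.0))             # 1.0, only the upper bound applies
print(clamp_poisoning(5.0, nan, 1.0))   # nan
```

Going the other way is harder: once the poisoning version has returned a 
NAN, the information about which bound was missing is gone.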

> If x < y, you might argue that the result should be y.
> But consider clamp(x, 2, 1). You're asking it to limit
> x to a value not less than 2 and not greater than 1.
> There's no such number, so arguably the result should
> be NaN.

In that case, I would raise ValueError. 

> So in summary, I think it should be:
> 
> clamp(NaN, y, z) --> NaN

Agreed. It couldn't reasonably be anything else.

> clamp(x, NaN, z) --> NaN
> clamp(x, y, NaN) --> NaN

No, both these cases should treat NAN as equivalent to no limit, and 
clamp x as appropriate. If you want a second, NAN-poisoning clamp(), 
that's your prerogative, but don't force it upon everyone.

> clamp(x, y, z) --> NaN if z < y

That's a clear error, and it should raise immediately. I see no 
advantage to returning NAN in this case.

Think about why you're clamping. It's unlikely to be used just once, for 
a single calculation. You're likely to be clamping a whole series of 
values, with fixed lower and upper bounds. The bounds are unlikely to 
be known at compile-time, but they aren't going to change from clamping 
to clamping. Something like this:

lower, upper = get_bounds()
for x in values():
    y = some_calculation(x)
    y = clamp(y, lower, upper)
    do_something_with(y)

is the most likely use-case, I think.

If lower happens to be greater than upper, that's clearly a mistake. It's 
better to get an exception immediately, rather than run through a 
million calculations and only then discover that you've ended up with a 
million NANs. It's okay if you get a few NANs, that simply indicates 
that one of your x values was a NAN, or a calculation produced a NAN. 
But if *every* calculation produces a NAN, well, that's a sign of 
breakage. Hence, better to raise straight away.
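
A sketch of what raising straight away could look like, again assuming a 
comparison-based clamp with an illustrative name and signature. The error 
message is made up. Because lower > upper is false whenever either bound 
is a NAN, the check does not interfere with NAN-as-missing-value bounds.

```python
def clamp(x, lower, upper):
    # Inverted bounds are a programming error: fail on the first call
    # instead of quietly producing a NAN for every value.
    if lower > upper:
        raise ValueError("lower bound must not be greater than upper bound")
    if x < lower:
        return lower
    if x > upper:
        return upper
    return x

try:
    clamp(1.5, 2, 1)
except ValueError as err:
    print("rejected:", err)
```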

-- 
Steve