[Python-ideas] Consider adding clip or clamp function to math
Steven D'Aprano
steve at pearwood.info
Thu Aug 4 09:20:28 EDT 2016
On Wed, Aug 03, 2016 at 11:23:06AM +1200, Greg Ewing wrote:
> David Mertz wrote:
> >It really doesn't make sense to me that a clamp() function would *limit
> >to* a NaN.
That's what I thought too, at first, but on reading more about the
IEEE-754 standard, I've changed my mind. Passing a NAN as a bound can be
interpreted as "bound is missing", i.e. "no bound".
> Keep in mind that the NaNs involved have probably arisen
> from some other computation that went wrong, and that
> the purpose of the whole NaN system is to propagate an
> indication of that wrongness so that it's evident in the
> final result.
That's not quite right. NANs are allowed to "disappear". In fact,
Professor Kahan has specifically written that NANs which cannot disappear
out of a calculation are useless:
    Were there no way to get rid of NaNs, they would be as useless as
    Indefinites on CRAYs; as soon as one were encountered, computation
    would be best stopped rather than continued for an indefinite time
    to an Indefinite conclusion. That is why some operations upon NaNs
    must deliver non-NaN results. Which operations?

    Page 8, https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
He describes some of the conditions under which a NAN might drop out of
a calculation. He also says that min(NAN, x) and max(NAN, x) should
both return x, which implies that so should clamp(x, NAN, NAN).
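
As a rough sketch of that min/max rule (the names nan_min and nan_max
are made up for illustration, and float arguments are assumed), note
that Python's built-in min() and max() do not behave this way: their
result with a NAN depends on argument order.

```python
import math

nan = float('nan')

def nan_min(a, b):
    # Hypothetical helper: if exactly one argument is a NAN, return the other.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)

def nan_max(a, b):
    # Hypothetical helper: same rule as nan_min, for the maximum.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)

x = 2.5
# With both bounds NAN, limiting x from below and then from above leaves x alone.
unbounded = nan_min(nan_max(x, nan), nan)
```

This only illustrates the case where the bounds are NANs. It is not a
full clamp(), since composing these helpers would also make a NAN x
disappear.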
> So here's how I see it:
>
> clamp(NaN, y, z) is asking "Is an unknown number between
> y and z?" The answer to that is not known, so the result
> should be NaN.
I agree, and fortunately that's easily performed without any explicit
test for NAN-ness. Given x = float('nan'), neither x < lower nor x >
upper will ever be true, no matter what the lower and upper bounds are.
So we'll fall through to the default and return x, which is a NAN, as
wanted.
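
A minimal sketch of that fall-through, assuming a plain
comparison-based clamp() (this is an illustration, not an agreed
implementation):

```python
import math

def clamp(x, lower, upper):
    # No explicit NAN test: every ordering comparison involving a NAN is False.
    if x < lower:
        return lower
    if x > upper:
        return upper
    # Reached when x is within the bounds, or when x is a NAN.
    return x

result = clamp(float('nan'), 0.0, 1.0)
still_nan = math.isnan(result)
```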
> clamp(x, y, NaN) is asking "Is x between y and an unknown
> number?" If x > y, the answer to that is not known, so the
> result should be NaN.
No, that's not necessarily right. That's one possible interpretation of
setting a bound to NAN. I've seen that referred to as "NAN poisoning",
and it is a reasonable thing to ask for. But...
...another interpretation, and one which is closer to the current revision
of the IEEE-754 standard, is that clamp(x, NAN, NAN) should treat the
NANs as "missing values", i.e. that there is no lower or upper bound.
That would be equivalent to specifying infinities as bounds.
If you want a NAN-poisoning version of clamp(), it is easy to build it
from a NAN-as-missing-value clamp(). If you start with NAN-poisoning,
you can't easily get NANs-as-missing-values. So if we get only one, we
should treat NANs as missing values, and let people build the
NAN-poisoning version as a wrapper.
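
One way such a wrapper might look, as a sketch only. The name
clamp_poisoning is hypothetical, and the inner clamp() is assumed to
be a simple comparison-based one in which a NAN bound never limits x:

```python
import math

def clamp(x, lower, upper):
    # NAN-as-missing-value: comparisons against a NAN bound are always
    # False, so a NAN bound acts like "no bound".
    if x < lower:
        return lower
    if x > upper:
        return upper
    return x

def clamp_poisoning(x, lower, upper):
    # Hypothetical wrapper: a NAN in either bound poisons the result.
    if math.isnan(lower) or math.isnan(upper):
        return float('nan')
    return clamp(x, lower, upper)
```

Going the other way means replacing each NAN bound with a suitable
infinity before calling the poisoning version, which is the less
convenient direction.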
> If x < y, you might argue that the result should be y.
> But consider clamp(x, 2, 1). You're asking it to limit
> x to a value not less than 2 and not greater than 1.
> There's no such number, so arguably the result should
> be NaN.
In that case, I would raise ValueError.
> So in summary, I think it should be:
>
> clamp(NaN, y, z) --> NaN
Agreed. It couldn't reasonably be anything else.
> clamp(x, NaN, z) --> NaN
> clamp(x, y, NaN) --> NaN
No, both these cases should treat NAN as equivalent to no limit, and
clamp x as appropriate. If you want a second, NAN-poisoning clamp(),
that's your prerogative, but don't force it upon everyone.
> clamp(x, y, z) --> NaN if z < y
That's a clear error, and it should raise immediately. I see no
advantage to returning NAN in this case.
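
A sketch of the fail-fast check, assuming the comparison-based
clamp() discussed in this thread (the error message is illustrative):

```python
def clamp(x, lower, upper):
    # If either bound is a NAN this comparison is False, so NAN bounds
    # (meaning "no bound") do not trigger the error.
    if lower > upper:
        raise ValueError("lower bound must not be greater than upper bound")
    if x < lower:
        return lower
    if x > upper:
        return upper
    return x
```

With this check, clamp(x, 2, 1) raises ValueError on the first call
instead of returning a value.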
Think about why you're clamping. It's unlikely to be used just once, for
a single calculation. You're likely to be clamping a whole series of
values, with fixed lower and upper bounds. The bounds are unlikely to
be known at compile-time, but they aren't going to change from clamping
to clamping. Something like this:
    lower, upper = get_bounds()
    for x in values():
        y = some_calculation(x)
        y = clamp(y, lower, upper)
        do_something_with(y)
is the most likely use-case, I think.
If lower happens to be greater than upper, that's clearly a mistake. It's
better to get an exception immediately, rather than run through a
million calculations and only then discover that you've ended up with a
million NANs. It's okay if you get a few NANs, that simply indicates
that one of your x values was a NAN, or a calculation produced a NAN.
But if *every* calculation produces a NAN, well, that's a sign of
breakage. Hence, better to raise straight away.
--
Steve