On 12/30/19 11:54 AM, David Mertz wrote:
On Mon, Dec 30, 2019 at 3:32 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Dec 29, 2019, at 23:50, Steven D'Aprano <steve@pearwood.info> wrote:
>
> On Sun, Dec 29, 2019 at 06:23:03PM -0800, Andrew Barnert via Python-ideas wrote:
>
>> Likewise, it’s even easier to write ignore-nan yourself than to write the DSU yourself:
>>
>> median = statistics.median(x for x in xs if not x.isnan())
>
> Try that with xs = [1, 10**400, 2] and come back to me.
Presumably the end user (unlike the statistics module) knows what data they have.
No, Steven is right here. In Python we might very sensibly mix numeric datatypes. But this means we need an `is_nan()` function like some discussed in these threads, rather than relying on a method (and one that does not behave the same as math.isnan()).
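As a rough illustration of what such a function might look like, here is a minimal sketch. The name `is_nan` and its exact behavior are assumptions for illustration, not an existing standard-library API. It avoids calling math.isnan() on exact types, since math.isnan(10**400) raises OverflowError when the int is converted to float, and it uses Decimal.is_nan() so that signaling NaNs are caught as well.

```python
import math
from decimal import Decimal
from fractions import Fraction
from numbers import Rational


def is_nan(x):
    """Hypothetical helper: True if x is a NaN of a common numeric type."""
    if isinstance(x, Rational):
        # int, bool and Fraction are exact and can never be NaN.
        # Checking this first avoids OverflowError for huge ints.
        return False
    if isinstance(x, Decimal):
        # Decimal.is_nan() is True for both quiet and signaling NaN.
        return x.is_nan()
    if isinstance(x, complex):
        return math.isnan(x.real) or math.isnan(x.imag)
    try:
        return math.isnan(x)
    except (TypeError, OverflowError):
        # Not convertible to float, so treat it as "not a NaN".
        return False
```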
E.g.:
my_data = {'observation1': 10**400,   # really big amount
           'observation2': 1,         # ordinary size
           'observation3': 2.0,       # ordinary size
           'observation4': math.nan   # missing data
           }
median = statistics.median_high(x for x in my_data if not is_nan(x))
The answer '2.0' is plainly right here, and there's no reason we shouldn't provide it.
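The example can be checked with a short, self-contained sketch. Two assumptions are made here: the filter is applied to the dict's values (iterating the dict itself would yield the keys), and `is_nan` is a simple stand-in based on self-inequality, which is enough for float NaN but is not a complete NaN test for every numeric type.

```python
import math
import statistics

my_data = {
    'observation1': 10**400,   # really big amount
    'observation2': 1,         # ordinary size
    'observation3': 2.0,       # ordinary size
    'observation4': math.nan,  # missing data
}


def is_nan(x):
    # Stand-in for illustration: a float NaN compares unequal to itself.
    return x != x


# Python compares int and float exactly, so 10**400 sorts above 2.0
# without any conversion to float.
values = [x for x in my_data.values() if not is_nan(x)]
result = statistics.median_high(values)
print(result)  # 2.0
```

With the NaN filtered out, the three remaining values sort as 1, 2.0, 10**400, and median_high picks the middle one.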
My preference is that interpreting NaN as Missing Data isn't appropriate for the statistics module. In your code, because you put the filter in, YOU added that meaning, which is OK, but I see no grounds to say that statistics.median(my_data) MUST be 2.0, and several other logical results have been presented. For instance, if your last point had been computed as 1e400-1e399, which results in a nan, then 2.0 is NOT the reasonable answer. Going by the numbers (before we lost precision to the subtraction of infinities), the answer should be inf, or maybe something close to 4.5e399 had the e-notation numbers not overflowed to infinity but stayed as bignums or Decimals.

Since Python DOES support mixed-type arrays, I see no reason that Python needs to adopt the ancient, domain-specific (and not universal within that domain) usage of nan as missing data. Instead, the Python idiom should more likely be something like None (which also gets around the difficulty of detecting the multiple forms of nan).

Now one issue with your example, which may be the point, is that the documentation of median currently says it does NOT support mixed-type lists like the one given above, yet it does seem to handle them as long as the comparison function gives reasonable results. I suspect that there are some combinations of extreme values of differing types where the comparison function fails, and I am not sure there is an easy solution to make ALL the Number classes always comparable to each other. One issue is that the type in which the comparison can be done most efficiently is value dependent (it depends on the magnitudes and on how close together the values are).

-- 
Richard Damon