[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

Raymond Hettinger report at bugs.python.org
Wed Feb 27 02:16:35 EST 2019


Raymond Hettinger <raymond.hettinger at gmail.com> added the comment:

> Are you happy guaranteeing that it will always be the first
> mode encountered?

Yes.  

All of the other implementations I looked at make some guarantee about which mode is returned.  Maple, Matlab, and Excel all return the first encountered.¹  That is convenient for us because it is what Counter(data).most_common(1) already does and does cheaply (single pass, no auxiliary memory beyond the counts themselves).  It also matches what a number of our other tools do:

>>> max(3, 3.0)       # 3 is first encountered
3
>>> max(3.0, 3)       # 3.0 is first encountered
3.0
>>> list(dict.fromkeys('aabbaacc'))[0] # 'a' is first encountered
'a'
>>> sorted([3, 3.0])[0]  # 3 is first encountered (due to sort stability)
3
>>> sorted([3.0, 3])[0]  # 3.0 is first encountered (due to sort stability)
3.0
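
The same tie-breaking can be shown for the Counter-based approach itself. The following is a minimal sketch, not necessarily the code that will land in the statistics module; the name first_mode is hypothetical, and the behavior assumes Python 3.7+, where Counter keeps keys in insertion order.

```python
from collections import Counter

def first_mode(data):
    """Return the most common value; ties go to the first one encountered."""
    # Counter records keys in the order they are first seen, and
    # most_common(1) keeps the earliest key among equal counts.
    pairs = Counter(iter(data)).most_common(1)
    if not pairs:
        raise ValueError("no mode for empty data")
    return pairs[0][0]

print(first_mode([1, 1, 2, 2]))   # 1 is first encountered
print(first_mode([2, 2, 1, 1]))   # 2 is first encountered
```

Only hashability is required of the data here; the values never need to be compared for order.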

¹ SciPy returned the smallest value rather than the first value, but its algorithm was sort-based to accommodate running a parallel mode() computation on multiple columns of an array. For us, that approach would be much slower, would require more memory, and would require more bookkeeping.
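
For contrast, a sort-based mode that returns the smallest value on ties might look roughly like the sketch below. This is an illustration in pure Python only, not SciPy's actual code; the name smallest_mode is hypothetical.

```python
from itertools import groupby

def smallest_mode(data):
    """Return the most common value; ties go to the smallest value."""
    ordered = sorted(data)          # O(n log n) time plus a full copy of the data
    if not ordered:
        raise ValueError("no mode for empty data")
    best_value, best_count = None, 0
    # Equal values are adjacent after sorting, so count each run.
    for value, run in groupby(ordered):
        count = sum(1 for _ in run)
        if count > best_count:      # strict '>' keeps the smallest value on ties
            best_value, best_count = value, count
    return best_value

print(smallest_mode([2, 2, 1, 1]))  # 1, regardless of input order
```

Besides the extra sorting cost and the copy, this version requires the values to be orderable, which the counting approach does not.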

P.S. I'm no longer thinking this should be saved for Pycon sprints.  That is right at the beta 1 feature freeze.  We should aim to get this resolved well in advance of that.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35892>
_______________________________________

