[New-bugs-announce] [issue45851] statistics.multimode is inefficient (time and space) (mode somewhat, too)

Fri Nov 19 20:42:08 EST 2021

New submission from Stefan Pochmann <stefan.pochmann at gmail.com>:

The current implementation is:

def multimode(data):
    counts = Counter(iter(data)).most_common()
    maxcount, mode_items = next(groupby(counts, key=itemgetter(1)), (0, []))
    return list(map(itemgetter(0), mode_items))

The most_common() does a complete sort of Counter item tuples, taking O(n log n) time and quite big O(n) extra space (mostly for all those tuples).

When Raymond Hettinger suggested it in https://bugs.python.org/issue35892#msg336338 he said it should have "running speed that is optimal for the desired result". But then he detailed that with "Slow O(n log n), loads all data in memory, full sort".

Which seems like a mistake, as that's not optimal. It's easy to do in O(n) time and O(1) extra memory (in addition to the Counter and result, I mean):

def multimode(data):
    counts = Counter(iter(data))
    if not counts:
        return []
    maxcount = max(counts.values())
    return [value for value, count in counts.items() if count == maxcount]

If there are only very few *different* values then the time/space after creating the Counter is insignificant compared to the Counter creation. But if there are many different values, it can be significant.

statistics.mode takes O(n) time and O(1) space, which is optimal, but I found an apparently faster way anyway (code at end).

For example for data = random.choices(range(n), k=n):

           |     multimode     |       mode
     n     | current  proposal | current  proposal
-----------+-------------------+------------------
         1 |    131%    70%    |    125%   58%
        10 |    144%    73%    |    119%   53%
       100 |    126%    71%    |    108%   29%
     1,000 |    123%    65%    |     62%   22%
    10,000 |    172%    55%    |     53%   18%
   100,000 |    164%    44%    |     55%   20%
 1,000,000 |     85%    20%    |     22%    4%
10,000,000 |     56%    12%    |     11%    4%

All four start with Counter(iter(data)), so I took that as baseline and the above results show relative additional times. For example 55% means if Counter construction alone took 10 seconds, the function took 15.5 seconds.

An extreme case, data = list(range(n)):

           |     multimode    |       mode
     n     | current proposal | current proposal
-----------+------------------+-----------------
         1 |   128%   67%     |   124%   56%
        10 |   187%   93%     |   141%   52%
       100 |   316%  149%     |   181%   45%
     1,000 |   380%  174%     |   213%   46%
    10,000 |   349%  111%     |   146%   30%
   100,000 |   397%  128%     |   159%   34%
 1,000,000 |   336%   95%     |   112%   24%
10,000,000 |   349%   97%     |   109%   23%

I also tried a bunch of other cases, didn't find one where my versions weren't quite a bit faster.

My mode() version:

from operator import indexOf
from itertools import islice

def mode(data):
    counts = Counter(iter(data))
    if not counts:
        raise StatisticsError('no mode for empty data') from None
    maxcount = max(counts.values())
    index = indexOf(counts.values(), maxcount)
    return next(islice(counts, index, None))

----------
components: Library (Lib)
messages: 406638
nosy: Stefan Pochmann
priority: normal
severity: normal
status: open
title: statistics.multimode is inefficient (time and space) (mode somewhat, too)
type: performance
versions: Python 3.10

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45851>
_______________________________________