[Python-ideas] Pre-PEP: adding a statistics module to Python

Wed Aug 7 23:29:49 CEST 2013

On 7 August 2013 17:01, Andrew Barnert <abarnert at yahoo.com> wrote:
> On Aug 7, 2013, at 4:10, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:
>
> On Aug 6, 2013 11:19 PM, "Andrew Barnert" <abarnert at yahoo.com> wrote:
>>
>> On Aug 6, 2013, at 12:44, Michele Lacchia <michelelacchia at gmail.com>
>> wrote:
>>>
>>> Yes but then you lose all the advantages of iterators. What's the point
>>> in that?
>>> Furthermore it's not guaranteed that you can always convert an
>>> iterator into a list. As has already been said, you could run out of
>>> memory, for instance.
>>
>> And the places where the stdlib/builtins do that automatic
>> conversion--even when it's well motivated and almost always harmless once
>> you think about it, like str.join--are surprising to most people. (Following
>> up on str.join as an example, just about every question whose answer is
>> str.join([...]) ends up with someone suggesting a genexpr instead of a
>> listcomp, someone else explaining that it doesn't actually save any memory
>> in that case, just wastes a bit of time, then some back and forth until
>> everyone finally gets it.)
>>
>> The question is whether it would be even _more_ surprising to return an
>> error, or a less accurate result. I don't know the answer to that.
>
> I'm going to make the claim (with no supporting data) that more than 95% of
> the time, when a user calls variance(iterator) they will be guilty of
> premature optimisation.
>
> I think you're probably right. In the similar cases that come up with, e.g.,
> str.join(iterator), there is usually no reason whatsoever to believe that
> any memory or speed cost will make any difference. Often people get into
> arguments over a half dozen strings (where, even if it _did_ matter, which
> it doesn't, N is so low that algorithmic complexity isn't even relevant).
>
> Really the cases where you can't build a collection are rare. People will
> still do it though just because it's satisfying to do everything with
> iterators in constant memory (I'm often guilty of this kind of thing).
>
> Or so that a sequence of operations can be pipelined, possibly leading to
> better cache behavior. Or just because iterators are the pythonic (or
> python3-ic?) way to do it.
>
> However unlike str.join there's no one pass algorithm that can be as
> accurate so it's not purely a performance question.
>
> But the point is that str.join doesn't use a one-pass algorithm, it just
> constructs a list so it can do it in two passes. And it's been suggested on
> this thread that variance could easily do the same thing.
>
> So there are three choices. Using a one-pass algorithm would be surprising
> because it's less accurate. Automatic listification would be surprising
> because you went out of your way to pass lazy iterators around and variance
> broke the benefits. An exception would be surprising because almost every
> other function in the stdlib that takes lists also takes iterators, even
> when there are good reasons not to.
>
> I think you still may be right that the error is the way to go. You'll learn
> the problem quickly, and the workaround will be obvious, and the reason for
> it will be available in the docs. The other two potential surprises may not
> be as discoverable.

My preference is either a documentation warning and nothing more, *or*
automatic coercion to a list. Anything else treats this issue as
a *way* bigger deal than it currently is.

The advantage of the first is that it allows one-pass algorithms. The
advantage of the second is correctness by default. Whichever is
preferred, I'd really rather not do some of the other workarounds like
extra arguments or raising errors.
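
To make the second option concrete, here is a minimal sketch of what
automatic coercion could look like. The name variance_coercing and the
"iter(data) is data" check are my own illustration, not anything from
the pre-PEP, and the sketch assumes any non-iterator input supports
len():

```python
def variance_coercing(data):
    """Sample variance; hypothetical sketch, not the proposed API."""
    # An iterator returns itself from iter(); lists, tuples, etc. do not.
    if iter(data) is data:
        data = list(data)  # costs O(n) memory but allows a second pass
    n = len(data)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    mean = sum(data) / n                      # first pass
    ss = sum((x - mean) ** 2 for x in data)   # second pass
    return ss / (n - 1)
```

With this, variance_coercing([1, 2, 3, 4]) and
variance_coercing(iter([1, 2, 3, 4])) take the same two-pass path and
return the same value. The cost is that a lazy input is fully
materialised in memory.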

One-pass algorithms are important if speed is important.

Correctness by default is important because this is a library that
values correctness over speed (...I think we have our answer). Losing
correctness to allow for faster algorithms is, in my opinion, anathema
to the purpose of this library.
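
For anyone who wants to see the accuracy gap, here is a rough sketch
comparing a two-pass calculation with the textbook one-pass
sum-of-squares formula. Both function names are made up for this
example, and the naive formula is only one possible one-pass method
(more careful ones exist), so treat it as an illustration rather than
a benchmark:

```python
def variance_two_pass(data):
    """Two-pass sample variance over a list (hypothetical helper)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def variance_one_pass_naive(iterable):
    """One-pass sample variance via running sums (hypothetical helper).

    Works on any iterable in constant memory, but subtracts two huge,
    nearly equal numbers at the end, which loses precision.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in iterable:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


# Large offset, small spread: the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("two-pass:", variance_two_pass(data))
print("one-pass:", variance_one_pass_naive(iter(data)))
```

The two-pass version recovers 30.0 here because the deviations from the
mean are small integers. The naive one-pass version works with squares
around 1e18, beyond what a double can hold exactly, so its result is
far from 30.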