[Python-ideas] Pre-PEP: adding a statistics module to Python

Mon Aug 5 04:02:38 CEST 2013

On 5 August 2013 02:40, Steven D'Aprano <steve at pearwood.info> wrote:

> On 04/08/13 22:51, Eli Bendersky wrote:
>
>> On Sun, Aug 4, 2013 at 12:07 AM, Ethan Furman <ethan at stoneleaf.us> wrote:
>>
>>> On 08/03/2013 07:00 PM, Eli Bendersky wrote:
>>>
>>>>
>>>>
>>>> While I'm somewhat -0.5 on the general idea of the statistics module
>>>> (competing with well-established, super-optimized and
>>>> by-themselves-famous numeric libraries Python has does not sound like
>>>> a worthy goal),
>>>>
>>>
>>>
>>> Sure, competing with already established libraries is silly.
>>> Fortunately, that's not what is happening here.  This PEP is about
>>> providing a minimal, common set of statistics functions for the
>>> average person.
>>>
>>
>> I'm really not sure who this average person is, but everyone keeps
>> talking about him. Is it the same person for whom Dummies books are
>> written?
>>
>> Anyhow, "minimal" is a dangerous slope. With such a module in the
>> stdlib, I'm 100% sure we'll get a constant stream of - please add just
>> this function (from SciPy) - it's so useful to the "average person" -
>> requests. This is unavoidable. And it will be difficult to judge at
>> that point why certain functionality belongs or does not belong here.
>> So over time we'll end up with a partial Greenspun, by containing an
>> ad hoc, slow implementation of half of Numpy/SciPy.
>>
>
> [only half serious]
> Perhaps we should have a pure-Python implementation of numpy/scipy, for
> non-C based Pythons. If I recall correctly, PyPy had to engage in a massive
> effort to get numpy even partially working. The pure-Python part of the
> stdlib is not just the stdlib for CPython, but potentially for the entire
> Python universe.
>
>
>
>> Efforts are better spent in writing a new tutorial on Numpy that shows
>> how to do the stuff statistics.py does. Call it "Numpy statistics for
>> the average person".
>>
>
> That does not help those who are unable to install numpy due to
> restrictive policies about what software can be installed.
>
> The choice is not either statistics or better tutorials. We can have both,
> if somebody volunteers to write those tutorials, or neither. I am not
> volunteering to write numpy tutorials.
>
>
>
>>>> I have to agree with Alexander w.r.t. "sum". Strongly
>>>> -1 from me on having functions with the same name as existing stdlib
>>>> functions but different functionality. This is very much unpythonic.
>>>>
>>>
>>>
>>> I thought the whole point of name spaces was to be able to have the same
>>> name mean different things in different contexts.  Surely no one
>>> expects to be able to use `webbrowser.open` or `gzip.open` anywhere
>>> `open` can be used.
>>>
>>
>> This is not a fair comparison. As a pop quiz, try to imagine the
>> difference between 'open' and 'gzip.open' - do you immediately come up
>> with the differences in their functionalities? Now, how about 'sum'
>> and 'statistics.sum'?
>>
>
> As far as gzip.open goes, I have no idea. Like most people, I expect that
> there is some difference -- perhaps it only works on gzip files? is the API
> different in some way? -- but beyond that vague idea that "it is in a
> different module, therefore it must be different *somehow*" I have no
> idea how it actually differs from the built-in, or codecs.open. I would
> have to look them up to find out what the differences actually are.
>
> I expect that any even moderately competent user will think the same way:
> "statistics.sum is in a different module, presumably it is different
> somehow, I should look it up to find out how".
>

As I'd said somewhere earlier, the name should be such that you only have
to know the name to know whether it's relevant. I don't believe you if you
say you thought gzip.open had nothing to do with gzip -- you know at least
that you can ignore it until you're interested in gzip files.
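
To illustrate the point about the name telling you when a function is
relevant, here is a small sketch using only the standard library. The file
name "demo.gz" and the payload are made up for the example. It writes the
same bytes through gzip.open and then reads the file back with both the
built-in open and gzip.open:

```python
import gzip
import os
import tempfile

# Made-up payload, purely for illustration.
payload = b"statistics\n" * 3

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.gz")

    # gzip.open compresses transparently on write.
    with gzip.open(path, "wb") as f:
        f.write(payload)

    # The built-in open sees the raw, compressed bytes on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # gzip.open decompresses transparently on read.
    with gzip.open(path, "rb") as f:
        roundtrip = f.read()

print(raw[:2] == b"\x1f\x8b")   # gzip magic number at the start of the file
print(roundtrip == payload)     # the round trip recovers the original bytes
```

The two functions share a name, but the module name already tells you which
one cares about gzip files.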

> I don't expect people to know this without being told. Frankly, I don't
> even expect the typical numerically naive user to use statistics.sum when
> it is so much shorter to type "sum". I can provide a better numeric sum,
> but I can't force people to use it. But the statistics module uses it
> extensively; neither the built-in sum nor math.fsum is suitable for my
> purposes, and I wish to expose that functionality to users who are willing
> to use it.
>

I can agree that this shouldn't be a replacement for builtins.sum, but I
think it should be obvious what problem it solves. If you're getting
inaccurate sums, a name like "precise_sum" would point you straight to it.
"statistics.sum" doesn't hint at the differences.
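
To make the accuracy gap concrete, here is a minimal sketch. The built-in
sum and math.fsum are real; precise_sum is only the hypothetical name
suggested above, and its Fraction-based body is my own assumption for
illustration, not how the proposed statistics.sum is implemented:

```python
import math
from fractions import Fraction


def precise_sum(values):
    """Hypothetical helper (name and body are assumptions for this sketch).

    Adds the values exactly as Fractions and rounds to float only once,
    at the very end.
    """
    total = Fraction(0)
    for v in values:
        total += Fraction(v)  # converting a float to Fraction is exact
    return float(total)


data = [0.1] * 10

naive = sum(data)            # rounding error accumulates at every addition
exact = math.fsum(data)      # tracks partial sums to avoid that error
sketch = precise_sum(data)   # exact arithmetic, one rounding at the end

print(naive == 1.0)    # False
print(exact == 1.0)    # True
print(sketch == 1.0)   # True
```

Nothing in the name "sum" tells you which of these behaviours you are
getting; a name along the lines of "precise_sum" at least says what the
function is for.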

To go back to gzip again, you'll know what it does whenever it's relevant.
The same is not true of a miscellaneous "sum" from a "statistics" module.

> And finally, I categorically refuse to call it any variation of
> statistics.ssum or statistics.statistics_sum. We have namespaces for a
> reason. If anyone wishes to start bikeshedding names, I will consider
> reasonable alternatives that don't repeat the name of the namespace.
>

I don't think anyone proposed that. I've proposed "precise_sum" as a
possibility although there's probably a shorter variant somewhere.