[Python-ideas] Pre-PEP: adding a statistics module to Python

Mon Aug 5 04:24:48 CEST 2013

On 05/08/13 12:02, Joshua Landau wrote:

> > > This is not a fair comparison. As a pop quiz, try to imagine the
> > > difference between 'open' and 'gzip.open' - do you immediately come up
> > > with the differences in their functionalities? Now, how about 'sum'
> > > and 'statistics.sum'?
> > >
> >
> > As far as gzip.open goes, I have no idea. Like most people, I expect that
> > there is some difference -- perhaps it only works on gzip files? is the API
> > different in some way? -- but beyond that vague idea that "it is in a
> > different module, therefore it must be different *somehow*" I have no
> > idea how it actually differs from the built-in, or codecs.open. I would
> > have to look them up to find out what the differences actually are.
> >
> > I expect that any even moderately competent user will think the same way:
> > "statistics.sum is in a different module, presumably it is different
> > somehow, I should look it up to find out how".
> >
> As I'd said somewhere earlier, the name should be such that you only have
> to know the name to know whether it's relevant. I don't believe you if you
> say you thought gzip.open had nothing to do with gzip -- you know at least
> that you can ignore it until you're interested in gzip files.

I didn't say that I thought gzip.open had "nothing" to do with gzip. I said I didn't know if it *only* works on gzip files. For all I know without looking it up, it might do the equivalent of:

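import builtins

def open(filename, *args):
    # Hypothetical dispatch: names ending in 'gzip' get special handling.
    if filename.endswith('gzip'):
        ...
    else:
        # Everything else falls through to the built-in open.
        return builtins.open(filename, *args)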

I probably wouldn't write it that way, but I didn't write the gzip module and I can't rule it out without checking the docs or the source.

The point is that any reasonably competent user will expect that there is *some* difference between two otherwise similar names in different namespaces, but it is asking too much to expect the name alone to clue them in on all the differences. Or even *any* of the differences. One might legitimately have artist.draw() and gunslinger.draw() methods, and somebody ignorant of art or Western gunslingers may have no idea what the differences are.

[...]
> I can agree that this shouldn't be a replacement for builtins.sum but I
> don't think that it shouldn't be obvious what solution it solves. If you're
> coming up with inaccurate sums a name like "precise_sum" would be very
> guiding. "statistics.sum" doesn't hint at the differences.

Built-in sum is infinitely precise if you pass it ints or Fractions. math.fsum is also high-precision (although not infinitely so), but it coerces everything to floats. If we're going to insist that the name makes it obvious what problem it solves, we'll end up with a name like

statistics.high_precision_numeric_only_sum_without_coercing_to_float()

which is just ridiculous. Obviously some differences will remain non-obvious. Reading the name is not a substitute for reading the docs.
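
As a rough illustration of the precision point above, here is a small sketch using only the standard library. The sample values and variable names are made up for the example and are not taken from the proposed module; it only contrasts built-in sum with math.fsum.

```python
from fractions import Fraction
import math

# Illustrative data only: values chosen so float rounding is visible.
ints = [10**20, 1, -10**20]
fracs = [Fraction(1, 3)] * 3
floats = [1e20, 1.0, -1e20]

# Built-in sum is exact for ints and Fractions.
int_total = sum(ints)       # 1
frac_total = sum(fracs)     # Fraction(1, 1)

# With floats, built-in sum loses the 1.0 to rounding.
naive_total = sum(floats)   # 0.0

# math.fsum tracks the lost low-order bits, but always returns a float.
fsum_total = math.fsum(floats)        # 1.0
fsum_of_fracs = math.fsum(fracs)      # a float, no longer a Fraction

print(int_total, frac_total, naive_total, fsum_total, type(fsum_of_fracs))
```

None of these differences can be read off the names "sum" and "fsum"; they only show up in the docs or in an experiment like this one.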

> To go back to gzip again, you'll know what it does whenever it's relevant.
> The same is not true of a miscellaneous "sum" from a "statistics" module.

It's the sum you should use when you are doing statistics. If you want to know *why* you should use it rather than built-in sum, read the docs, or ask on python-list@python.org. There's only so much knowledge that can be encoded into a single name.

-- 
Steven