[Python-ideas] statistics module in Python3.4

Fri Jan 31 09:56:13 CET 2014

On Thu, Jan 30, 2014 at 08:58:20PM -0800, Larry Hastings wrote:
> On 01/30/2014 05:27 PM, Steven D'Aprano wrote:
> >I'm hesitant to require two passes over the data in _sum. Some
> >higher-order statistics like variance are currently implemented using
> >two passes, but ultimately I'd like to support single-pass algorithms
> >that can operate on large but finite iterators.
> >
> >But I will consider it as an option.
> >
> >I'm also hesitant to make the promise that _sum will be
> >order-independent. Addition in Python isn't: [...]
> 
> I concede that this is mostly outside my expertise, and the statistics 
> module and the PEP were your doing.  So you're the expert here and I 
> will defer to you.
> 
> But.  My dim understanding of the *whole point* of the new statistics 
> module was that it valued correctness over raw performance.  I assumed 
> sorting values from small to large** before summing was *exactly* the 
> sort of thing it was written to do.  If all we wanted were Python's 
> existing semantics, why bother writing statistics._sum() in the first 
> place?  Just use sum().

_sum doesn't duplicate the semantics of built-in sum(). It is a sort 
of hybrid of sum and math.fsum: like sum, it tries to conserve types 
and give a sensible result when there are mixed types. Like fsum, it 
tries to give higher precision.
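
To illustrate the two behaviours being combined, here is a minimal sketch 
using only the built-in sum and math.fsum (not _sum itself). The sample 
values are made up for the demonstration:

```python
import math
from fractions import Fraction

values = [1e100, 1.0, -1e100]

# Built-in sum adds left to right, so the 1.0 is lost to rounding.
naive = sum(values)

# The same three values in a different order give a different answer.
reordered = sum([1e100, -1e100, 1.0])

# math.fsum tracks the partial sums exactly, but always returns a float.
precise = math.fsum(values)

# Built-in sum conserves exact types such as Fraction.
exact = sum([Fraction(1, 3), Fraction(1, 6)])

print(naive, reordered, precise, exact)
```

Here naive is 0.0 while reordered and precise are both 1.0, which is the 
order dependence mentioned earlier in the thread.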

> On the other hand, I had missed the fact that this was an internal-only 
> method.  If changing statistics._sum so it reordered the iterable to 
> preserve correctness wouldn't change the behavior of any supported 
> external APIs, then obviously there's no need, and I'd prefer to leave 
> it alone for 3.4.

Changes to _sum may be visible, because the external APIs such as mean 
and variance rely on it. For example, an extreme case: if I removed _sum 
and replaced it with math.fsum, then all of the external APIs would 
suddenly start outputting floats and nothing but floats. (I'm not 
intending to do that.)
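
A small sketch of what that visible change would look like, comparing the 
public statistics.mean with a hand-rolled mean built on math.fsum. The 
fsum-based line is only a stand-in for the hypothetical replacement, not 
anything the module does:

```python
import math
import statistics
from decimal import Decimal
from fractions import Fraction

# The public API conserves the input type.
fraction_mean = statistics.mean([Fraction(1, 4), Fraction(3, 4)])
decimal_mean = statistics.mean([Decimal("0.5"), Decimal("1.5")])

# A mean built on math.fsum (hypothetical replacement) always yields a float.
fsum_based = math.fsum([Fraction(1, 4), Fraction(3, 4)]) / 2

print(type(fraction_mean).__name__, type(decimal_mean).__name__,
      type(fsum_based).__name__)
```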

I think that it is asking too much to promise that no statistics 
function will ever change its numeric result. I don't intend for them 
to become *less* accurate, but they might become *more* accurate. For 
example, currently the unit tests for variance pass with an acceptable 
tolerance of 1e-12 (relative error). Perhaps this needs to be 
documented? The random module does something similar:

http://docs.python.org/3/library/random.html#notes-on-reproducibility
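
For code that depends on these results, one way to stay robust against 
such accuracy improvements is to compare within a relative tolerance 
rather than for exact equality. A rough sketch, assuming math.isclose is 
available (it was added in Python 3.5) and using made-up sample data:

```python
import math
import statistics
from fractions import Fraction

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# Variance computed on floats, as most callers would do.
computed = statistics.variance(data)

# Reference value: the same data as exact Fractions gives an exact result.
exact = statistics.variance([Fraction(x) for x in data])

# Relative error of the float result against the exact reference.
relative_error = abs(computed - exact) / exact

# The same check expressed with the 1e-12 relative tolerance.
within_tolerance = math.isclose(computed, exact, rel_tol=1e-12)

print(relative_error, within_tolerance)
```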

> If you decided to change it for 3.5 and people were 
> relying on its old behavior, that would be on them.  (Though a comment 
> saying "I might change this later" would be welcome... if true.)
> 
> 
> >If you're worried about people coming to rely on this, and thus running
> >into trouble in the future if Counters get treated specially for (say)
> >weighted data, then I'd accept a warning in the docs, or even a runtime
> >warning. But not an exception.
> 
> The statistics module isn't marked as provisional.  So the semantics 
> that ship with 3.4 are going to be set in stone.  Changing them later 
> simply won't be an option--that will break code.  If you want to treat 
> Counter objects differently in the future than you do now, then I agree 
> with Wolfgang: the best course of action would be to add an exception 
> now.  But again I'll defer to your judgment about what's best for your 
> module.

Hmmm. Well, that's a much stronger promise of backward compatibility 
than I would have expected. The fact that (say) variance works with a 
dict is a pure accident of implementation, not advertised or promised in 
any way. But I'll accept your ruling. I want to reserve the right to 
add special handling of mappings in the future. In order of preference 
(most preferred first) I'd like to:

1) Put a note in the documentation that handling of mappings is subject 
to change;

2) As above, plus issue a warning with warnings.warn(); or

3) Raise an exception (this one only if you insist).
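
Option 2 could look roughly like the sketch below. The wrapper name, the 
message text and the choice of FutureWarning are all assumptions made for 
illustration; nothing like this exists in the statistics module. It also 
relies on the current accidental behaviour that passing a mapping 
iterates over its keys:

```python
import statistics
import warnings
from collections import Counter
from collections.abc import Mapping


def variance_with_mapping_warning(data):
    """Hypothetical wrapper, not part of the statistics module."""
    if isinstance(data, Mapping):
        # Warn rather than raise, so existing callers keep working.
        warnings.warn(
            "handling of mappings is subject to change",
            FutureWarning,
            stacklevel=2,
        )
    return statistics.variance(data)


# Capture the warning so the demonstration is self-contained.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Only the keys 1, 2, 3 are used; the counts are ignored.
    result = variance_with_mapping_warning(Counter({1: 2, 2: 1, 3: 5}))

print(result, [str(w.message) for w in caught])
```

Option 3 would replace the warnings.warn() call with a raised exception.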

-- 
Steven