[Python-ideas] statistics module in Python3.4
Steven D'Aprano
steve at pearwood.info
Fri Jan 31 09:56:13 CET 2014
On Thu, Jan 30, 2014 at 08:58:20PM -0800, Larry Hastings wrote:
> On 01/30/2014 05:27 PM, Steven D'Aprano wrote:
> >I'm hesitant to require two passes over the data in _sum. Some
> >higher-order statistics like variance are currently implemented using
> >two passes, but ultimately I'd like to support single-pass algorithms
> >that can operate on large but finite iterators.
> >
> >But I will consider it as an option.
> >
> >I'm also hesitant to make the promise that _sum will be
> >order-independent. Addition in Python isn't: [...]
>
> I concede that this is mostly outside my expertise, and the statistics
> module and the PEP were your doing. So you're the expert here and I
> will defer to you.
>
> But. My dim understanding of the *whole point* of the new statistics
> module was that it valued correctness over raw performance. I assumed
> sorting values from small to large before summing was *exactly* the
> sort of thing it was written to do. If all we wanted were Python's
> existing semantics, why bother writing statistics._sum() in the first
> place? Just use sum().
_sum doesn't duplicate the semantics of the built-in sum(). It is a sort
of hybrid of sum and math.fsum: like sum, it tries to preserve types
and give a sensible result when the types are mixed. Like fsum, it
aims for higher precision.
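
To make the difference concrete, here is a small sketch using only public functions (sum, math.fsum and statistics.mean). It does not call the private _sum, and the sample values are my own choice for illustration, so treat it as a demonstration of the behaviour described above rather than of the implementation:

```python
import math
from fractions import Fraction
from statistics import mean

# Built-in sum() adds left to right in floating point, so the result
# depends on the order of the terms.
big_first = [1e100, 1.0, -1e100]
naive = sum(big_first)                   # the 1.0 is absorbed by 1e100
reordered = sum([1e100, -1e100, 1.0])    # same values, different order
precise = math.fsum(big_first)           # tracks the lost low-order bits

# math.fsum always returns a float, whereas statistics.mean keeps the
# input type where it can.
thirds = [Fraction(1, 3), Fraction(2, 3)]
fsum_result = math.fsum(thirds)
mean_result = mean(thirds)

print(naive, reordered, precise)
print(type(fsum_result).__name__, repr(mean_result))
```

Here sum() gives 0.0 or 1.0 depending on the order, fsum gives 1.0 either way, and mean returns a Fraction rather than a float.
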
> On the other hand, I had missed the fact that this was an internal-only
> method. If changing _statistics._sum so it reordered the iterable to
> preserve correctness wouldn't change the behavior of any supported
> external APIs, then obviously there's no need, and I'd prefer to leave
> it alone for 3.4.
Changes to _sum may be visible, because external APIs such as mean
and variance rely on it. To take an extreme case: if I removed _sum
and replaced it with math.fsum, all of the external APIs would
suddenly start returning floats and nothing but floats. (I don't
intend to do that.)
I think that it is asking too much to promise that no statistics
function will ever change its numeric result. I don't intend for them
to become *less* accurate, but they might become *more* accurate. For
example, currently the unit tests for variance pass with an acceptable
tolerance of 1e-12 (relative error). Perhaps this needs to be
documented? The random module does something similar:
http://docs.python.org/3/library/random.html#notes-on-reproducibility
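
As a rough illustration of what "passes within a relative error of 1e-12" means, the sketch below compares statistics.variance against a reference computed exactly with Fractions. This is not the module's actual unit test: the helper exact_sample_variance and the sample data are hypothetical, and math.isclose is assumed to be available (it was added in Python 3.5, so after 3.4).

```python
import math
from fractions import Fraction
from statistics import variance


def exact_sample_variance(data):
    """Hypothetical reference: sample variance in exact rational arithmetic."""
    xs = [Fraction(x) for x in data]   # each float converts exactly
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
reference = float(exact_sample_variance(data))
computed = variance(data)

# Accept any result whose relative error is at most 1e-12.
within_tolerance = math.isclose(computed, reference, rel_tol=1e-12)
print(computed, reference, within_tolerance)
```

A test written this way keeps passing if the result becomes more accurate, and only fails if it drifts outside the stated tolerance.
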
> If you decided to change it for 3.5 and people were
> relying on its old behavior, that would be on them. (Though a comment
> saying "I might change this later" would be welcome... if true.)
>
>
> >If you're worried about people coming to rely on this, and thus running
> >into trouble in the future if Counters get treated specially for (say)
> >weighted data, then I'd accept a warning in the docs, or even a runtime
> >warning. But not an exception.
>
> The statistics module isn't marked as provisional. So the semantics
> that ship with 3.4 are going to be set in stone. Changing them later
> simply won't be an option--that will break code. If you want to treat
> Counter objects differently in the future than you do now, then I agree
> with Wolfgang: the best course of action would be to add an exception
> now. But again I'll defer to your judgment about what's best for your
> module.
Hmmm. Well, that's a much stronger promise of backward compatibility
than I would have expected. The fact that (say) variance works with a
dict is a pure accident of implementation, not advertised or promised in
any way. But I'll accept your ruling. I want to reserve the right to
add special handling of mappings in the future. In order of preference
(most to least preferred) I'd like to:
1) Put a note in the documentation that handling of mappings is subject
to change;
2) As above, plus issue a warning with warnings.warn(); or
3) Raise an exception (this one only if you insist).
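
For what option 2 might look like in practice, here is a minimal sketch. The wrapper name variance_with_mapping_warning, the choice of FutureWarning and the message text are all assumptions of mine for illustration, not anything in the statistics module:

```python
import warnings
from collections.abc import Mapping
from statistics import variance


def variance_with_mapping_warning(data):
    """Hypothetical wrapper: warn, but do not fail, when given a mapping."""
    if isinstance(data, Mapping):
        warnings.warn(
            "handling of mappings is subject to change",
            FutureWarning,
            stacklevel=2,   # attribute the warning to the caller
        )
    # Iterating a dict yields its keys, so this currently computes the
    # variance of the keys: an accident of implementation.
    return variance(data)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = variance_with_mapping_warning({1: "a", 2: "b", 3: "c"})

print(result, [w.category.__name__ for w in caught])
```

Option 3 would replace the warnings.warn() call with a raise, which would break any code that already passes a dict.
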
--
Steven