On Thu, Jan 30, 2014 at 08:58:20PM -0800, Larry Hastings wrote:
On 01/30/2014 05:27 PM, Steven D'Aprano wrote:
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I'd like to support single-pass algorithms that can operate on large but finite iterators.
But I will consider it as an option.
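(For the record, a single-pass variance in the style of Welford's algorithm would look roughly like the sketch below. This is only an illustration of the idea, not the module's implementation.)

    def single_pass_variance(iterable):
        # Welford-style update: maintain a running mean and the sum of
        # squared deviations, so the data is consumed exactly once and
        # never has to be stored or re-read.
        n = 0
        mean = 0.0
        m2 = 0.0
        for x in iterable:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if n < 2:
            raise ValueError("variance requires at least two data points")
        return m2 / (n - 1)   # sample variance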
I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't: [...]
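For example (an illustration of the point, not taken from the elided text), the grouping of float additions changes the result:

    >>> (0.1 + 0.2) + 0.3
    0.6000000000000001
    >>> 0.1 + (0.2 + 0.3)
    0.6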
I concede that this is mostly outside my expertise, and the statistics module and the PEP were your doing. So you're the expert here and I will defer to you.
But. My dim understanding of the *whole point* of the new statistics module was that it valued correctness over raw performance. I assumed sorting values from small to large before summing was *exactly* the sort of thing it was written to do. If all we wanted were Python's existing semantics, why bother writing statistics._sum() in the first place? Just use sum().
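To make the ordering point concrete (an illustration, not from the original message): small values added one at a time to an already-large running total can be lost outright, whereas summing them first keeps them:

    >>> sum([1e16] + [1.0] * 10)           # large value first: the 1.0s vanish
    1e+16
    >>> sum(sorted([1e16] + [1.0] * 10))   # small values first
    1.000000000000001e+16
    >>> import math
    >>> math.fsum([1e16] + [1.0] * 10)     # correctly rounded true sum
    1.000000000000001e+16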
_sum doesn't duplicate the semantics of built-in sum(). It is sort of a hybrid of sum and math.fsum: like sum, it tries to conserve types, and give a sensible result when there are mixed types. Like fsum, it tries to be higher precision.
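Roughly (an illustration of the trade-off, not code from the module): built-in sum() keeps exact types but can lose precision with floats, while math.fsum() is precise but always returns a float.

    >>> from fractions import Fraction
    >>> import math
    >>> sum([Fraction(1, 4)] * 4)          # built-in sum conserves the type
    Fraction(1, 1)
    >>> math.fsum([Fraction(1, 4)] * 4)    # fsum always returns a float
    1.0
    >>> sum([0.1] * 10)                    # sum accumulates rounding error
    0.9999999999999999
    >>> math.fsum([0.1] * 10)              # fsum does not
    1.0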
On the other hand, I had missed the fact that this was an internal-only method. If changing statistics._sum so it reordered the iterable to preserve correctness wouldn't change the behavior of any supported external APIs, then obviously there's no need, and I'd prefer to leave it alone for 3.4.
Changes to _sum may be visible, because the external APIs such as mean and variance rely on it. For example, an extreme case: if I removed _sum and replaced it with math.fsum, then all of the external APIs would suddenly start outputting floats and nothing but floats. (I'm not intending to do that.)

I think that it is asking too much to promise that no statistics function will ever change its numeric result. I don't intend for them to become *less* accurate, but they might become *more* accurate. For example, currently the unit tests for variance pass with an acceptable tolerance of 1e-12 (relative error).

Perhaps this needs to be documented? The random module does something similar: http://docs.python.org/3/library/random.html#notes-on-reproducibility
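A relative-error check of that kind amounts to something like the following sketch; the helper name is illustrative, not the actual test code:

    def assert_rel_close(actual, expected, rel_tol=1e-12):
        # Accept the result if it differs from the expected value by no
        # more than rel_tol, measured relative to the expected value.
        assert abs(actual - expected) <= rel_tol * abs(expected)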
If you decided to change it for 3.5 and people were relying on its old behavior, that would be on them. (Though a comment saying "I might change this later" would be welcome... if true.)
If you're worried about people coming to rely on this, and thus running into trouble in the future if Counters get treated specially for (say) weighted data, then I'd accept a warning in the docs, or even a runtime warning. But not an exception.
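A runtime warning along those lines might look roughly like this (a hypothetical sketch; the function name and its placement are illustrative, not the module's actual code):

    import warnings

    def _warn_if_mapping(data):
        # Hypothetical guard: mappings such as dict and Counter happen to
        # work today, but that behaviour is not promised and may change.
        if isinstance(data, dict):
            warnings.warn(
                "passing a mapping (e.g. collections.Counter) to the "
                "statistics functions is not guaranteed and may change "
                "in a future version",
                FutureWarning,
                stacklevel=3,
            )
        return data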
The statistics module isn't marked as provisional. So the semantics that ship with 3.4 are going to be set in stone. Changing them later simply won't be an option--that will break code. If you want to treat Counter objects differently in the future than you do now, then I agree with Wolfgang: the best course of action would be to add an exception now. But again I'll defer to your judgment about what's best for your module.
Hmmm. Well, that's a much stronger promise of backward compatibility than I would have expected. The fact that (say) variance works with a dict is a pure accident of implementation, not advertised or promised in any way. But I'll accept your ruling. I want to reserve the right to add special handling of mappings in the future. In order of preference (highest to least) I'd like to:

1) Put a note in the documentation that handling of mappings is subject to change;

2) As above, plus issue a runtime warning via warnings.warn(); or

3) Raise an exception (this one only if you insist).

-- Steven