Re: [Python-ideas] statistics module in Python3.4
On Thu, Jan 30, 2014 at 11:03:38AM -0800, Larry Hastings wrote:
On Mon, Jan 27, 2014 at 9:41 AM, Wolfgang <wolfgang.maier@biologie.uni-freiburg.de> wrote:
I think a much cleaner (and probably faster) implementation would be to gather first all the types in the input sequence, then decide what to return in an input order independent way.
I'm willing to consider this a "bug fix". And since it's a new function in 3.4, we don't have an installed base. So I'm willing to consider fixing this for 3.4.
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I'd like to support single-pass algorithms that can operate on large but finite iterators. But I will consider it as an option.

I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't:

py> class A(int):
...     def __add__(self, other):
...         return type(self)(super().__add__(other))
...     def __repr__(self):
...         return "%s(%d)" % (type(self).__name__, self)
...
py> class B(A):
...     pass
...
py> A(1) + B(1)
A(2)
py> B(1) + A(1)
B(2)

[...]
Yes, exactly. If the support for Counter is half-baked, let's prevent it from being used now.
I strongly disagree with this. Counters are currently treated the same as any other iterable, and built-in sum and math.fsum don't treat them specially:

py> from collections import Counter
py> c = Counter([1, 1, 1, 1, 1, 2])
py> c
Counter({1: 5, 2: 1})
py> sum(c)
3
py> from math import fsum
py> fsum(c)
3.0

If you're worried about people coming to rely on this, and thus running into trouble in the future if Counters get treated specially for (say) weighted data, then I'd accept a warning in the docs, or even a runtime warning. But not an exception.

-- Steven
On 01/30/2014 05:27 PM, Steven D'Aprano wrote:
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I'd like to support single-pass algorithms that can operate on large but finite iterators.
But I will consider it as an option.
I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't: [...]
I concede that this is mostly outside my expertise, and the statistics module and the PEP were your doing. So you're the expert here and I will defer to you.

But. My dim understanding of the *whole point* of the new statistics module was that it valued correctness over raw performance. I assumed sorting values from small to large** before summing was *exactly* the sort of thing it was written to do. If all we wanted were Python's existing semantics, why bother writing statistics._sum() in the first place? Just use sum().

On the other hand, I had missed the fact that this was an internal-only method. If changing statistics._sum so it reordered the iterable to preserve correctness wouldn't change the behavior of any supported external APIs, then obviously there's no need, and I'd prefer to leave it alone for 3.4. If you decided to change it for 3.5 and people were relying on its old behavior, that would be on them. (Though a comment saying "I might change this later" would be welcome... if true.)
If you're worried about people coming to rely on this, and thus running into trouble in the future if Counters get treated specially for (say) weighted data, then I'd accept a warning in the docs, or even a runtime warning. But not an exception.
The statistics module isn't marked as provisional. So the semantics that ship with 3.4 are going to be set in stone. Changing them later simply won't be an option--that will break code. If you want to treat Counter objects differently in the future than you do now, then I agree with Wolfgang: the best course of action would be to add an exception now. But again I'll defer to your judgment about what's best for your module.

//arry/

** Or high-precision to low-precision. You know what I mean, the classic "if you add large numbers first you throw away precision and can wind up with a different result" thing.
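The classic effect Larry's footnote describes is easy to reproduce (a minimal illustration, not from the original thread):

py> xs = [1e16, 1.0, 1.0, -1e16]
py> sum(xs)    # the 1.0s are absorbed by the large value first
0.0
py> sum([1.0, 1.0, 1e16, -1e16])    # same values, different order
2.0
py> from math import fsum
py> fsum(xs)    # exact summation gives the same answer in any order
2.0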
On 1/30/2014 11:58 PM, Larry Hastings wrote:
The statistics module isn't marked as provisional.
Perhaps it should be, at least with respect to sums of mixed types and use of Counters.
So the semantics that ship with 3.4 are going to be set in stone.
Given the discussion here and previously, that seems premature. -- Terry Jan Reedy
On Thu, Jan 30, 2014 at 08:58:20PM -0800, Larry Hastings wrote:
On 01/30/2014 05:27 PM, Steven D'Aprano wrote:
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I'd like to support single-pass algorithms that can operate on large but finite iterators.
But I will consider it as an option.
I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't: [...]
I concede that this is mostly outside my expertise, and the statistics module and the PEP were your doing. So you're the expert here and I will defer to you.
But. My dim understanding of the *whole point* of the new statistics module was that it valued correctness over raw performance. I assumed sorting values from small to large** before summing was *exactly* the sort of thing it was written to do. If all we wanted were Python's existing semantics, why bother writing statistics._sum() in the first place? Just use sum().
_sum doesn't duplicate the semantics of built-in sum(). It is sort of a hybrid of sum and math.fsum: like sum, it tries to conserve types, and give a sensible result when there are mixed types. Like fsum, it tries to be higher precision.
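To spell out that comparison (an illustration, not code from the thread):

py> from fractions import Fraction
py> from math import fsum
py> sum([Fraction(1, 3)] * 3)    # sum conserves the input type
Fraction(1, 1)
py> fsum([Fraction(1, 3)] * 3)   # fsum always returns a float
1.0
py> sum([0.1] * 10)              # but sum accumulates rounding error
0.9999999999999999
py> fsum([0.1] * 10)             # while fsum sums with high precision
1.0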
On the other hand, I had missed the fact that this was an internal-only method. If changing statistics._sum so it reordered the iterable to preserve correctness wouldn't change the behavior of any supported external APIs, then obviously there's no need, and I'd prefer to leave it alone for 3.4.
Changes to _sum may be visible, because the external APIs such as mean and variance rely on it. For example, an extreme case: if I removed _sum and replaced it with math.fsum, then all of the external APIs will suddenly start outputting floats and nothing but floats. (I'm not intending to do that.)

I think that it is asking too much to promise that no statistics function will ever change its numeric result. I don't intend for them to become *less* accurate, but they might become *more* accurate. For example, currently the unit tests for variance pass with an acceptable tolerance of 1e-12 (relative error). Perhaps this needs to be documented? The random module does something similar: http://docs.python.org/3/library/random.html#notes-on-reproducibility
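For instance (an illustration of the point, not from the thread), mean currently preserves Decimal inputs, which a straight fsum-based implementation could not:

py> from decimal import Decimal
py> from statistics import mean
py> mean([Decimal("1.5"), Decimal("2.5")])   # Decimal in, Decimal out
Decimal('2')
py> from math import fsum
py> fsum([Decimal("1.5"), Decimal("2.5")])   # fsum would force floats
4.0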
If you decided to change it for 3.5 and people were relying on its old behavior, that would be on them. (Though a comment saying "I might change this later" would be welcome... if true.)
If you're worried about people coming to rely on this, and thus running into trouble in the future if Counters get treated specially for (say) weighted data, then I'd accept a warning in the docs, or even a runtime warning. But not an exception.
The statistics module isn't marked as provisional. So the semantics that ship with 3.4 are going to be set in stone. Changing them later simply won't be an option--that will break code. If you want to treat Counter objects differently in the future than you do now, then I agree with Wolfgang: the best course of action would be to add an exception now. But again I'll defer to your judgment about what's best for your module.
Hmmm. Well, that's a much stronger promise of backward compatibility than I would have expected. The fact that (say) variance works with a dict is a pure accident of implementation, not advertised or promised in any way. But I'll accept your ruling. I want to reserve the right to add special handling of mappings in the future. In order of preference (highest to least) I'd like to:

1) Put a note in the documentation that handling of mappings is subject to change;

2) As above, plus raise warnings.warn(); or

3) Raise an exception (this one only if you insist).

-- Steven
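Option 2) might look something like this (a sketch only; _check_mapping is a hypothetical helper, not actual module code):

py> import warnings
py> from collections.abc import Mapping
py> def _check_mapping(data):
...     # hypothetical guard at the top of each statistics function
...     if isinstance(data, Mapping):
...         warnings.warn("handling of mappings is subject to change",
...                       FutureWarning)
...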
Steven D'Aprano <steve@...> writes:
On Thu, Jan 30, 2014 at 08:58:20PM -0800, Larry Hastings wrote:
The statistics module isn't marked as provisional. So the semantics that ship with 3.4 are going to be set in stone. Changing them later simply won't be an option--that will break code. If you want to treat Counter objects differently in the future than you do now, then I agree with Wolfgang: the best course of action would be to add an exception now. But again I'll defer to your judgment about what's best for your module.
Hmmm. Well, that's a much stronger promise of backward compatibility than I would have expected. The fact that (say) variance works with a dict is a pure accident of implementation, not advertised or promised in any way. But I'll accept your ruling. I want to reserve the right to add special handling of mappings in the future. In order of preference (highest to least) I'd like to:
1) Put a note in the documentation that handling of mappings is subject to change;
2) As above, plus raise warnings.warn(); or
3) Raise an exception (this one only if you insist).
I thought about this further and, yes, I guess at least point 1) is essential; even if that means marking the module as provisional, which is a bit sad, it would be worth it. Mappings may be an excellent way of specifying frequencies and weights in an elegant way. You could use them to calculate weighted means and variances, and even to specify variable interval widths for median_grouped to calculate a weighted median as defined here: http://en.wikipedia.org/wiki/Weighted_median

Most of this is quite easy to code, I guess, and it would be a pity to deprive yourself of this possibility because people start passing mappings now and start relying on iteration happening over keys only. I agree with Larry that once this happens, it will be hard to change the behavior even in 3.5.

Best,
Wolfgang
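As an illustration of the kind of API this would enable (a sketch only; the weighted_mean helper is hypothetical, not a proposal from the module):

py> from collections import Counter
py> def weighted_mean(freqs):
...     # treat mapping values as frequencies/weights
...     return sum(v * w for v, w in freqs.items()) / sum(freqs.values())
...
py> c = Counter([1, 1, 1, 1, 1, 2])
py> weighted_mean(c)    # (1*5 + 2*1) / 6
1.1666666666666667
py> from statistics import mean
py> mean(c)             # today: iterates over the two distinct keys only
1.5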
Wolfgang Maier wrote:
Mappings may be an excellent way of specifying frequencies and weights in an elegant way.
That may be, but I think I'd rather have a separate function, or a mode argument, explicitly indicating that this is what you want. Detecting whether something is a mapping in a duck-typed way is dodgy in general. -- Greg
Greg Ewing <greg.ewing@...> writes:
Wolfgang Maier wrote:
Mappings may be an excellent way of specifying frequencies and weights in an elegant way.
That may be, but I think I'd rather have a separate function, or a mode argument, explicitly indicating that this is what you want. Detecting whether something is a mapping in a duck-typed way is dodgy in general.
There should be nothing dodgy about this with the abstract base class Mapping. Best, Wolfgang
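Concretely (a minimal illustration of Wolfgang's point, not from the thread):

py> from collections.abc import Mapping
py> from collections import Counter
py> isinstance(Counter(), Mapping)
True
py> isinstance({}, Mapping)
True
py> isinstance([1, 2, 3], Mapping)
False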
On 1 February 2014 20:10, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
Greg Ewing <greg.ewing@...> writes:
Wolfgang Maier wrote:
Mappings may be an excellent way of specifying frequencies and weights in an elegant way.
That may be, but I think I'd rather have a separate function, or a mode argument, explicitly indicating that this is what you want. Detecting whether something is a mapping in a duck-typed way is dodgy in general.
There should be nothing dodgy about this with the abstract base class Mapping.
I agree with Greg about this. I dislike APIs that try to be too clever about accepting different types of input. The mode() function is clearly intended to accept an iterable, not a Counter, and I consider it a bug that it does. It can be fixed to treat a Counter as an iterable by changing the _counts function from collections.Counter(data) to collections.Counter(iter(data)). If you want the option for doing statistics with Counter-style data formats then they should be invoked explicitly, as Greg says.

Oscar
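The difference the iter() call makes (an illustration, not from the thread):

py> from collections import Counter
py> c = Counter([1, 1, 1, 1, 1, 2])
py> Counter(c)          # a mapping argument: the counts are copied
Counter({1: 5, 2: 1})
py> Counter(iter(c))    # an iterator argument: each key is counted once
Counter({1: 1, 2: 1})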
Steven D'Aprano <steve@...> writes:
On Thu, Jan 30, 2014 at 11:03:38AM -0800, Larry Hastings wrote:
On Mon, Jan 27, 2014 at 9:41 AM, Wolfgang <wolfgang.maier@...> wrote:
I think a much cleaner (and probably faster) implementation would be to gather first all the types in the input sequence, then decide what to return in an input order independent way.
I'm willing to consider this a "bug fix". And since it's a new function in 3.4, we don't have an installed base. So I'm willing to consider fixing this for 3.4.
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I'd like to support single-pass algorithms that can operate on large but finite iterators.
But I will consider it as an option.
I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't:
py> class A(int):
...     def __add__(self, other):
...         return type(self)(super().__add__(other))
...     def __repr__(self):
...         return "%s(%d)" % (type(self).__name__, self)
...
py> class B(A):
...     pass
...
py> A(1) + B(1)
A(2)
py> B(1) + A(1)
B(2)
Hi Steven,

first of all let me say that I am quite amazed by the extent of the discussion that is going on now. All I really meant is that there are two special cases (mixed types in _sum and Counters as input to some functions) that I find worth reconsidering in an otherwise really useful module.

Regarding your comments above and in other posts: I never proposed two passes over the data. My implementation (below again because many people seem to have missed it in my first rather long post) gathers the input types in a set **while** calculating the sum in a single for loop. It then calls _coerce_types, passing this set only once:

def _sum2(data, start=None):
    if start is not None:
        t = set((type(start),))
        n, d = _exact_ratio(start)
    else:
        t = set()
        n = 0
        d = 1
    partials = {d: n}  # map {denominator: sum of numerators}

    # Micro-optimizations.
    exact_ratio = _exact_ratio
    partials_get = partials.get

    # Add numerators for each denominator, and
    # build up a set of all types.
    for x in data:
        t.add(type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n

    # decide which type to use based on set of all types
    T = _coerce_types(t)
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator) / total.denominator
    return T(total)

As for my tentative implementation of _coerce_types, it was really meant as an example. Specifically, I said:
Personally, I'd prefer something as simple as possible, maybe even:
def _coerce_types(types):
    if len(types) == 1:
        return next(iter(types))
    return float
, but that's just a suggestion.
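For illustration (a sketch assuming the module's internal _exact_ratio and the simple _coerce_types above), this makes the return type independent of input order:

py> _sum2([1, 1, 2.5])    # types seen: {int, float} -> float
4.5
py> _sum2([2.5, 1, 1])    # same set of types, so the same result type
4.5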
It is totally up to you to come up with something more along the lines of your original, but I still think that making the behavior order-independent comes at no performance cost (if not a gain) and will make _sum's return type more predictable.

When I said the current behavior was confusing, I didn't mean "not logical" or anything. The current rules are in fact very precisely worked out, I just think they are too complicated to think them through every time. You are right of course with your remark that addition in Python is also order-dependent regarding the returned type, but in my opinion this is not the point here.

You are emphasizing that _sum is a private function of the module, but mean is a public one and the behavior of mean is dictated by that of _sum. Now when I call the mean function, then, of course, I know that this will very most likely be implemented as adding all values then dividing by their number, but in terms of encapsulation principles I shouldn't be forced to think about this to understand the return value of the function. In other words, it doesn't help here that _sum reflects the behavior of __add__; all you should care about is that the behavior of mean() is simple to explain and understand.

Again, this is just an opinion of somebody interested in having this particular module well-designed from the beginning before things are set in stone.

Best wishes,
Wolfgang
On Fri, Jan 31, 2014 at 08:57:26AM +0000, Wolfgang Maier wrote:
Steven D'Aprano <steve@...> writes:
I'm hesitant to require two passes over the data in _sum. [...]

Regarding your comments above and in other posts: I never proposed two passes over the data. My implementation (below again because many people seem to have missed it in my first rather long post) gathers the input types in a set **while** calculating the sum in a single for loop. It then calls _coerce_types, passing this set only once:
Ah! I did miss it -- I just skimmed your implementation, and completely failed to realise the point you were making. Thank you for the correction. -- Steven
On 31/01/2014 08:57, Wolfgang Maier wrote:
Hi Steven, first of all let me say that I am quite amazed by the extent of the discussion that is going on now. All I really meant is that there are two special cases (mixed types in _sum and Counters as input to some functions) that I find worth reconsidering in an otherwise really useful module.
Thanks for starting what I see as a very healthy debate that in the longer term is highly likely to make for a better statistics module. What more could a user ask for? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence