On Thu, Jan 30, 2014 at 11:03:38AM -0800, Larry Hastings wrote:
> On Mon, Jan 27, 2014 at 9:41 AM, Wolfgang
> <wolfgang.maier(a)biologie.uni-freiburg.de> wrote:
> >I think a much cleaner (and probably faster) implementation would be
> >to gather first all the types in the input sequence, then decide what
> >to return in an input order independent way.
>
> I'm willing to consider this a "bug fix". And since it's a new function
> in 3.4, we don't have an installed base. So I'm willing to consider
> fixing this for 3.4.
I'm hesitant to require two passes over the data in _sum. Some
higher-order statistics like variance are currently implemented using
two passes, but ultimately I'd like to support single-pass algorithms
that can operate on large but finite iterators.
But I will consider it as an option.
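To illustrate what a single-pass algorithm over an iterator could look like, here is a minimal sketch that uses exact Fraction arithmetic and the textbook computational formula for the sample variance. The name one_pass_variance is hypothetical, and this is not how the statistics module computes variance; it also assumes every value is finite, since Fraction() rejects NaN and infinities.

```python
from fractions import Fraction


def one_pass_variance(data):
    """Sample variance from a single sweep over an iterable of finite numbers."""
    n = 0
    total = Fraction(0)     # running sum of x
    total_sq = Fraction(0)  # running sum of x**2
    for x in data:
        x = Fraction(x)     # exact conversion, so no rounding error accumulates
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("variance requires at least two data points")
    # Computational formula: (sum(x^2) - sum(x)^2 / n) / (n - 1)
    return (total_sq - total * total / n) / (n - 1)


# Works on a one-shot iterator; the result is an exact Fraction.
print(one_pass_variance(iter([1, 2, 3, 4])))
```

The result stays a Fraction here; deciding which type to convert it back to is a separate question.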
I'm also hesitant to make the promise that _sum will be
order-independent. Addition in Python isn't:
py> class A(int):
...     def __add__(self, other):
...         return type(self)(super().__add__(other))
...     def __repr__(self):
...         return "%s(%d)" % (type(self).__name__, self)
...
py> class B(A):
...     pass
...
py> A(1) + B(1)
A(2)
py> B(1) + A(1)
B(2)
[...]
> Yes, exactly. If the support for Counter is half-baked, let's prevent
> it from being used now.
I strongly disagree with this. Counters are currently treated the same
as any other iterable, and built-in sum and math.fsum don't treat them
specially:
py> from collections import Counter
py> c = Counter([1, 1, 1, 1, 1, 2])
py> c
Counter({1: 5, 2: 1})
py> sum(c)
3
py> from math import fsum
py> fsum(c)
3.0
If you're worried about people coming to rely on this, and thus running
into trouble in the future if Counters get treated specially for (say)
weighted data, then I'd accept a warning in the docs, or even a runtime
warning. But not an exception.
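A runtime warning of that kind might look roughly like the sketch below. The helper name checked_data and the message text are made up for illustration; nothing like this exists in the statistics module.

```python
import warnings
from collections import Counter


def checked_data(data):
    """Return data unchanged, warning (not raising) if it is a Counter."""
    if isinstance(data, Counter):
        warnings.warn(
            "Counter is iterated over its keys; counts are ignored",
            RuntimeWarning,
            stacklevel=2,
        )
    return data


c = Counter([1, 1, 1, 1, 1, 2])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = sum(checked_data(c))  # still sums the keys 1 and 2
print(result, len(caught))
```

Existing code keeps working and still gets the same answer, but the caller is told that the counts were not used as weights.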
--
Steven
Oscar Benjamin <oscar.j.benjamin@...> writes:
Hi Oscar,
and thanks for this very detailed post.
>
> You're making this sound a lot more complicated than it is. The
> problem is simple: Decimal doesn't integrate with the numeric tower.
> This is explicit in the PEP that brought in the numeric tower:
> http://www.python.org/dev/peps/pep-3141/#the-decimal-type
>
You're perfectly right about this as far as built-in number types and the
standard library types Fraction and Decimal are concerned.
> That being said I think that guaranteeing an error is
> better than the current order-dependent behaviour (and agree that that
> should be considered a bug).
>
For custom types, the type returned by _sum can also be order-dependent due
to this part in _coerce_types:
def _coerce_types(T1, T2):
    [..]
    if issubclass(T2, float): return T2
    if issubclass(T1, float): return T1
    # Subclasses of the same base class give priority to the second.
    if T1.__base__ is T2.__base__: return T2
I chose the more drastic example with Fraction and Decimal for my initial
post because there the difference is between a result and an error, but the
above may illustrate better why I said that the returned type of _sum is
hard to predict.
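To make the order dependence concrete, here is a small sketch that applies only the three rules quoted above. The function coerce_excerpt and the classes Metres and Feet are hypothetical; the elided part of the real function is deliberately left out, so the real _coerce_types may behave differently for other inputs.

```python
def coerce_excerpt(T1, T2):
    """Apply only the three quoted rules; everything elided is omitted."""
    if issubclass(T2, float): return T2
    if issubclass(T1, float): return T1
    # Subclasses of the same base class give priority to the second.
    if T1.__base__ is T2.__base__: return T2
    raise TypeError("not covered by this excerpt")


class Metres(float):
    pass


class Feet(float):
    pass


# Swapping the arguments swaps the chosen type.
print(coerce_excerpt(Metres, Feet).__name__)
print(coerce_excerpt(Feet, Metres).__name__)
```

With two sibling float subclasses, whichever type comes second wins, so the type of the result depends on the order of the input.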
> If there is to be a more drastic rearrangement of the _sum function
> then it should actually be to solve the problem that the current
> implementation of mean, variance etc. uses Fractions for all the heavy
> lifting but then rounds in the wrong place (when returning from
> _sum()) rather than in the mean, variance function itself.
>
This is an excellent remark and I agree absolutely with your point here.
It's one of the aspects of the statistics module that I pondered over for
weeks.
Essentially, the fact that all current functions that rely on _sum do round
imprecisely anyway was my motivation for suggesting the simple:
def _coerce_types(types):
    if len(types) == 1:
        return next(iter(types))
    return float
because it certainly makes sense to return the type found in the input if
there is only one, but with ambiguity, why make the effort of guessing when
it does not help precision anyway? However, I realized that I probably
rushed this because the implementation of functions that call _sum may
change later to rely on an exact return value.
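As a sketch of how that set-based rule would be used, assuming the types are gathered into a set first, the snippet below shows that the chosen type no longer depends on input order. The names coerce_types and result_type are hypothetical and not part of the statistics module.

```python
from decimal import Decimal
from fractions import Fraction


def coerce_types(types):
    # Same rule as the proposal above: a single type wins, otherwise float.
    if len(types) == 1:
        return next(iter(types))
    return float


def result_type(data):
    """Gather every element type into a set, then decide once."""
    return coerce_types({type(x) for x in data})


print(result_type([Decimal(1), Decimal(2)]).__name__)      # one type: kept
print(result_type([Fraction(1, 2), Decimal(1)]).__name__)  # mixed: float
print(result_type([Decimal(1), Fraction(1, 2)]).__name__)  # same after reordering
```

Because a set has no order, the two mixed inputs necessarily give the same answer.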
> The clever algorithm in the variance function (unless it changed since
> I last looked) is entirely unnecessary when all of the intensive
> computation is performed with exact arithmetic. In the absence of
> rounding error you could compute a perfectly good variance using the
> computational formula for variance in a single pass. Similarly
> although the _sum() function is correctly rounded, the mean() function
> calls _sum() and then rounds again so that the return value from
> mean() is rounded twice. _sum() computes an exact value as a fraction
> and then coerces it with
>
> return T(total_numerator) / total_denominator
>
> so that the division causes it to be correctly rounded. However the
> mean function effectively ends up doing
>
> return (T(total_numerator) / total_denominator) / num_items
>
> which uses 2 divisions and hence rounds twice. It's trivial to
> rearrange that so that you round once
>
> return T(total_numerator) / (total_denominator * num_items)
>
> except that to do this the _sum function should be changed to return
> the exact result as a Fraction (and perhaps the type T). Similar
> changes would need to be made to the sum of squares function (_ss()
> IIRC). The double rounding in mean() isn't a big deal but the
> corresponding effect for the variance functions is significant. It was
> after realising this that the sum function was renamed _sum and made
> nominally private.
>
I have been thinking about this solution as well, but I think you really
have to return a tuple of the sum as a Fraction and the type (not perhaps)
since it would be really weird if the public functions in statistics always
return a Fraction even if the input sequence consisted of only one standard
type like int, float or Decimal. The obvious criticism then is that such a
_sum is not really a sum function anymore like the existing ones. Then
again, since this is a module private function it may be ok to do this?
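A rough sketch of that tuple-returning design follows, with hypothetical names exact_sum and mean (not the module's actual code). It assumes finite int, float, Fraction or Decimal inputs and falls back to float for mixed types; the final line mirrors the rearranged expression quoted above.

```python
from decimal import Decimal
from fractions import Fraction


def exact_sum(data):
    """Return (T, total, n): result type, exact sum as a Fraction, item count."""
    types = set()
    total = Fraction(0)
    n = 0
    for x in data:
        types.add(type(x))
        total += Fraction(x)  # exact for int, float, Fraction and Decimal
        n += 1
    T = next(iter(types)) if len(types) == 1 else float
    return T, total, n


def mean(data):
    T, total, n = exact_sum(data)
    if n == 0:
        raise ValueError("mean requires at least one data point")
    # A single division in the target type instead of two.
    return T(total.numerator) / (total.denominator * n)


print(mean([Decimal("1"), Decimal("2")]))
print(mean([Fraction(1, 2), Fraction(1, 4)]))
```

The caller gets back the input type, while all intermediate arithmetic stays exact.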
Best,
Wolfgang
On 1 February 2014 03:42, Chris Angelico <rosuav(a)gmail.com> wrote:
> I wouldn't withdraw my comment, because I still stand by it. If you
> genuinely meant no specifics, then when someone pointed out how they
> interpreted your statement, you would have apologized and made a
> correction: "I didn't mean anyone in particular, I meant the way
> there've been 50 issues reopened unnecessarily by 30 different people
> lately", or something. But that wouldn't be true, would it? You really
> did mean Anatoly, and that's why you said what you did. Believe you
> me, I know more than you think I do. Think of Emma from "Once Upon A
> Time" if you like - a strong ability to detect lying, based on a
> metric ton of experience with it.
Chris, while Mark's behaviour has been out of line recently, that
isn't anywhere near adequate justification for suggesting (even by
implication) that another list participant is lying about their health
status or their motives. It is impossible to diagnose *anyone*
accurately over the internet - we can only give them the benefit of
the doubt, take their word for it, and judge the outcome by whether
they appear to be making genuine efforts to improve their behaviour,
rather than assuming that everyone is starting from an identical
baseline of expectations and capabilities in relation to civil
discourse (especially once cultural variations are taken into
account).
Mark hasn't been trying to use his diagnosis as a get-out-of-jail-free
card - he has been working with other members of the community on his
coping strategies for dealing with mailing list discussions, and
curbing his impulse to respond to poorly thought out ideas with
unconstructive sarcasm.
Now, I suggested to Mark that he consider asking the moderators to set
his moderator flag for the time being, but he has instead chosen to
step away from the core development lists entirely.
While we *do* try to be inclusive of everyone, the thing that *will*
get someone moderated, suspended and perhaps eventually banned
entirely, is a consistent *pattern* of inappropriate behaviour, with
no indication of genuine attempts to eliminate that behaviour (or even
to understand why it is inappropriate).
So if something seems out of line, *please* contact the list
moderators (via python-ideas-owner(a)python.org), rather than
retaliating directly on the list. If replying directly on the list,
please try to assume temporary stress rather than persistent malice or
obstinance on the part of the other poster in the absence of an
extended history of interacting with them.
Regards,
Nick.
--
Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia