Heterogeneous numeric data in statistics library

Users of the statistics module, how often do you use it with heterogeneous data (mixed numeric types)? Currently most of the functions try hard to honour homogeneous data, e.g. if your data is Decimal or Fraction, you will (usually) get Decimal or Fraction results:
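As an illustrative sketch of the homogeneous case (made-up data, not necessarily the example from the original post), `mean` and `variance` called on all-Decimal or all-Fraction data return a result of the same type:

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean, variance

# Illustrative data only; the values are chosen to give exact results.
decimal_data = [Decimal("0.5"), Decimal("1.5")]
fraction_data = [Fraction(1, 2), Fraction(3, 2)]

decimal_mean = mean(decimal_data)          # Decimal in, Decimal out
fraction_mean = mean(fraction_data)        # Fraction in, Fraction out
fraction_var = variance(fraction_data)     # sample variance, still a Fraction

print(type(decimal_mean).__name__, decimal_mean)
print(type(fraction_mean).__name__, fraction_mean)
print(type(fraction_var).__name__, fraction_var)
```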
With mixed types, the functions usually try to coerce the values into a sensible common type, honouring subclasses, but that's harder than you might expect and the extra complexity imposes significant performance costs. And not all combinations are supported (Decimal is particularly difficult).

If you are a user of statistics, how important to you is the ability to **mix** numeric types, in the same data set? Which combinations do you care about?

Would you be satisfied with a rule that said that the statistics functions expect homogeneous data and that the result of calling the functions on mixed types is not guaranteed?

-- Steve
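A minimal sketch of the mixed-type behaviour described above, assuming current CPython. The `MyFloat` class and the data values are hypothetical and exist only for illustration:

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean


class MyFloat(float):
    """Hypothetical float subclass, used only to show subclass handling."""


# int mixed with Fraction: the int is absorbed and the result is a Fraction.
int_and_fraction = mean([1, Fraction(1, 2)])

# float mixed with a float subclass: the subclass is honoured.
float_and_subclass = mean([1.5, MyFloat(2.5)])

print(type(int_and_fraction).__name__, int_and_fraction)
print(type(float_and_subclass).__name__, float(float_and_subclass))

# Not every combination is supported. Mixing float with Decimal is expected
# to raise TypeError in current CPython; that is an observation about the
# present implementation, not a documented guarantee.
decimal_mix_error = None
try:
    mean([1.5, Decimal("2.5")])
except TypeError as exc:
    decimal_mix_error = exc
print("float + Decimal:", decimal_mix_error)
```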

Hi Steve

Today's XKCD is on 'Selection Bias' and it is set in a statistics conference: https://xkcd.com/2618/

According to its PEP the statistics module provides "common statistics functions such as mean, median, variance and standard deviation".

You ask "if you are a user of statistics, how important to you is the ability to **mix** numeric types, in the same data set". Asking your question here introduces a selection bias. To compensate for this, I suggest you also ask a similar question elsewhere, such as comp.lang.python and at some forum for Python in education.

Do look at https://xkcd.com/2618/, it makes the point quite nicely.

-- Jonathan

IMHO, mixing custom types in this context is usually not required, as long as at least an int-to-anything-else typecast is possible. Currently the conversion happens only when there is at least one non-int or when the result can't be represented as an int, that is:
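To make these cases concrete, here is an illustrative sketch with made-up values (not necessarily the examples from the original message), assuming the behaviour of `statistics.mean` in current CPython:

```python
from fractions import Fraction
from statistics import mean

all_int_exact = mean([2, 4])                 # ints only, result fits in an int -> 3
all_int_inexact = mean([2, 5])               # ints only, result needs a float -> 3.5
int_and_float = mean([2, 4.0])               # one float present -> 3.0
int_and_fraction = mean([2, Fraction(5)])    # no float involved -> Fraction(7, 2)

for result in (all_int_exact, all_int_inexact, int_and_float, int_and_fraction):
    print(type(result).__name__, result)
```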
If the first example becomes 3.0, I'm not sure this would be an issue, given that division of ints (like 4 / 2) returns a float. But I would be surprised if any example above raised a TypeError. And I would assume the last should work without typecasting anything to float.

On the other hand, one issue I commonly have with Pandas is its "always typecast to float" behavior. For example, when adding a row with an empty cell for a numeric column, it forces that cell to be NaN and typecasts the rest to float unless you explicitly set the column dtype to "object" or a similar alternative. The rolling/windowing Pandas tools are also "hardcoded" to typecast Decimal and other numeric types to float, making it more difficult to avoid floats.

-- Danilo J. S. Bellini

On Fri, 13 May 2022 at 00:20, Steven D'Aprano <steve@pearwood.info> wrote:
I'm only a very small-time user of it, but the only combination I use is int and X, where X is what I'm using in general (e.g. float, Fraction). As long as int can upcast to everything, I'm fine with it.
ChrisA

On 13May2022 00:17, Steven D'Aprano <steve@pearwood.info> wrote:
> Users of the statistics module, how often do you use it with heterogeneous data (mixed numeric types)?

Disclaimer: I am not yet a user of the statistics module.

> Would you be satisfied with a rule that said that the statistics functions expect homogeneous data and that the result of calling the functions on mixed types is not guaranteed?

As a general purpose programmer, I would be happy with this. Almost certainly happier than accepting mixed data, because I'd be forced to consider what I expect to get _back_ from the functions by supplying a consistent thing _to_ them.

My statistics knowledge is not much thicker than a veneer telling how to spell "mean" and "median", but in the general case I'd probably prefer:
- a rule like the above requiring homogeneous data
- some convenience functions to produce homogeneous data from mixed data, with docstrings detailing how that is done, possibly slightly broken up if that makes it easy for users to do their own convert-to-homogeneous operation

I'm also attracted to doing "O(n) convert to homogeneous" followed by a _fast_ operation, rather than an accept-heterogeneous-but-be-much-slower one.

Cheers,
Cameron Simpson <cs@cskk.id.au>
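The "convert first, then compute" idea could be sketched roughly as follows. `to_homogeneous` is a hypothetical helper, not part of the statistics module, and it assumes the target type's constructor accepts every value in the data (Decimal, for instance, does not accept a Fraction directly):

```python
from fractions import Fraction
from statistics import mean


def to_homogeneous(data, target=float):
    """Hypothetical helper: convert every value to `target` in one O(n) pass.

    Raises whatever the `target` constructor raises if a value cannot be
    converted, so the caller sees unsupported combinations up front.
    """
    return [target(x) for x in data]


mixed = [1, 2.5, Fraction(1, 2)]

as_floats = to_homogeneous(mixed)                # all float
as_fractions = to_homogeneous(mixed, Fraction)   # all Fraction, exact

float_mean = mean(as_floats)
fraction_mean = mean(as_fractions)

print(type(float_mean).__name__, float_mean)
print(type(fraction_mean).__name__, fraction_mean)
```

The caller chooses the result type explicitly by choosing `target`, which is the point of the suggestion: the statistics function itself only ever sees one type.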

Steven D'Aprano writes:
> Users of the statistics module, how often do you use it with heterogeneous data (mixed numeric types)?

I don't use it often, but when I do the only coercion I ever need to work is int -> float. If I used anything else, I would convert first anyway on the grounds of EIBTI ("explicit is better than implicit"). That's also what I teach my students (though I don't think I've ever seen a student use anything but int and float).

Steve

participants (6)
- Cameron Simpson
- Chris Angelico
- Danilo J. S. Bellini
- Jonathan Fine
- Stephen J. Turnbull
- Steven D'Aprano