
Hi all, I have raised a tracker item and PEP for adding a statistics module to the standard library: http://bugs.python.org/issue18606 http://www.python.org/dev/peps/pep-0450/ There has been considerable discussion on python-ideas, which is now reflected by the PEP. I've signed the Contributor Agreement, and submitted a patch containing updated code and tests. The tests aren't yet integrated with the test runner but are runnable manually. Can I request that people please look at this issue, with an aim to ruling on the PEP and (hopefully) adding the module to 3.4 before feature freeze? If it is accepted, I am willing to be primary maintainer for this module in the future. Thanks, -- Steven

The PEP and code look generally good to me. I think the API for median and its variants deserves some wider discussion: the reference implementation has a callable 'median', and variant callables 'median.low', 'median.high', 'median.grouped'. The pattern of attaching the variant callables as attributes on the main callable is unusual, and isn't something I've seen elsewhere in the standard library. I'd like to see some explanation in the PEP for why it's done this way. (There was already some discussion of this on the issue, but that was more centered around the implementation than the API.) I'd propose two alternatives for this: either have separate functions 'median', 'median_low', 'median_high', etc., or have a single function 'median' with a "method" argument that takes a string specifying computation using a particular method. I don't see a really good reason to deviate from standard patterns here, and fear that users would find the current API surprising. Mark On Thu, Aug 15, 2013 at 2:25 AM, Steven D'Aprano <steve@pearwood.info>wrote:
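For concreteness, the two alternatives proposed above might look roughly like this (a hypothetical sketch, not the PEP's reference implementation):

```python
# Hypothetical sketch of the two proposed alternatives, not the PEP's code.

# Alternative 1: separate top-level functions.
def median(data):
    """Return the middle value, averaging the two middle values if needed."""
    data = sorted(data)
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2

def median_low(data):
    """Return the lower of the two middle values for even-length data."""
    data = sorted(data)
    return data[(len(data) - 1) // 2]

# Alternative 2: a single function taking a string "method" argument.
def median_by_method(data, method="default"):
    if method == "low":
        return median_low(data)
    return median(data)
```

Either way, the simple case stays simple: `median(my_data)` works unchanged.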

On 15/08/13 21:42, Mark Dickinson wrote:
Alexander Belopolsky has convinced me (off-list) that my current implementation is better changed to a more conservative one of a callable singleton instance with methods implementing the alternative computations. I'll have something like:

    def _singleton(cls):
        return cls()

    @_singleton
    class median:
        def __call__(self, data):
            ...
        def low(self, data):
            ...
        ...

In my earlier stats module, I had a single median function that took an argument to choose between alternatives. I called it "scheme":

    median(data, scheme="low")

R uses a parameter called "type" to choose between alternate calculations, not for median as we are discussing, but for quantiles: quantile(x, probs ... type = 7, ...). SAS also uses a similar system, but with different numeric codes. I rejected both "type" and "method" as the parameter name since it would cause confusion with the usual meanings of those words. I eventually decided against this system for two reasons:

- Each scheme ended up needing to be a separate function, for ease of both implementation and testing. So I had four private median functions, which I put inside a class to act as a namespace and avoid polluting the main namespace. Then I needed a "master function" to select which of the methods should be called, with all the additional testing and documentation that entailed.

- The API doesn't really feel very Pythonic to me. For example, we write:

    mystring.rjust(width)
    dict.items()

rather than mystring.justify(width, "right") or dict.iterate("items"). So I think individual methods are a better API, and one which is more familiar to most Python users. The only innovation (if that's what it is) is to have median a callable object.

As for having four separate functions, median, median_low, etc., it just doesn't feel right to me. It puts four slight variations of the same function into the main namespace, instead of keeping them together in a namespace.
Names like median_low merely simulate a namespace, with pseudo-methods separated by underscores instead of dots, only without the advantages of a real namespace. (I treat variance and std dev differently, and make the sample and population forms separate top-level functions rather than methods, simply because they are so well-known from scientific calculators that it is unthinkable to me to do differently. Whenever I use numpy, I am surprised all over again that it has only a single variance function.) -- Steven

On Thu, Aug 15, 2013 at 6:48 PM, Ryan <rymg19@gmail.com> wrote:
For the naming, how about changing median(callable) to median.regular? That way, we don't have to deal with a callable namespace.
Hmm. That sounds like a step backwards to me: whatever the API is, a simple "from statistics import median; m = median(my_data)" should still work in the simple case. Mark

On Thu, Aug 15, 2013 at 2:08 PM, Steven D'Aprano <steve@pearwood.info>wrote:
That's just an implementation issue, though, and sounds like a minor inconvenience to the implementor rather than anything serious; I don't think that that should dictate the API that's used. - The API doesn't really feel very Pythonic to me. For example, we write:
And I guess this is subjective: conversely, the API you're proposing doesn't feel Pythonic to me. :-) I'd like the hear the opinion of other python-dev readers. Thanks for the detailed replies. Would it be possible to put some of this reasoning into the PEP? Mark

On 08/15/2013 01:58 PM, Mark Dickinson wrote:
I agree with Mark: the proposed median, median.low, etc., doesn't feel right. Is there any example of doing this in the stdlib? I suggest just median(), median_low(), etc. If we do end up keeping it, simpler than the callable singleton is:
Eric.

On Thu, 15 Aug 2013 14:10:39 -0400, "Eric V. Smith" <eric@trueblade.com> wrote:
I too prefer the median_low naming rather than median.low. I'm not sure I can articulate why, but certainly the fact that the latter isn't used anywhere else in the stdlib that I can think of is probably a lot of it :) Perhaps the underlying thought is that we don't use classes as pure function namespaces: we expect classes to be something more than that. --David

+1 for the PEP in general from me, but using the underscore based pseudo-namespace for the median variants. The attribute approach isn't *wrong*, just surprising enough that I think independent functions with the "median_" prefix in their name is a better idea. Cheers, Nick.

On 8/15/2013 2:24 PM, R. David Murray wrote:
Actually, there is one place I can think of: itertools.chain.from_iterable. But I think that was a mistake, too. As a recent discussion showed, it's not exactly discoverable. The fact that it's not mentioned in the list of functions at the top of the documentation doesn't help. And "chain" is documented as a "module function", and "chain.from_iterable" as a "classmethod", making it all the more confusing. I think itertools.combinations and itertools.combinations_with_replacement are the better example of related functions that should be followed. Not nested, no special parameters trying to differentiate them: just two different function names. -- Eric.
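For reference, the behavioural difference between the two itertools spellings mentioned above:

```python
from itertools import chain, combinations, combinations_with_replacement

# chain takes the iterables as separate arguments;
# chain.from_iterable takes a single iterable of iterables.
assert list(chain([1, 2], [3])) == [1, 2, 3]
assert list(chain.from_iterable([[1, 2], [3]])) == [1, 2, 3]

# combinations and combinations_with_replacement are simply two
# separate top-level names, with no nesting.
assert list(combinations("AB", 2)) == [("A", "B")]
assert list(combinations_with_replacement("AB", 2)) == [
    ("A", "A"), ("A", "B"), ("B", "B")]
```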

On 8/15/2013 4:16 PM, Eric V. Smith wrote:
Great implied idea. I opened http://bugs.python.org/issue18752 "Make chain.from_iterable an alias for a new chain_iterable." -- Terry Jan Reedy

On 16/08/13 04:24, R. David Murray wrote:
And the reason it's not used in the stdlib is because whenever somebody proposes doing so, python-dev says "but it's never been used in the stdlib before". *wink*
To be perfectly frank, I agree! Using a class is not my first preference, and I'm suspicious of singletons, but classes and instances are the only flexible namespace type we have short of modules and packages. We have nothing like C++ namespaces. (I have some ideas about that, but they are experimental and utterly not appropriate for first-time use in a std lib module.) Considering how long the namespaces line has been part of the Zen, Python is surprisingly inflexible when it comes to namespaces. There are classes, and modules, and nothing in-between. A separate module for median is too much. There's only four functions. If there were a dozen such functions, I'd push them out into a module, but using a full-blown package structure just for the sake of median is overkill. It is possible to construct a module object on the fly, but I expect that would be even less welcome than a class, and besides, modules aren't callable, which leads to such ugly and error-prone constructions as datetime.datetime and friends. I won't impose "median.median" or "median.regular" on anyone :-) Anyway, this is my last defence of median.low() and friends. If consensus is still against it, I'll use underscores. (I will add a section in the PEP about it, one way or the other.) -- Steven

On 15 Aug 2013, at 21:10, "Eric V. Smith" <eric@trueblade.com> wrote:
There's the patch decorator in unittest.mock which provides: patch(...) patch.object(...) patch.dict(...) The implementation is exactly as you suggest. (e.g. patch.object = _patch_object) Michael
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html
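The unittest.mock pattern Michael describes is visible in a minimal, runnable example; patch.object really is an attribute on the patch callable:

```python
from unittest import mock

class Config:
    debug = False

# patch.object lives as an attribute on the patch callable itself --
# the same shape as the proposed median.low.
with mock.patch.object(Config, "debug", True):
    assert Config.debug is True   # patched inside the block

assert Config.debug is False      # automatically restored on exit
```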

On Thu, 15 Aug 2013 23:28:39 +0300, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
Truthfully there are a number of things about the mock API that make me uncomfortable, including that one. But despite that I'm glad we didn't try to re-engineer it. Take that as you will :) --David

On 16 Aug 2013, at 02:30, R. David Murray <rdmurray@bitdance.com> wrote:
Hah. mock used to provide separate patch and patch_object "functions" (they're really just factory functions for classes) but "patch.object" and "patch.dict" are easy to remember and you only have to import a single object instead of a proliferation. In my experience it's been a better API. The separate function was deprecated and removed a while ago. Other parts of the mock API and architecture are somewhat legacy - it's a six year old project with a lot of users, so it's somewhat inevitable. If starting from scratch I wouldn't do it *very* differently though. Michael

On 16/08/13 04:10, Eric V. Smith wrote:
I agree with Mark: the proposed median, median.low, etc., doesn't feel right. Is there any example of doing this in the stdlib?
The most obvious case is datetime: we have datetime(), and datetime.now(), datetime.today(), and datetime.strftime(). The only API difference between it and median is that datetime is a type and median is not, but that's a difference that makes no difference: both are callables, and being a type is an implementation detail. dict used to be a function that returned a type. Now it is a type. Implementation detail. Even builtins do this: dict() and dict.fromkeys(), for example. If you include unbound methods, nearly every type in Python uses the callable(), callable.method() API. I am truly perplexed by the opposition to the median API. It's a trivially small difference to a pattern you find everywhere.
That is the implementation I currently have. Alexander has convinced me that attaching functions to functions in this way is sub-optimal, because help(median) doesn't notice the attributes, so I'm ruling this implementation out. My preference is to make median a singleton instance with a __call__ method, and the other flavours regular methods. Although I don't like polluting the global namespace with an unnecessary class that will only be instantiated once, if it helps I can do this:

    class _Median:
        def __call__(self, data):
            ...
        def low(self, data):
            ...

    median = _Median()

If that standard OOP design is unacceptable, I will swap the dots for underscores, but I won't like it. -- Steven

On 8/15/2013 10:44 PM, Steven D'Aprano wrote:
I, and several others, see them as conceptually different in a way that makes a big difference. Datetime is a number structure with distinct properties and operations. The median of a set of values from a totally ordered set is the middle value (if there is an odd number of them). The median is a function, and both the result and its type depend on the type of the inputs. The only complication is when there is an even number of items and the middle two cannot be averaged. I presume that is what median_low is about (pick the lower of the middle two). It is a variant function with a more general definition, not a method of a type. None of the above has anything to do with Python implementations. -- Terry Jan Reedy

On 8/15/2013 10:44 PM, Steven D'Aprano wrote:
Except those classmethods are all alternate constructors for the class of which they're members (it's datetime.strptime, not .strftime). That's a not uncommon idiom. To me, that's a logical difference from the proposed median. I understand it's all just namespaces and callables, but I think the proposed median(), median.low(), etc. just confuse users and make things less discoverable. I'd expect dir(statistics) to tell me all of the available functions in the module. I wouldn't expect to need to look inside all of the returned functions to see what other functions exist. To see what I mean, look at help(itertools), and see how much harder it is to find chain.from_iterable than it is to find combinations_with_replacement. BTW, I'm +1 on adding the statistics module. -- Eric.

On Thu, Aug 15, 2013 at 7:44 PM, Steven D'Aprano <steve@pearwood.info>wrote:
Steven, this is a completely inappropriate comparison. datetime.now(), dict.fromkeys() and others are *factory methods*, also known as alternative constructors. This is a very common idiom in OOP, especially in languages where there is no explicit operator overloading for constructors (and even in those languages, like C++, this idiom is used above some level of complexity). This is totally unlike using a class as a namespace. The latter is unpythonic. If you need a namespace, use a module. If you don't need a namespace, then just use functions. Classes are the wrong tool to express the namespace abstraction in Python. Eli

On Fri, 16 Aug 2013 12:44:54 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
Of course it does. The datetime classmethods return datetime instances, which is why it makes sense to have them classmethods (as opposed to module functions). The median functions, however, don't return median instances.
Using "OOP design" for something which is conceptually not OO (you are just providing callables in the end, not types and objects: your _Median "type" doesn't carry any state) is not really standard in Python. It would be in Java :-) Regards Antoine.

On 15 August 2013 14:08, Steven D'Aprano <steve@pearwood.info> wrote:
Although you're talking about median() above I think that this same reasoning applies to the mode() signature. In the reference implementation it has the signature: def mode(data, max_modes=1): ... The behaviour is that with the default max_modes=1 it will return the unique mode or raise an error if there isn't a unique mode:
You can use the max_modes parameter to specify that more than one mode is acceptable and setting max_modes to 0 or None returns all modes no matter how many. In these cases mode() returns a list:
I can't think of a situation where 1 or 2 modes are acceptable but 3 is not. The only forms I can imagine using are mode(data) to get the unique mode if it exists and mode(data, max_modes=None) to get the set of all modes. But for that usage it would be better to have a boolean flag and then either way you're at the point where it would normally become two functions. Also I dislike changing the return type based on special numeric values:
My preference would be to have two functions, one called e.g. modes() and one called mode(). modes() always returns a list of the most frequent values no matter how many. mode() returns a unique mode if there is one or raises an error. I think that that would be simpler to document and easier to learn and use. If the user is for whatever reason happy with 1 or 2 modes but not 3 then they can call modes() and check for themselves. Also I think that:
Oscar

On 16/08/13 17:47, Oscar Benjamin wrote:
Hmmm, I think you are right. The current design is left over from when mode also supported continuous data, and it made more sense there.
Alright, you've convinced me. I'll provide two functions: mode, which returns the single value with the highest frequency, or raises; and a second function, which collates the data into a sorted (value, frequency) list. Bike-shedding on the name of this second function is welcomed :-) -- Steven
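The two-function design settled on above can be sketched with collections.Counter (a hypothetical sketch; the names and the error type are illustrative, not necessarily what the final module will use):

```python
from collections import Counter

def modes(data):
    """Return all values tied for the highest frequency (sketch)."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

def mode(data):
    """Return the unique mode, or raise if there isn't one (sketch)."""
    found = modes(data)
    if len(found) != 1:
        raise ValueError("data has no unique mode")
    return found[0]
```

A caller who is happy with multiple modes calls modes() and inspects the result; mode() stays strict.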

On Aug 16, 2013 11:05 AM, "Steven D'Aprano" <steve@pearwood.info> wrote:
I'll provide two functions: mode, which returns the single value with the
highest frequency, or raises; and a second function, which collates the data into a sorted (value, frequency) list. Bike-shedding on the name of this second function is welcomed :-) I'd call it counts() and prefer an OrderedDict for easy lookup. By that point you're very close to Counter though (which it currently uses internally). Oscar

On 15/08/13 14:08, Steven D'Aprano wrote:
Horrible.
In my earlier stats module, I had a single median function that took an argument to choose between alternatives. I called it "scheme":
median(data, scheme="low")
What is wrong with this? It's a perfect API; simple and self-explanatory. median is a function in the mathematical sense and it should be a function in Python.
There are other words to choose from ;) "scheme" seems OK to me.
These are methods on objects; the result of these calls depends on the value of the 'self' argument, not merely its class. Not so with a median singleton. We also have len(seq) and copy.copy(obj). No classes required.

On Thu, Aug 15, 2013 at 2:25 AM, Steven D'Aprano <steve@pearwood.info>wrote:
Bah. I seem to have forgotten how to not top-post. Apologies. Please ignore the previous message, and I'll try again... The PEP and code look generally good to me. I think the API for median and its variants deserves some wider discussion: the reference implementation has a callable 'median', and variant callables 'median.low', 'median.high', 'median.grouped'. The pattern of attaching the variant callables as attributes on the main callable is unusual, and isn't something I've seen elsewhere in the standard library. I'd like to see some explanation in the PEP for why it's done this way. (There was already some discussion of this on the issue, but that was more centered around the implementation than the API.) I'd propose two alternatives for this: either have separate functions 'median', 'median_low', 'median_high', etc., or have a single function 'median' with a "method" argument that takes a string specifying computation using a particular method. I don't see a really good reason to deviate from standard patterns here, and fear that users would find the current API surprising. Mark

On 8/14/2013 9:25 PM, Steven D'Aprano wrote:
I have avoided this discussion, in spite of a decade+ experience as a statistician-programmer, because I am quite busy with Idle testing and there seem to be enough other knowledgeable people around. But I approve of the general idea. I once naively used the shortcut computing formula for variance, present in all too many statistics books, in a program I supplied to a couple of laboratories. After a few months, maybe even a year, of daily use, it crashed trying to take the square root of a negative variance*. Whoops. Fortunately, I was still around to quickly fix it. *As I remember, the three values were something like 10000, 10000, 10001 as single-precision floats. -- Terry Jan Reedy
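The failure Terry describes is catastrophic cancellation in the textbook "shortcut" formula sum(x**2) - n*mean**2. It is easy to reproduce even in double precision by shifting small-variance data far from zero (the numbers below are illustrative, not Terry's original data):

```python
import math
from statistics import variance

def naive_variance(xs):
    # Textbook "shortcut" formula: sum(x**2) - n*mean**2, then divide.
    # Numerically disastrous when the mean is large relative to the spread.
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)

def two_pass_variance(xs):
    # Subtract the mean first, then square: the cancellation never happens.
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.fsum((x - mean) ** 2 for x in xs) / (n - 1)

data = [1e8, 1e8 + 0.1, 1e8 + 0.2]   # true sample variance is 0.01
print(naive_variance(data))     # nowhere near 0.01; can even come out negative
print(two_pass_variance(data))  # close to 0.01
print(variance(data))           # close to 0.01; statistics.variance is careful
```

With single-precision floats, as in Terry's story, the same effect appears at much smaller magnitudes.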

Hi all, I think that PEP 450 is now ready for a PEP dictator. There have been a number of code reviews, and feedback has been taken into account. The test suite passes. I'm not aware of any unanswered issues with the code. At least two people other than myself think that the implementation is ready for a dictator, and nobody has objected. There is still on-going work on speeding up the implementation of the statistics.sum function, but that will not affect the interface or substantially change the test suite. http://bugs.python.org/issue18606 http://www.python.org/dev/peps/pep-0450/ -- Steven

Going over the open issues:

- Parallel arrays or arrays of tuples? I think the API should require an array of tuples. It is trivial to zip up parallel arrays to the required format, while if you have an array of tuples, extracting the parallel arrays is slightly more cumbersome. Also for manipulating the raw data, an array of tuples makes it easier to do insertions or removals without worrying about losing the correspondence between the arrays.

- Requiring concrete sequences as opposed to iterators sounds fine. I'm guessing that good algorithms for doing certain calculations in a single pass, assuming the full input doesn't fit in memory, are quite different from good algorithms for doing the same calculations without having that worry. (Just like you can't expect to use the same code to do a good job of sorting in-memory and on-disk data.)

- Postponing some algorithms to Python 3.5 sounds fine.

On Sun, Sep 8, 2013 at 9:06 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
-- --Guido van Rossum (python.org/~guido)
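Converting between the two candidate formats is a one-liner in either direction, which supports the point above that requiring an array of tuples costs callers little:

```python
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]   # array of tuples

# Array of tuples -> parallel sequences:
xs, ys = zip(*rows)
assert xs == (1.0, 2.0, 3.0) and ys == (10.0, 20.0, 30.0)

# Parallel sequences -> array of tuples:
assert list(zip(xs, ys)) == rows
```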

On Sun, Sep 8, 2013 at 5:26 PM, Greg <greg.ewing@canterbury.ac.nz> wrote:
I'd be hesitant to add just that one function, given that there's hardly any support for multi-dimensional arrays in the stdlib. (NumPy of course has a transpose(), and that's where it arguably belongs.) -- --Guido van Rossum (python.org/~guido)

On Mon, Sep 09, 2013 at 12:26:05PM +1200, Greg wrote:
I've intentionally left out multivariate statistics from the initial version of statistics.py so there will be plenty of time to get feedback from users before deciding on an API for 3.5. If there were a transpose function in the std lib, the obvious place would be the statistics module itself. There is precedent: R includes a transpose function, and presumably the creators of R expect it to be used frequently, because they've given it a single-letter name. http://stat.ethz.ch/R-manual/R-devel/library/base/html/t.html -- Steven

Never mind, I found the patch and the issue. I really think that the *PEP* is ready for inclusion after the open issues are changed into something like Discussion or Future Work, and after adding a more prominent link to the issue with the patch. Then the *patch* can be reviewed some more until it is ready -- it looks very close already. On Sun, Sep 8, 2013 at 10:32 AM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sun, Sep 08, 2013 at 10:51:57AM -0700, Guido van Rossum wrote:
I've updated the PEP as requested. Is there anything further that needs to be done to have it approved? http://www.python.org/dev/peps/pep-0450/ -- Steven

I'm ready to accept this PEP. Because I haven't read this entire thread (and 60 messages about random diversions is really too much to try and catch up on) I'll give people 24 hours to remind me of outstanding objections. I also haven't reviewed the code in any detail, but I believe the code review is going well, so I'm not concerned that the PEP would have to be revised based on that alone. On Fri, Sep 13, 2013 at 5:59 PM, Steven D'Aprano <steve@pearwood.info>wrote:
-- --Guido van Rossum (python.org/~guido)

On 16 September 2013 16:42, Guido van Rossum <guido@python.org> wrote:
I think Steven has addressed all of the issues raised. Briefly, from memory:

1) There was concern about having an additional sum function. Steven has pointed out that neither of sum/fsum is accurate for all stdlib numeric types, as is the intention for the statistics module. It is not possible to modify either of sum/fsum in a backward compatible way that would make them suitable here.

2) The initial names for the median functions were median.low, median.high, etc. This naming scheme was considered non-standard by some and has been redesigned as median_low, median_high, etc. (there was also discussion about the method used to attach the names to the median function, but this became irrelevant after the rename).

3) The mode function also provided an algorithm for estimating the mode of a continuous probability distribution from a sample. It was suggested that there is no uniquely good way of doing this and that it is not commonly needed. This was removed and the API for mode() was simplified (it now returns a unique mode or raises an error).

4) Some of the functions (e.g. variance) used different algorithms (and produced different results) when given an iterator instead of a collection. These have been changed to always use the same algorithm, building a collection internally if necessary.

5) It was suggested that it should also be possible to compute the mean of e.g. timedelta objects, but it was pointed out that they can be converted to numbers with the timedelta.total_seconds() method.

6) I raised an issue about the way the sum function behaved for decimals, but this was addressed in a subsequent patch presenting a new sum function that isn't susceptible to accumulated rounding errors with Decimals.

Oscar
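The workaround mentioned in point 5, converting through timedelta.total_seconds(), is a one-liner with the module's mean():

```python
from datetime import timedelta
from statistics import mean

durations = [timedelta(minutes=1), timedelta(minutes=3)]

# Convert to seconds, average, then convert back to a timedelta.
average = timedelta(seconds=mean(d.total_seconds() for d in durations))
assert average == timedelta(minutes=2)
```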

On Mon, Sep 16, 2013 at 08:42:12AM -0700, Guido van Rossum wrote:
There are a couple of outstanding issues that I am aware of, but I don't believe that either of them affects acceptance/rejection of the PEP. Please correct me if I am wrong. 1) Implementation details of the statistics.sum function. Oscar is giving me a lot of very valuable assistance speeding up the implementation of sum. 2) The current implementation has extensive docstrings, but will also need a separate statistics.rst file. I don't recall any other outstanding issues; if I have forgotten any, please remind me.
-- Steven

On Mon, Sep 16, 2013 at 4:59 PM, Steven D'Aprano <steve@pearwood.info>wrote:
Those certainly don't stand in the way of the PEP's acceptance (but they do block the commit of the code :-). The issues that Oscar listed also all seem resolved (though they would make a nice addition to the "Discussion" section in the PEP). -- --Guido van Rossum (python.org/~guido)

Congrats, I've accepted the PEP. Nice work! Please work with the reviewers on the issue on the code. (Steven or Oscar, if either of you could work Oscar's list of resolved issues into a patch for the PEP I'll happily update it, just mail it to peps@python.org.) On Mon, Sep 16, 2013 at 5:06 PM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On 18 Sep 2013 08:36, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 09/17/2013 02:21 PM, Guido van Rossum wrote:
Congrats, I've accepted the PEP. Nice work! Please work with the
reviewers on the issue on the code.
Congratulations, Stephen!
Yay! Cheers, Nick.

On 8 September 2013 18:32, Guido van Rossum <guido@python.org> wrote:
For something like this, where there are multiple obvious formats for the input data, I think it's reasonable to just request whatever is convenient for the implementation. Otherwise you're asking at least some of your users to convert data from one format to another just so that you can convert it back again. In any real problem you'll likely have more than two variables, so you'll be writing some code to prepare the data for the function anyway. The most obvious alternative that isn't explicitly mentioned in the PEP is to accept either:

    def correlation(x, y=None):
        if y is None:
            xs = []
            ys = []
            for xval, yval in x:
                xs.append(xval)
                ys.append(yval)
        else:
            xs = list(x)
            ys = list(y)
        assert len(xs) == len(ys)
        # In reality a helper function does the above.
        # Now compute stuff

This avoids any unnecessary conversions and is as convenient as possible for all users, at the expense of a slightly more complicated API. Oscar

On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
Not really. The implementation may change, or its needs may not be obvious to the caller. I would say the right thing to do is request something easy to remember, which often means consistent. In general, Python APIs definitely skew towards lists of tuples rather than parallel arrays, and for good reasons -- that way you benefit most from built-in operations like slices and insert/append.
Yeah, so you might as well prepare it in the form that the API expects.
I don't think this is really more convenient -- it is more to learn, and can cause surprises (e.g. when a user is only familiar with one format and then sees an example using the other format, they may be unable to understand the example). The one argument I *haven't* heard yet which *might* sway me would be something along the line "every other statistics package that users might be familiar with does it this way" or "all the statistics textbooks do it this way". (Because, frankly, when it comes to statistics I'm a rank amateur and I really want Steven's new module to educate me as much as help me compute specific statistical functions.) -- --Guido van Rossum (python.org/~guido)

On Sun, Sep 08, 2013 at 02:41:35PM -0700, Guido van Rossum wrote:
On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
The PEP does mention that, as "some combination of the above". The PEP also mentions that the decision of what API to use for multivariate stats is deferred until 3.5, so there's plenty of time for people to bike-shed this :-)
I don't think that there is one common API for multivariate stats packages. It partially depends on whether the package is aimed at basic or advanced use. I haven't done a systematic comparison of the most common, but here are a few examples:

- The Casio Classpad graphing calculator has a spreadsheet-like interface, which I consider equivalent to func(xdata, ydata).
- The HP-48G series of calculators uses a fixed global variable holding a matrix, and a second global variable specifying which columns to use.
- The R "cor" (correlation coefficient) function takes either a pair of vectors (lists), and calculates a single value, or a matrix, in which case it calculates the correlation matrix.
- numpy.corrcoef takes one or two array arguments, and a third argument specifying whether to treat rows or columns as variables, and like R returns either a single value or the correlation matrix.
- Minitab expects two separate vector arguments, and returns the correlation coefficient between them.
- If I'm reading the below page correctly, the SAS corr procedure takes anything up to 27 arguments. http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/proc... I don't suggest we follow that API :-)

Quite frankly, I consider the majority of stats APIs to be confusing, with a steep learning curve. -- Steven

Guido van Rossum writes:
I don't necessarily find this persuasive. It's more common when working with existing databases that you add variables rather than observations. This is going to require attention to the correspondence in any case. Observations aren't added, and they're "removed" temporarily for statistics on subsets by slicing. If you use the same slice for all variables, you're not going to make a mistake.
However, it's common in economic statistics to have a rectangular array, and extract both certain rows (tuples of observations on variables) and certain columns (variables). For example you might have data on populations of American states from 1900 to 2012, and extract the data on New England states from 1946 to 2012 for analysis.
In economic statistics, most software traditionally inputs variables in column-major order (ie, parallel arrays). That said, most software nowadays allows input as spreadsheet tables. You pays your money and you takes your choice. I think the example above of state population data shows that rows and columns are pretty symmetric here. Many databases will have "too many" of both, and you'll want to "slice" both to get the sample and variables relevant to your analysis. This is all just for consideration; I am quite familiar with economic statistics and software, but not so much for that used in sociology, psychology, and medical applications. In the end, I think it's best to leave it up to Steven's judgment as to what is convenient for him to maintain.

Yeah, so this and Steven's review of various other APIs suggests that the field of statistics hasn't really reached the object-oriented age (or perhaps the OO view isn't suitable for the field), and people really think of their data as a matrix of some sort. We should respect that. Now, if this was NumPy, it would *still* make sense to require a single argument, to be interpreted in the usual fashion. So I'm using that as a kind of leverage to still recommend taking a list of pairs instead of a pair of lists. Also, it's quite likely that at least *some* of the users of the new statistics module will be more familiar with OO programming (e.g. the Python DB API, PEP 249) than they are with other statistics packages. On Sun, Sep 8, 2013 at 7:57 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On 9 September 2013 04:16, Guido van Rossum <guido@python.org> wrote:
I'm not sure if I understand what you mean by this. Numpy has built everything on top of a core ndarray class whose methods make the issues about multivariate stats APIs trivial. The transpose of an array A is simply the attribute A.T which is both convenient and cheap since it's just an alternate view on the underlying buffer. Also numpy provides record arrays that enable you to use names instead of numeric indices:
So perhaps the statistics module could have a similar NamedTupleArray type that can be easily loaded and saved from a csv file and makes it easy to put your data in whatever form is required. Oscar
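Something along these lines can already be approximated with the stdlib; a rough sketch of the csv/namedtuple plumbing (the NamedTupleArray type itself is hypothetical, and the field names below are made up):

```python
import csv
import io
from collections import namedtuple

# Parse a small CSV sample; the first row supplies the field names.
raw = "height,weight\n1.6,60\n1.7,70\n1.8,80\n"
reader = csv.reader(io.StringIO(raw))
Row = namedtuple("Row", next(reader))
rows = [Row(*map(float, line)) for line in reader]

# The data is stored as rows of tuples, but columns are still easy
# to extract by name...
heights = [r.height for r in rows]
# ...or positionally, by transposing with zip().
columns = list(zip(*rows))
print(heights)  # [1.6, 1.7, 1.8]
```

This gives row-major storage (the "list of tuples" convention) while keeping column access cheap.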

On 9/8/2013 10:57 PM, Stephen J. Turnbull wrote:
My experience with general scientific research is the opposite. One decides on the variables to measure and then adds rows (records) of data as you measure each experimental or observational subject. New calculated variables may be added (and often are) after the data collection is complete (at least for the moment). Time series analysis is a distinct and specialized subfield of statistics. The corresponding data collection is often different: one may start with a fixed set of subjects (50 US states for instance) and add 'variables' (population in year X) indefinitely. Much economic statistics is in this category. A third category is interaction analysis, where the data form a true matrix where both rows and columns represent subjects and entries represent interaction (how many times John emailed Joe, for instance). -- Terry Jan Reedy

When Steven first brought up this PEP on comp.lang.python, my main concern was basically, "we have SciPy, why do we need this?" Steven's response, which I have come to accept, is that there are uses for basic statistics for which SciPy's stats module would be overkill. However, once you start slicing your data structure along more than one axis, I think you very quickly will find that you need numpy arrays for performance reasons, at which point you might as well go "all the way" and install SciPy. I don't think slicing along multiple dimensions should be a significant concern for this package. Alternatively, I thought there was discussion a long time ago about getting numpy's (or even further back, Numeric's?) array type into the core. Python has an array type which I don't think gets a lot of use (or love). Might it be worthwhile to make sure the PEP 450 package works with that? Then extend it to multiple dimensions? Or just bite the bullet and get numpy's array type into the Python core once and for all? Sort of Tulip for arrays... Skip

On 9 Sep 2013 20:46, "Skip Montanaro" <skip@pobox.com> wrote:
Aka memoryview :) Stefan Krah already fixed most of the multidimensional support issues in 3.3 (including the "cast" method to reinterpret the contents in a different format). The main missing API elements are multidimensional slicing and the ability to export them from types defined in Python. Cheers, Nick.

On 9 September 2013 12:56, Nick Coghlan <ncoghlan@gmail.com> wrote:
Being very familiar with numpy's ndarrays and not so much with memoryviews this prompted me to go and have a look at them. How exactly are you supposed to create a multidimensional array using memoryviews? The best I could come up with was something like: $ py -3.3 Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
However I don't seem to be able to access the elements:
And the .cast method bails if you try to use a more useful type code:
Oscar
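For the record, the pieces do fit together in simple cases on current Python. A sketch of one combination that works (casting a 1-D bytes buffer straight to a 2-D 'd' view, then using tuple indexing); multi-dimensional *slicing* is the part that remains unsupported:

```python
import struct

# Pack 12 C doubles into a flat bytes buffer...
buf = struct.pack('12d', *range(12))

# ...and reinterpret it as a 3x4 matrix of doubles. Casts from a
# byte-format source to a multi-dimensional view are permitted.
m = memoryview(buf).cast('d', shape=[3, 4])

print(m.shape)        # (3, 4)
print(m[1, 2])        # 6.0 -- element access via an ndim-tuple index
print(m.tolist()[2])  # [8.0, 9.0, 10.0, 11.0]
```

Something like `m[1:, 2:]` is where memoryview still gives up and NumPy takes over.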

On 9 Sep 2013 22:58, "Oscar Benjamin" <oscar.j.benjamin@gmail.com> wrote:
Oops, forgot the type casting restrictions, too. My main point was that PEP 3118 is already intended as the tulip equivalent for multi-dimensional arrays, and memoryview is the stdlib API for that. It's just incomplete, since most serious multi-dimensional use cases involve skipping memoryview and going straight to NumPy or one of the image libraries. As far as I am aware, there's no opposition to fixing the multi-dimensional support in memoryview *per se*, just the usual concerns about maintainability and a question of volunteers with the time to actually resolve the relevant open issues on the bug tracker. The fairly extensive 3.3 changes focused on fixing stuff that was previously outright broken, but several limitations remain, often because the best API wasn't clear, or because it reached the point where "just use NumPy" seemed like a justifiable answer. Cheers, Nick.
Oscar

On Mon, Sep 09, 2013 at 05:44:43AM -0500, Skip Montanaro wrote:
I agree. I'm not interested in trying to compete with numpy in areas where numpy is best. That's a fight any pure-Python module is going to lose :-)
I haven't tested PEP 450 statistics with numpy array, but any sequence type ought to work. While I haven't done extensive testing on the array.array type, basic testing shows that it works as expected:

py> import array
py> import statistics
py> data = array.array('f', range(1, 101))
py> statistics.mean(data)
50.5
py> statistics.variance(data)
841.6666666666666

-- Steven

On 9/8/2013 5:41 PM, Guido van Rossum wrote:
This question has been discussed in the statistical software community for decades, going back to when storage was on magnetic tape, where contiguity was even more important than cache locality. In my experience with multiple packages, the most common format for input is tables where rows represent cases, samples, or whatever, which translates as lists of records (or tuples), just as with relational databases. Columns then represent a 'variable'. So I think we should go with that. Some packages might transpose the data internally, but that is an internal matter. The tradeoff is that storing by cases makes adding a new case easier, while storing by variables makes adding a new variable easier. -- Terry Jan Reedy

Steven, I'd like to just approve the PEP, given the amount of discussion that's happened already (though I didn't follow much of it). I quickly glanced through the PEP and didn't find anything I'd personally object to, but then I found your section of open issues, and I realized that you don't actually specify the proposed API in the PEP itself. It's highly unusual to approve a PEP that doesn't contain a specification. What did I miss? On Sun, Sep 8, 2013 at 5:37 AM, Steven D'Aprano <steve@pearwood.info> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sun, Sep 08, 2013 at 10:25:22AM -0700, Guido van Rossum wrote:
You didn't miss anything, but I may have. Should the PEP go through each public function in the module (there are only 11)? That may be a little repetitive, since most have the same, or almost the same, signatures. Or is it acceptable to just include an overview? I've come up with this:

API

The initial version of the library will provide univariate (single variable) statistics functions. The general API will be based on a functional model ``function(data, ...) -> result``, where ``data`` is a mandatory iterable of (usually) numeric data. The author expects that lists will be the most common data type used, but any iterable type should be acceptable. Where necessary, functions may convert to lists internally. Where possible, functions are expected to conserve the type of the data values, for example, the mean of a list of Decimals should be a Decimal rather than float.

Calculating the mean, median and mode

The ``mean``, ``median`` and ``mode`` functions take a single mandatory argument and return the appropriate statistic, e.g.:

>>> mean([1, 2, 3])
2.0

``mode`` is the sole exception to the rule that the data argument must be numeric. It will also accept an iterable of nominal data, such as strings.

Calculating variance and standard deviation

In order to be similar to scientific calculators, the statistics module will include separate functions for population and sample variance and standard deviation. All four functions have similar signatures, with a single mandatory argument, an iterable of numeric data, e.g.:

>>> variance([1, 2, 2, 2, 3])
0.5

All four functions also accept a second, optional, argument, the mean of the data. This is modelled on a similar API provided by the GNU Scientific Library[18]. There are three use-cases for using this argument, in no particular order:

1) The value of the mean is known *a priori*.
2) You have already calculated the mean, and wish to avoid calculating it again.
3) You wish to (ab)use the variance functions to calculate the second moment about some given point other than the mean. In each case, it is the caller's responsibility to ensure that the given argument is meaningful. Is this satisfactory or do I need to go into more detail? -- Steven
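For reference, in the statistics module as it eventually shipped the optional mean argument is named ``xbar`` in the sample functions and ``mu`` in the population functions. A minimal sketch of use-case 2:

```python
import statistics

data = [1, 2, 2, 2, 3]
xbar = statistics.mean(data)           # compute the mean once...
s2 = statistics.variance(data, xbar)   # ...and reuse it for the sample variance

print(xbar)  # 2.0
print(s2)    # 0.5
```

Passing a wrong value for xbar is the caller's problem, exactly as the PEP text warns: the function trusts it rather than recomputing.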

On 8 September 2013 20:19, Steven D'Aprano <steve@pearwood.info> wrote: [...]
Is this satisfactory or do I need to go into more detail?
It describes only 7 functions, and yet you state there are 11. I'd suggest you add a 1-line summary of each function, something like: mean - calculate the (arithmetic) mean of the data median - calculate the median value of the data etc. Paul

On Sun, Sep 08, 2013 at 09:14:39PM +0100, Paul Moore wrote:
Thanks Paul, will do. I think PEP 1 needs to be a bit clearer about this part of the process. For instance, if I had a module with 100 functions and methods, would I need to document all of them in the PEP? I expect not, but then I didn't expect I needed to document all 11 either :-) -- Steven

The PEP and code look generally good to me. I think the API for median and its variants deserves some wider discussion: the reference implementation has a callable 'median', and variant callables 'median.low', 'median.high', 'median.grouped'. The pattern of attaching the variant callables as attributes on the main callable is unusual, and isn't something I've seen elsewhere in the standard library. I'd like to see some explanation in the PEP for why it's done this way. (There was already some discussion of this on the issue, but that was more centered around the implementation than the API.) I'd propose two alternatives for this: either have separate functions 'median', 'median_low', 'median_high', etc., or have a single function 'median' with a "method" argument that takes a string specifying computation using a particular method. I don't see a really good reason to deviate from standard patterns here, and fear that users would find the current API surprising. Mark On Thu, Aug 15, 2013 at 2:25 AM, Steven D'Aprano <steve@pearwood.info>wrote:

On 15/08/13 21:42, Mark Dickinson wrote:
Alexander Belopolsky has convinced me (off-list) that my current implementation is better changed to a more conservative one of a callable singleton instance with methods implementing the alternative computations. I'll have something like:

def _singleton(cls):
    return cls()

@_singleton
class median:
    def __call__(self, data): ...
    def low(self, data): ...
    ...

In my earlier stats module, I had a single median function that took an argument to choose between alternatives. I called it "scheme": median(data, scheme="low") R uses a parameter called "type" to choose between alternate calculations, not for median as we are discussing, but for quantiles: quantile(x, probs ... type = 7, ...). SAS also uses a similar system, but with different numeric codes. I rejected both "type" and "method" as the parameter name since it would cause confusion with the usual meanings of those words. I eventually decided against this system for two reasons:

- Each scheme ended up needing to be a separate function, for ease of both implementation and testing. So I had four private median functions, which I put inside a class to act as a namespace and avoid polluting the main namespace. Then I needed a "master function" to select which of the methods should be called, with all the additional testing and documentation that entailed.

- The API doesn't really feel very Pythonic to me. For example, we write: mystring.rjust(width) dict.items() rather than mystring.justify(width, "right") or dict.iterate("items"). So I think individual methods is a better API, and one which is more familiar to most Python users. The only innovation (if that's what it is) is to have median a callable object.

As far as having four separate functions, median, median_low, etc., it just doesn't feel right to me. It puts four slight variations of the same function into the main namespace, instead of keeping them together in a namespace.
Names like median_low merely simulate a namespace with pseudo-methods separated with underscores instead of dots, only without the advantages of a real namespace. (I treat variance and std dev differently, and make the sample and population forms separate top-level functions rather than methods, simply because they are so well-known from scientific calculators that it is unthinkable to me to do differently. Whenever I use numpy, I am surprised all over again that it has only a single variance function.) -- Steven

On Thu, Aug 15, 2013 at 6:48 PM, Ryan <rymg19@gmail.com> wrote:
For the naming, how about changing median(callable) to median.regular? That way, we don't have to deal with a callable namespace.
Hmm. That sounds like a step backwards to me: whatever the API is, a simple "from statistics import median; m = median(my_data)" should still work in the simple case. Mark

On Thu, Aug 15, 2013 at 2:08 PM, Steven D'Aprano <steve@pearwood.info>wrote:
That's just an implementation issue, though, and sounds like a minor inconvenience to the implementor rather than anything serious; I don't think that that should dictate the API that's used. - The API doesn't really feel very Pythonic to me. For example, we write:
And I guess this is subjective: conversely, the API you're proposing doesn't feel Pythonic to me. :-) I'd like the hear the opinion of other python-dev readers. Thanks for the detailed replies. Would it be possible to put some of this reasoning into the PEP? Mark

On 08/15/2013 01:58 PM, Mark Dickinson wrote:
I agree with Mark: the proposed median, median.low, etc., doesn't feel right. Is there any example of doing this in the stdlib? I suggest just median(), median_low(), etc. If we do end up keeping it, simpler than the callable singleton is:
Eric.

On Thu, 15 Aug 2013 14:10:39 -0400, "Eric V. Smith" <eric@trueblade.com> wrote:
I too prefer the median_low naming rather than median.low. I'm not sure I can articulate why, but certainly the fact that the latter isn't used anywhere else in the stdlib that I can think of is probably a lot of it :) Perhaps the underlying thought is that we don't use classes as pure function namespaces: we expect classes to be something more than that. --David

+1 for the PEP in general from me, but using the underscore based pseudo-namespace for the median variants. The attribute approach isn't *wrong*, just surprising enough that I think independent functions with the "median_" prefix in their name is a better idea. Cheers, Nick.

On 8/15/2013 2:24 PM, R. David Murray wrote:
Actually, there is one place I can think of: itertools.chain.from_iterable. But I think that was a mistake, too. As a recent discussion showed, it's not exactly discoverable. The fact that it's not mentioned in the list of functions at the top of the documentation doesn't help. And "chain" is documented as a "module function", and "chain.from_iterable" as a "classmethod", making it all the more confusing. I think itertools.combinations and itertools.combinations_with_replacement are the better example of related functions that should be followed. Not nested, no special parameters trying to differentiate them: just two different function names. -- Eric.
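For anyone who hasn't bumped into it, the attribute-style precedent being discussed looks like this:

```python
from itertools import chain

# chain() takes the iterables as separate arguments...
print(list(chain([1, 2], [3, 4])))                  # [1, 2, 3, 4]

# ...while the variant hangs off the function as an attribute and
# takes a single iterable of iterables instead.
print(list(chain.from_iterable([[1, 2], [3, 4]])))  # [1, 2, 3, 4]
```

The proposed median/median.low layout would have been the same shape: a callable with variant callables attached as attributes.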

On 8/15/2013 4:16 PM, Eric V. Smith wrote:
Great implied idea. I opened http://bugs.python.org/issue18752 "Make chain.from_iterable an alias for a new chain_iterable." -- Terry Jan Reedy

On 16/08/13 04:24, R. David Murray wrote:
And the reason it's not used in the stdlib is because whenever somebody proposes doing so, python-dev says "but it's never been used in the stdlib before". *wink*
To be perfectly frank, I agree! Using a class is not my first preference, and I'm suspicious of singletons, but classes and instances are the only flexible namespace type we have short of modules and packages. We have nothing like C++ namespaces. (I have some ideas about that, but they are experimental and utterly not appropriate for first-time use in a std lib module.) Considering how long the namespaces line has been part of the Zen, Python is surprisingly inflexible when it comes to namespaces. There are classes, and modules, and nothing in-between.

A separate module for median is too much. There's only four functions. If there were a dozen such functions, I'd push them out into a module, but using a full-blown package structure just for the sake of median is overkill. It is possible to construct a module object on the fly, but I expect that would be even less welcome than a class, and besides, modules aren't callable, which leads to such ugly and error-prone constructions as datetime.datetime and friends. I won't impose "median.median" or "median.regular" on anyone :-)

Anyway, this is my last defence of median.low() and friends. If consensus is still against it, I'll use underscores. (I will add a section in the PEP about it, one way or the other.) -- Steven

On 15 Aug 2013, at 21:10, "Eric V. Smith" <eric@trueblade.com> wrote:
There's the patch decorator in unittest.mock which provides: patch(...) patch.object(...) patch.dict(...) The implementation is exactly as you suggest. (e.g. patch.object = _patch_object) Michael
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html
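The patch precedent Michael mentions is easy to demonstrate; a small illustration using unittest.mock (in the stdlib since 3.3), with a toy class invented for the example:

```python
from unittest import mock

class Greeter:
    def greet(self):
        return "hello"

# patch.object is an attribute of the patch callable, mirroring the
# median/median.low layout being debated in this thread.
with mock.patch.object(Greeter, "greet", return_value="patched"):
    print(Greeter().greet())   # patched

print(Greeter().greet())       # hello (the original is restored on exit)
```

Here one import (`mock.patch`) gives access to the whole family of patchers, which is the convenience Michael is describing.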

On Thu, 15 Aug 2013 23:28:39 +0300, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
Truthfully there are a number of things about the mock API that make me uncomfortable, including that one. But despite that I'm glad we didn't try to re-engineer it. Take that as you will :) --David

On 16 Aug 2013, at 02:30, R. David Murray <rdmurray@bitdance.com> wrote:
Hah. mock used to provide separate patch and patch_object "functions" (they're really just factory functions for classes) but "patch.object" and "patch.dict" are easy to remember and you only have to import a single object instead of a proliferation. In my experience it's been a better API. The separate function was deprecated and removed a while ago. Other parts of the mock API and architecture are somewhat legacy - it's a six year old project with a lot of users, so it's somewhat inevitable. If starting from scratch I wouldn't do it *very* differently though. Michael

On 16/08/13 04:10, Eric V. Smith wrote:
I agree with Mark: the proposed median, median.low, etc., doesn't feel right. Is there any example of doing this in the stdlib?
The most obvious case is datetime: we have datetime(), and datetime.now(), datetime.today(), and datetime.strftime(). The only API difference between it and median is that datetime is a type and median is not, but that's a difference that makes no difference: both are callables, and being a type is an implementation detail. dict used to be a function that returned a type. Now it is a type. Implementation detail. Even builtins do this: dict() and dict.fromkeys(), for example. If you include unbound methods, nearly every type in Python uses the callable(), callable.method() API. I am truly perplexed by the opposition to the median API. It's a trivially small difference to a pattern you find everywhere.
That is the implementation I currently have. Alexander has convinced me that attaching functions to functions in this way is sub-optimal, because help(median) doesn't notice the attributes, so I'm ruling this implementation out. My preference is to make median a singleton instance with a __call__ method, and the other flavours regular methods. Although I don't like polluting the global namespace with an unnecessary class that will only be instantiated once, if it helps I can do this:

class _Median:
    def __call__(self, data): ...
    def low(self, data): ...

median = _Median()

If that standard OOP design is unacceptable, I will swap the dots for underscores, but I won't like it. -- Steven
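The stdlib precedents cited above are easy to check interactively; whichever side of the constructor-vs-namespace argument one takes, the surface pattern is a callable with callable attributes:

```python
from datetime import date

# dict() is callable, and dict.fromkeys() is a callable attribute of it.
print(dict(a=1))               # {'a': 1}
print(dict.fromkeys("ab", 0))  # {'a': 0, 'b': 0}

# Same shape with date: the type is callable, and alternate
# constructors hang off it as attributes.
print(date(2013, 8, 16))       # 2013-08-16
print(date.fromordinal(1))     # the proleptic start of the calendar
```

The counter-argument in the following messages is that these attributes all *construct instances of the callable's own type*, which median.low would not.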

On 8/15/2013 10:44 PM, Steven D'Aprano wrote:
I and several others see them as conceptually different in a way that makes a big difference. Datetime is a number structure with distinct properties and operations. The median of a set of values from a totally ordered set is the middle value (if there is an odd number). The median is a function, and the result of the function and the type of the result depend on the type of the inputs. The only complication is when there are an even number of items and the middle two cannot be averaged. I presume that is what median_low is about (pick the lower of the middle two). It is a variant function with a more general definition, not a method of a type. None of the above have anything to do with Python implementations. -- Terry Jan Reedy

On 8/15/2013 10:44 PM, Steven D'Aprano wrote:
Except those classmethods are all alternate constructors for the class of which they're members (it's datetime.strptime, not .strftime). That's a not uncommon idiom. To me, that's a logical difference from the proposed median. I understand it's all just namespaces and callables, but I think the proposed median(), median.low(), etc. just confuse users and make things less discoverable. I'd expect dir(statistics) to tell me all of the available functions in the module. I wouldn't expect to need to look inside all of the returned functions to see what other functions exist. To see what I mean, look at help(itertools), and see how much harder it is to find chain.from_iterable than it is to find combinations_with_replacement. BTW, I'm +1 on adding the statistics module. -- Eric.

On Thu, Aug 15, 2013 at 7:44 PM, Steven D'Aprano <steve@pearwood.info>wrote:
Steven, this is a completely inappropriate comparison. datetime.now(), dict.fromkeys() and others are *factory methods*, also known as alternative constructors. This is a very common idiom in OOP, especially in languages where there is no explicit operator overloading for constructors (and even in those languages, like C++, this idiom is used above some level of complexity). This is totally unlike using a class as a namespace. The latter is unpythonic. If you need a namespace, use a module. If you don't need a namespace, then just use functions. Classes are the wrong tool to express the namespace abstraction in Python. Eli

On Fri, 16 Aug 2013 12:44:54 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
Of course it does. The datetime classmethods return datetime instances, which is why it makes sense to have them classmethods (as opposed to module functions). The median functions, however, don't return median instances.
Using "OOP design" for something which is conceptually not OO (you are just providing callables in the end, not types and objects: your _Median "type" doesn't carry any state) is not really standard in Python. It would be in Java :-) Regards Antoine.

On 15 August 2013 14:08, Steven D'Aprano <steve@pearwood.info> wrote:
Although you're talking about median() above, I think that this same reasoning applies to the mode() signature. In the reference implementation it has the signature:

def mode(data, max_modes=1): ...

The behaviour is that with the default max_modes=1 it will return the unique mode or raise an error if there isn't a unique mode:
You can use the max_modes parameter to specify that more than one mode is acceptable and setting max_modes to 0 or None returns all modes no matter how many. In these cases mode() returns a list:
I can't think of a situation where 1 or 2 modes are acceptable but 3 is not. The only forms I can imagine using are mode(data) to get the unique mode if it exists and mode(data, max_modes=None) to get the set of all modes. But for that usage it would be better to have a boolean flag and then either way you're at the point where it would normally become two functions. Also I dislike changing the return type based on special numeric values:
My preference would be to have two functions, one called e.g. modes() and one called mode(). modes() always returns a list of the most frequent values no matter how many. mode() returns a unique mode if there is one or raises an error. I think that that would be simpler to document and easier to learn and use. If the user is for whatever reason happy with 1 or 2 modes but not 3 then they can call modes() and check for themselves. Also I think that:
Oscar
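The two-function design proposed above could be sketched like this (hypothetical names and bodies, not the reference implementation):

```python
from collections import Counter

def modes(data):
    """Return a list of the most frequent values, however many there are."""
    counts = Counter(data)
    if not counts:
        return []
    target = max(counts.values())
    return [value for value, count in counts.items() if count == target]

def mode(data):
    """Return the unique mode, or raise ValueError if it isn't unique."""
    result = modes(data)
    if len(result) != 1:
        raise ValueError("no unique mode; found %d equally common values"
                         % len(result))
    return result[0]

print(modes([1, 1, 2, 2, 3]))  # [1, 2]
print(mode([1, 1, 2]))         # 1
```

The return type of each function is now fixed, so neither needs a flag argument, which is the simplification being argued for.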

On 16/08/13 17:47, Oscar Benjamin wrote:
Hmmm, I think you are right. The current design is leftover from when mode also supported continuous data, and it made more sense there.
Alright, you've convinced me. I'll provide two functions: mode, which returns the single value with the highest frequency, or raises; and a second function, which collates the data into a sorted (value, frequency) list. Bike-shedding on the name of this second function is welcomed :-) -- Steven

On Aug 16, 2013 11:05 AM, "Steven D'Aprano" <steve@pearwood.info> wrote:
I'll provide two functions: mode, which returns the single value with the
highest frequency, or raises; and a second function, which collates the data into a sorted (value, frequency) list. Bike-shedding on the name of this second function is welcomed :-) I'd call it counts() and prefer an OrderedDict for easy lookup. By that point you're very close to Counter though (which it currently uses internally). Oscar

On 15/08/13 14:08, Steven D'Aprano wrote:
Horrible.
In my earlier stats module, I had a single median function that took a argument to choose between alternatives. I called it "scheme":
median(data, scheme="low")
What is wrong with this? It's a perfect API; simple and self-explanatory. median is a function in the mathematical sense and it should be a function in Python.
There are other words to choose from ;) "scheme" seems OK to me.
These are methods on objects; the result of these calls depends on the value of the 'self' argument, not merely its class. Not so with a median singleton. We also have len(seq) and copy.copy(obj). No classes required.

On Thu, Aug 15, 2013 at 2:25 AM, Steven D'Aprano <steve@pearwood.info>wrote:
Bah. I seem to have forgotten how to not top-post. Apologies. Please ignore the previous message. Mark

On 8/14/2013 9:25 PM, Steven D'Aprano wrote:
I have avoided this discussion, in spite of a decade+ experience as a statistician-programmer, because I am quite busy with Idle testing and there seem to be enough other knowledgeable people around. But I approve of the general idea. I once naively used the shortcut computing formula for variance, present in all too many statistics books, in a program I supplied to a couple of laboratories. After a few months, maybe even a year, of daily use, it crashed trying to take the square root of a negative variance*. Whoops. Fortunately, I was still around to quickly fix it. *As I remember, the three values were something like 10000, 10000, 10001 as single-precision floats. -- Terry Jan Reedy
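The failure Terry describes is catastrophic cancellation in the textbook "shortcut" formula, and it is easy to reproduce in double precision too (the data below are illustrative, not Terry's original single-precision values):

```python
def naive_variance(data):
    # Textbook shortcut formula: E[x^2] - E[x]^2. Numerically dangerous,
    # because two huge nearly-equal quantities are subtracted.
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2

def two_pass_variance(data):
    # Compute the mean first, then sum squared deviations from it.
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # true variance is 22.5
print(two_pass_variance(data))  # 22.5
print(naive_variance(data))     # garbage: all the significant digits cancel
```

With large enough offsets the naive result can even come out negative, which is exactly the "square root of a negative variance" crash above.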

Hi all, I think that PEP 450 is now ready for a PEP dictator. There have been a number of code reviews, and feedback has been taken into account. The test suite passes. I'm not aware of any unanswered issues with the code. At least two people other than myself think that the implementation is ready for a dictator, and nobody has objected. There is still on-going work on speeding up the implementation of the statistics.sum function, but that will not affect the interface or substantially change the test suite. http://bugs.python.org/issue18606 http://www.python.org/dev/peps/pep-0450/ -- Steven

Going over the open issues:

- Parallel arrays or arrays of tuples? I think the API should require an array of tuples. It is trivial to zip up parallel arrays to the required format, while if you have an array of tuples, extracting the parallel arrays is slightly more cumbersome. Also for manipulation of the raw data, an array of tuples makes it easier to do insertions or removals without worrying about losing the correspondence between the arrays.

- Requiring concrete sequences as opposed to iterators sounds fine. I'm guessing that good algorithms for doing certain calculations in a single pass, assuming the full input doesn't fit in memory, are quite different from good algorithms for doing the same calculations without having that worry. (Just like you can't expect to use the same code to do a good job of sorting in-memory and on-disk data.)

- Postponing some algorithms to Python 3.5 sounds fine.

On Sun, Sep 8, 2013 at 9:06 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
-- --Guido van Rossum (python.org/~guido)
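The "trivial to zip up" point works in both directions, since zip(*...) is its own inverse:

```python
# Parallel arrays -> rows of tuples:
xdata = [1.0, 2.0, 3.0]
ydata = [10.0, 20.0, 30.0]
rows = list(zip(xdata, ydata))
print(rows)      # [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# Rows of tuples -> parallel arrays again (the slightly more
# cumbersome direction):
xs, ys = zip(*rows)
print(list(xs))  # [1.0, 2.0, 3.0]
```

With the tuple-of-rows format, inserting or deleting an observation is a single list operation, and the x/y correspondence cannot drift, which is the argument being made for requiring it.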

On Sun, Sep 8, 2013 at 5:26 PM, Greg <greg.ewing@canterbury.ac.nz> wrote:
I'd be hesitant to add just that one function, given that there's hardly any support for multi-dimensional arrays in the stdlib. (NumPy of course has a transpose(), and that's where it arguably belongs.) -- --Guido van Rossum (python.org/~guido)

On Mon, Sep 09, 2013 at 12:26:05PM +1200, Greg wrote:
I've intentionally left out multivariate statistics from the initial version of statistics.py so there will be plenty of time to get feedback from users before deciding on an API before 3.5. If there was a transpose function in the std lib, the obvious place would be the statistics module itself. There is precedent: R includes a transpose function, and presumably the creators of R expect it to be used frequently because they've given it a single-letter name. http://stat.ethz.ch/R-manual/R-devel/library/base/html/t.html -- Steven

Never mind, I found the patch and the issue. I really think that the *PEP* is ready for inclusion after the open issues are changed into something like Discussion or Future Work, and after adding a more prominent link to the issue with the patch. Then the *patch* can be reviewed some more until it is ready -- it looks very close already. On Sun, Sep 8, 2013 at 10:32 AM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sun, Sep 08, 2013 at 10:51:57AM -0700, Guido van Rossum wrote:
I've updated the PEP as requested. Is there anything further that needs to be done to have it approved? http://www.python.org/dev/peps/pep-0450/ -- Steven

I'm ready to accept this PEP. Because I haven't read this entire thread (and 60 messages about random diversions is really too much to try and catch up on) I'll give people 24 hours to remind me of outstanding objections. I also haven't reviewed the code in any detail, but I believe the code review is going well, so I'm not concerned that the PEP would have to be revised based on that alone. On Fri, Sep 13, 2013 at 5:59 PM, Steven D'Aprano <steve@pearwood.info>wrote:
-- --Guido van Rossum (python.org/~guido)

On 16 September 2013 16:42, Guido van Rossum <guido@python.org> wrote:
I think Steven has addressed all of the issues raised. Briefly, from memory:

1) There was concern about having an additional sum function. Steven has pointed out that neither of sum/fsum is accurate for all stdlib numeric types, as is the intention for the statistics module. It is not possible to modify either of sum/fsum in a backward-compatible way that would make them suitable here.

2) The initial names for the median functions were median.low, median.high, etc. This naming scheme was considered non-standard by some and has been redesigned as median_low, median_high, etc. (There was also discussion about the method used to attach the names to the median function, but this became irrelevant after the rename.)

3) The mode function also provided an algorithm for estimating the mode of a continuous probability distribution from a sample. It was suggested that there is no uniquely good way of doing this and that it is not commonly needed. This was removed and the API for mode() was simplified (it now returns a unique mode or raises an error).

4) Some of the functions (e.g. variance) used different algorithms (and produced different results) when given an iterator instead of a collection. These are now changed to always use the same algorithm and build a collection internally if necessary.

5) It was suggested that it should also be possible to compute the mean of e.g. timedelta objects, but it was pointed out that they can be converted to numbers with the timedelta.total_seconds() method.

6) I raised an issue about the way the sum function behaved for decimals, but this was changed in a subsequent patch presenting a new sum function that isn't susceptible to accumulated rounding errors with Decimals.

Oscar
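[Editor's note: the renamed functions described in point 2 are what eventually shipped in Python 3.4's statistics module, so the resolved API can be shown directly:]

```python
import statistics

data = [1, 2, 3, 4]
statistics.median(data)       # 2.5 -- interpolates between the middle pair
statistics.median_low(data)   # 2   -- always a member of the data
statistics.median_high(data)  # 3   -- always a member of the data
```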

On Mon, Sep 16, 2013 at 08:42:12AM -0700, Guido van Rossum wrote:
There are a couple of outstanding issues that I am aware of, but I don't believe that either of them affects acceptance/rejection of the PEP. Please correct me if I am wrong.

1) Implementation details of the statistics.sum function. Oscar is giving me a lot of very valuable assistance speeding up the implementation of sum.

2) The current implementation has extensive docstrings, but will also need a separate statistics.rst file.

I don't recall any other outstanding issues; if I have forgotten any, please remind me.
-- Steven

On Mon, Sep 16, 2013 at 4:59 PM, Steven D'Aprano <steve@pearwood.info>wrote:
Those certainly don't stand in the way of the PEP's acceptance (but they do block the commit of the code :-). The issues that Oscar listed also all seem resolved (though they would make a nice addition to the "Discussion" section in the PEP). -- --Guido van Rossum (python.org/~guido)

Congrats, I've accepted the PEP. Nice work! Please work with the reviewers on the issue on the code. (Steven or Oscar, if either of you could work Oscar's list of resolved issues into a patch for the PEP I'll happily update it, just mail it to peps@python.org.) On Mon, Sep 16, 2013 at 5:06 PM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On 18 Sep 2013 08:36, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 09/17/2013 02:21 PM, Guido van Rossum wrote:
Congrats, I've accepted the PEP. Nice work! Please work with the
reviewers on the issue on the code.
Congratulations, Steven!
Yay! Cheers, Nick.

On 8 September 2013 18:32, Guido van Rossum <guido@python.org> wrote:
For something like this, where there are multiple obvious formats for the input data, I think it's reasonable to just request whatever is convenient for the implementation. Otherwise you're asking at least some of your users to convert data from one format to another just so that you can convert it back again. In any real problem you'll likely have more than two variables, so you'll be writing some code to prepare the data for the function anyway.

The most obvious alternative that isn't explicitly mentioned in the PEP is to accept either:

    def correlation(x, y=None):
        if y is None:
            xs = []
            ys = []
            for x, y in x:
                xs.append(x)
                ys.append(y)
        else:
            xs = list(x)
            ys = list(y)
        assert len(xs) == len(ys)
        # In reality a helper function does the above.
        # Now compute stuff

This avoids any unnecessary conversions and is as convenient as possible for all users at the expense of having a slightly more complicated API.

Oscar

On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
Not really. The implementation may change, or its needs may not be obvious to the caller. I would say the right thing to do is request something easy to remember, which often means consistent. In general, Python APIs definitely skew towards lists of tuples rather than parallel arrays, and for good reasons -- that way you benefit most from built-in operations like slices and insert/append.
Yeah, so you might as well prepare it in the form that the API expects.
I don't think this is really more convenient -- it is more to learn, and can cause surprises (e.g. when a user is only familiar with one format and then sees an example using the other format, they may be unable to understand the example). The one argument I *haven't* heard yet which *might* sway me would be something along the line "every other statistics package that users might be familiar with does it this way" or "all the statistics textbooks do it this way". (Because, frankly, when it comes to statistics I'm a rank amateur and I really want Steven's new module to educate me as much as help me compute specific statistical functions.) -- --Guido van Rossum (python.org/~guido)

On Sun, Sep 08, 2013 at 02:41:35PM -0700, Guido van Rossum wrote:
On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
The PEP does mention that, as "some combination of the above". The PEP also mentions that the decision of what API to use for multivariate stats is deferred until 3.5, so there's plenty of time for people to bike-shed this :-)
I don't think that there is one common API for multivariate stats packages. It partially depends on whether the package is aimed at basic use or advanced use. I haven't done a systematic comparison of the most common, but here are a few examples:

- The Casio Classpad graphing calculator has a spreadsheet-like interface, which I consider equivalent to func(xdata, ydata).

- The HP-48G series of calculators uses a fixed global variable holding a matrix, and a second global variable specifying which columns to use.

- The R "cor" (correlation coefficient) function takes either a pair of vectors (lists), in which case it calculates a single value, or a matrix, in which case it calculates the correlation matrix.

- numpy.corrcoef takes one or two array arguments, and a third argument specifying whether to treat rows or columns as variables, and like R returns either a single value or the correlation matrix.

- Minitab expects two separate vector arguments, and returns the correlation coefficient between them.

- If I'm reading the page below correctly, the SAS corr procedure takes anything up to 27 arguments. http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/proc... I don't suggest we follow that API :-)

Quite frankly, I consider the majority of stats APIs to be confusing, with a steep learning curve.

-- Steven
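[Editor's note: to make the "pair of vectors vs. list of pairs" debate concrete, here is a hypothetical Pearson correlation taking the list-of-tuples form Guido favours. This is illustration only -- PEP 450 deferred all multivariate functions, and no such function was part of the accepted module:]

```python
from math import fsum, sqrt

def correlation(data):
    """Pearson's r for an iterable of (x, y) pairs.

    Hypothetical sketch of the list-of-tuples API under discussion;
    not code from the PEP or the thread.
    """
    xs, ys = zip(*data)
    n = len(xs)
    mx, my = fsum(xs) / n, fsum(ys) / n
    sxy = fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = fsum((x - mx) ** 2 for x in xs)
    syy = fsum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

correlation([(1, 2), (2, 4), (3, 6)])   # 1.0 (perfectly linear data)
```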

Guido van Rossum writes:
I don't necessarily find this persuasive. When working with existing databases, it's more common to add variables than to add observations. That is going to require attention to the correspondence in any case. Observations aren't added, and they're "removed" temporarily for statistics on subsets by slicing. If you use the same slice for all variables, you're not going to make a mistake.
However, it's common in economic statistics to have a rectangular array, and extract both certain rows (tuples of observations on variables) and certain columns (variables). For example you might have data on populations of American states from 1900 to 2012, and extract the data on New England states from 1946 to 2012 for analysis.
In economic statistics, most software traditionally inputs variables in column-major order (ie, parallel arrays). That said, most software nowadays allows input as spreadsheet tables. You pays your money and you takes your choice. I think the example above of state population data shows that rows and columns are pretty symmetric here. Many databases will have "too many" of both, and you'll want to "slice" both to get the sample and variables relevant to your analysis. This is all just for consideration; I am quite familiar with economic statistics and software, but not so much for that used in sociology, psychology, and medical applications. In the end, I think it's best to leave it up to Steven's judgment as to what is convenient for him to maintain.

Yeah, so this and Steven's review of various other APIs suggests that the field of statistics hasn't really reached the object-oriented age (or perhaps the OO view isn't suitable for the field), and people really think of their data as a matrix of some sort. We should respect that. Now, if this was NumPy, it would *still* make sense to require a single argument, to be interpreted in the usual fashion. So I'm using that as a kind of leverage to still recommend taking a list of pairs instead of a pair of lists. Also, it's quite likely that at least *some* of the users of the new statistics module will be more familiar with OO programming (e.g. the Python DB API, PEP 249) than they are with other statistics packages. On Sun, Sep 8, 2013 at 7:57 PM, Stephen J. Turnbull <stephen@xemacs.org>wrote:
-- --Guido van Rossum (python.org/~guido)

On 9 September 2013 04:16, Guido van Rossum <guido@python.org> wrote:
I'm not sure if I understand what you mean by this. Numpy has built everything on top of a core ndarray class whose methods make the issues about multivariate stats APIs trivial. The transpose of an array A is simply the attribute A.T which is both convenient and cheap since it's just an alternate view on the underlying buffer. Also numpy provides record arrays that enable you to use names instead of numeric indices:
So perhaps the statistics module could have a similar NameTupleArray type that can be easily loaded and saved from a csv file and makes it easy to put your data in whatever form is required. Oscar
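[Editor's note: Oscar's "NameTupleArray" is hypothetical and was never implemented; a minimal stdlib sketch of the idea, using csv plus namedtuple, with invented field names and data:]

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type -- the field names are invented for illustration.
Record = namedtuple('Record', ['name', 'age', 'weight'])

raw = "alice,38,71.4\nbob,4,15.5\n"   # stands in for a real CSV file
rows = [Record(n, int(a), float(w))
        for n, a, w in csv.reader(io.StringIO(raw))]

ages = [r.age for r in rows]    # extract a "column" by name, not by index
assert ages == [38, 4]
```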

On 9/8/2013 10:57 PM, Stephen J. Turnbull wrote:
My experience with general scientific research is the opposite. One decides on the variables to measure and then adds rows (records) of data as you measure each experimental or observational subject. New calculated variables may be added (and often are) after the data collection is complete (at least for the moment). Time series analysis is a distinct and specialized subfield of statistics. The corresponding data collection is often different: one may start with a fixed set of subjects (50 US states for instance) and add 'variables' (population in year X) indefinitely. Much economic statistics is in this category. A third category is interaction analysis, where the data form a true matrix where both rows and columns represent subjects and entries represent interaction (how many times John emailed Joe, for instance). -- Terry Jan Reedy

When Steven first brought up this PEP on comp.lang.python, my main concern was basically, "we have SciPy, why do we need this?" Steven's response, which I have come to accept, is that there are uses for basic statistics for which SciPy's stats module would be overkill. However, once you start slicing your data structure along more than one axis, I think you very quickly will find that you need numpy arrays for performance reasons, at which point you might as well go "all the way" and install SciPy. I don't think slicing along multiple dimensions should be a significant concern for this package. Alternatively, I thought there was discussion a long time ago about getting numpy's (or even further back, Numeric's?) array type into the core. Python has an array type which I don't think gets a lot of use (or love). Might it be worthwhile to make sure the PEP 450 package works with that? Then extend it to multiple dimensions? Or just bite the bullet and get numpy's array type into the Python core once and for all? Sort of Tulip for arrays... Skip

On 9 Sep 2013 20:46, "Skip Montanaro" <skip@pobox.com> wrote:
Aka memoryview :) Stefan Krah already fixed most of the multidimensional support issues in 3.3 (including the "cast" method to reinterpret the contents in a different format). The main missing API elements are multidimensional slicing and the ability to export them from types defined in Python. Cheers, Nick.

On 9 September 2013 12:56, Nick Coghlan <ncoghlan@gmail.com> wrote:
Being very familiar with numpy's ndarrays and not so much with memoryviews, this prompted me to go and have a look at them. How exactly are you supposed to create a multidimensional array using memoryviews? The best I could come up with was something like:

    $ py -3.3
    Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
However I don't seem to be able to access the elements:
And the .cast method bails if you try to use a more useful type code:
Oscar
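[Editor's note: Oscar's interpreter transcripts were lost in archiving. For reference, a multidimensional memoryview can be built like this -- a sketch assuming Python 3.3+, where cast() accepts a shape when one side of the cast uses the byte format:]

```python
buf = bytearray(8 * 2 * 3)                    # backing store: 48 raw bytes
m = memoryview(buf).cast('d', shape=[2, 3])   # view it as a 2x3 array of doubles

m[0, 1] = 3.5          # element access uses integer tuples...
assert m.shape == (2, 3)
assert m.tolist() == [[0.0, 3.5, 0.0], [0.0, 0.0, 0.0]]
# ...but multi-dimensional *slicing* and sub-views (e.g. m[0]) are
# not implemented, which is among the gaps Nick refers to below.
```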

On 9 Sep 2013 22:58, "Oscar Benjamin" <oscar.j.benjamin@gmail.com> wrote:
Oops, forgot the type casting restrictions, too. My main point was that PEP 3118 is already intended as the tulip equivalent for multi-dimensional arrays, and memoryview is the stdlib API for that. It's just incomplete, since most serious multi-dimensional use cases involve skipping memoryview and go straight to NumPy or one of the image libraries. As far as I am aware, there's no opposition to fixing the multi-dimensional support in memoryview *per se*, just the usual concerns about maintainability and a question of volunteers with the time to actually resolve the relevant open issues on the bug tracker. The fairly extensive 3.3 changes focused on fixing stuff that was previously outright broken, but several limitations remain, often because the best API wasn't clear, or because it reached the point where "just use NumPy" seemed like a justifiable answer. Cheers, Nick.
Oscar

On Mon, Sep 09, 2013 at 05:44:43AM -0500, Skip Montanaro wrote:
I agree. I'm not interested in trying to compete with numpy in areas where numpy is best. That's a fight any pure-Python module is going to lose :-)
I haven't tested PEP 450 statistics with numpy arrays, but any sequence type ought to work. While I haven't done extensive testing on the array.array type, basic testing shows that it works as expected:

    py> import array
    py> import statistics
    py> data = array.array('f', range(1, 101))
    py> statistics.mean(data)
    50.5
    py> statistics.variance(data)
    841.6666666666666

-- Steven

On 9/8/2013 5:41 PM, Guido van Rossum wrote:
This question has been discussed in the statistical software community for decades, going back to when storage was on magnetic tape, where contiguity was even more important than cache locality. In my experience with multiple packages, the most common format for input is tables where rows represent cases, samples, or whatever, which translates as lists of records (or tuples), just as with relational databases. Columns then represent a 'variable'. So I think we should go with that. Some packages might transpose the data internally, but that is an internal matter. The tradeoff is that storing by cases makes adding a new case easier, while storing by variables makes adding a new variable easier. -- Terry Jan Reedy

Steven, I'd like to just approve the PEP, given the amount of discussion that's happened already (though I didn't follow much of it). I quickly glanced through the PEP and didn't find anything I'd personally object to, but then I found your section of open issues, and I realized that you don't actually specify the proposed API in the PEP itself. It's highly unusual to approve a PEP that doesn't contain a specification. What did I miss? On Sun, Sep 8, 2013 at 5:37 AM, Steven D'Aprano <steve@pearwood.info> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sun, Sep 08, 2013 at 10:25:22AM -0700, Guido van Rossum wrote:
You didn't miss anything, but I may have. Should the PEP go through each public function in the module (there are only 11)? That may be a little repetitive, since most have the same, or almost the same, signatures. Or is it acceptable to just include an overview? I've come up with this:

API

The initial version of the library will provide univariate (single variable) statistics functions. The general API will be based on a functional model ``function(data, ...) -> result``, where ``data`` is a mandatory iterable of (usually) numeric data. The author expects that lists will be the most common data type used, but any iterable type should be acceptable. Where necessary, functions may convert to lists internally. Where possible, functions are expected to conserve the type of the data values; for example, the mean of a list of Decimals should be a Decimal rather than a float.

Calculating the mean, median and mode

The ``mean``, ``median`` and ``mode`` functions take a single mandatory argument and return the appropriate statistic, e.g.:

    >>> mean([1, 2, 3])
    2.0

``mode`` is the sole exception to the rule that the data argument must be numeric. It will also accept an iterable of nominal data, such as strings.

Calculating variance and standard deviation

In order to be similar to scientific calculators, the statistics module will include separate functions for population and sample variance and standard deviation. All four functions have similar signatures, with a single mandatory argument, an iterable of numeric data, e.g.:

    >>> variance([1, 2, 2, 2, 3])
    0.5

All four functions also accept a second, optional, argument, the mean of the data. This is modelled on a similar API provided by the GNU Scientific Library [18]. There are three use-cases for this argument, in no particular order:

1) The value of the mean is known *a priori*.
2) You have already calculated the mean, and wish to avoid calculating it again.
3) You wish to (ab)use the variance functions to calculate the second moment about some given point other than the mean.

In each case, it is the caller's responsibility to ensure that the given argument is meaningful.

Is this satisfactory or do I need to go into more detail?

-- Steven
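[Editor's note: the optional mean argument described here survived into the released 3.4 module, so the proposed API can be exercised directly:]

```python
import statistics

data = [1, 2, 2, 2, 3]
assert statistics.variance(data) == 0.5        # sample variance

xbar = statistics.mean(data)                   # 2.0, computed once...
assert statistics.variance(data, xbar) == 0.5  # ...then reused, not recalculated
```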

On 8 September 2013 20:19, Steven D'Aprano <steve@pearwood.info> wrote: [...]
Is this satisfactory or do I need to go into more detail?
It describes only 7 functions, and yet you state there are 11. I'd suggest you add a 1-line summary of each function, something like:

    mean - calculate the (arithmetic) mean of the data
    median - calculate the median value of the data
    etc.

Paul

On Sun, Sep 08, 2013 at 09:14:39PM +0100, Paul Moore wrote:
Thanks Paul, will do. I think PEP 1 needs to be a bit clearer about this part of the process. For instance, if I had a module with 100 functions and methods, would I need to document all of them in the PEP? I expect not, but then I didn't expect I needed to document all 11 either :-) -- Steven
participants (21)

- Alexander Belopolsky
- Antoine Pitrou
- Eli Bendersky
- Eric V. Smith
- Ethan Furman
- Greg
- Guido van Rossum
- Mark Dickinson
- Mark Shannon
- Michael Foord
- Nick Coghlan
- Oscar Benjamin
- Paul Colomiets
- Paul Moore
- R. David Murray
- Ryan
- Serhiy Storchaka
- Skip Montanaro
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy