Running average and stdev in the statistics module?

Hi all, I wonder if the idea of adding to the statistics module a class to calculate the running statistics (average and standard deviation) of a generic input data stream has ever come up in the past. The basic idea is to do the necessary book-keeping as the data are fed into the accumulator class, so that the average and variance of the sequence can be queried at any point in time without having to loop over the data again. The obvious way to do that is well known and described, e.g., in Knuth's TAOCP vol. 2, 3rd edition, page 232.

FWIW, it is something that through the years I have coded myself a myriad of times (e.g., for real-time data processing), and it may be worth considering for addition to the standard library. For completeness, a cursory look on Google brings up this fairly nice package: https://pypi.org/project/runstats/. But really, the core algorithm would be trivial to code in a fashion that works with Decimal and Fraction objects, suitable for integration into the statistics module.

Should this spur enough interest (and assuming that the maintainers of the module are not hostile to the idea), I'd like to volunteer to put together a tentative implementation.

[It's my first post on this list, so please be gentle :-)]

Luca
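For concreteness, here is a minimal sketch of the Welford-style accumulator described above (the one-pass algorithm from Knuth, TAOCP vol. 2, 3rd edition, page 232). The class and method names are illustrative only, not a proposed statistics-module API:

    import math

    class RunningStats:
        """One-pass mean and variance (Welford's algorithm)."""

        def __init__(self):
            self._count = 0
            self._mean = 0.0  # running mean
            self._m2 = 0.0    # sum of squared deviations from the mean

        def push(self, x):
            """Feed one value into the accumulator."""
            self._count += 1
            delta = x - self._mean
            self._mean += delta / self._count
            # Using the *updated* mean here is what makes this update
            # numerically stable, unlike the naive sum-of-squares formula.
            self._m2 += delta * (x - self._mean)

        @property
        def mean(self):
            return self._mean

        @property
        def variance(self):
            """Sample variance (n - 1 in the denominator)."""
            if self._count < 2:
                raise ValueError("variance requires at least two data points")
            return self._m2 / (self._count - 1)

        @property
        def stdev(self):
            return math.sqrt(self.variance)

After any number of push() calls the current statistics are available in O(1); e.g., feeding in [2, 4, 4, 4, 5, 5, 7, 9] gives mean 5.0 and stdev of about 2.14, with no second pass over the data.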

On Sun, May 5, 2019 at 1:08 PM Luca Baldini <luca.baldini@pi.infn.it> wrote:
Personally, I would definitely use this in a number of places in the real-life code I contribute to. The problem I have with this idea is that it's not clear how the data should be stored in an accumulator class. What about cases with different contexts in asyncio and/or multithreaded code? I would say it could be useful to allow passing a storage implementation from user code, to cover almost any possible scenario. In that case, such an accumulator doesn't need to be a class at all, or to bother with any intermediate storage: it could be a number of module-level functions providing an efficient implementation of the algorithm for users to build on.
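To make that concrete, here is a sketch of the stateless variant (all names hypothetical): plain functions over an immutable state value, so the caller decides where the state lives, whether a local variable, a contextvar, or some external store:

    import math
    from typing import NamedTuple

    class MeanVarState(NamedTuple):
        count: int = 0
        mean: float = 0.0
        m2: float = 0.0  # sum of squared deviations from the mean

    def update(state, x):
        """Return a new state with x folded in (Welford update)."""
        count = state.count + 1
        delta = x - state.mean
        mean = state.mean + delta / count
        return MeanVarState(count, mean, state.m2 + delta * (x - mean))

    def stdev(state):
        """Sample standard deviation of the data seen so far."""
        if state.count < 2:
            raise ValueError("need at least two data points")
        return math.sqrt(state.m2 / (state.count - 1))

Because the state is a plain tuple, it can be pickled, kept in a contextvar, or stored per task, without the module dictating any storage policy.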

On Mon, May 6, 2019, 10:11 AM Serge Matveenko <s@matveenko.ru> wrote:
I've often wanted a windowing function in itertools. One exists as a recipe in the docs. If I remember correctly, one reason this was never implemented is that the most efficient implementation changes depending on the size of the window: use a deque(maxlen=n) for large windows and tuple slicing/concatenation for tiny windows. I'm not sure how the tee/zip trick compares.
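For reference, a sketch of the two strategies mentioned above; the deque version is essentially the recipe from the itertools docs, while the tuple version pays a copy per step but has less constant overhead for tiny windows:

    from collections import deque
    from itertools import islice

    def windowed_deque(iterable, n):
        """Yield overlapping length-n tuples; better for large windows."""
        it = iter(iterable)
        window = deque(islice(it, n), maxlen=n)
        if len(window) == n:
            yield tuple(window)
        for x in it:
            window.append(x)  # the oldest item falls off the left end
            yield tuple(window)

    def windowed_tuple(iterable, n):
        """Yield overlapping length-n tuples; better for tiny windows."""
        it = iter(iterable)
        window = tuple(islice(it, n))
        if len(window) == n:
            yield window
        for x in it:
            window = window[1:] + (x,)
            yield window

Both produce, e.g., (1, 2), (2, 3), (3, 4) for a window of size 2 over [1, 2, 3, 4].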

On Fri, May 10, 2019 at 2:33 PM Steven D'Aprano <steve@pearwood.info> wrote:
Storing a running metric of the code in a multithreaded (or multi-process) environment is a common thing to do. Say, it could be the execution duration of a function, which should be accessible from within a Flask app and/or from within celery tasks. This couldn't be achieved with in-memory storage, as the web server has its own isolated memory, and so does each celery worker. So, I would like to be able to implement my own storage for this data, e.g. using Redis.
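As a purely hypothetical sketch of that (assuming the third-party redis-py package; all key names are made up): keep only count, sum, and sum of squares, which map directly onto Redis atomic increments and can therefore be updated from many processes. Note that this naive sum-of-squares formula is less numerically stable than Welford's update; that is the usual trade-off for mergeable, atomic counters:

    import math
    import redis

    r = redis.Redis()

    def record(key, x):
        """Fold one observation into the shared accumulator."""
        # Each increment is atomic on its own; use r.pipeline() if the
        # three updates must be applied together.
        r.incrbyfloat(key + ":count", 1)
        r.incrbyfloat(key + ":sum", x)
        r.incrbyfloat(key + ":sumsq", x * x)

    def running_stats(key):
        """Return (mean, sample stdev) of everything recorded so far."""
        n = float(r.get(key + ":count") or 0)
        if n < 2:
            raise ValueError("need at least two data points")
        s = float(r.get(key + ":sum"))
        sq = float(r.get(key + ":sumsq"))
        mean = s / n
        variance = (sq - n * mean * mean) / (n - 1)
        return mean, math.sqrt(max(variance, 0.0))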

On Sun, 5 May 2019 at 11:08, Luca Baldini <luca.baldini@pi.infn.it> wrote:
There was discussion around this when the PEP was written. I think this is what is alluded to in the PEP under "Future work":

    There is considerable interest in including one-pass functions that can
    calculate multiple statistics from data in iterator form, without having
    to convert to a list. The experimental stats package on PyPI includes
    co-routine versions of statistics functions. Including these will be
    deferred to 3.5.

https://www.python.org/dev/peps/pep-0450/#future-work

At the time I believe there was brief discussion about what the API should look like, but it was deferred for future work. For the sake of argument, how about:

    from statistics import RunningStats, RunningMean, RunningMax

    stats = RunningStats({'mean': RunningMean, 'max': RunningMax})
    stats.push_many([1, 5, 2, 4])
    print(stats.running_stats())  # {'mean': 3, 'max': 5}

-- Oscar
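One possible shape for that API, purely as a sketch (none of these classes exists in the statistics module today):

    class RunningMean:
        def __init__(self):
            self._count = 0
            self._total = 0

        def push(self, x):
            self._count += 1
            self._total += x

        def value(self):
            return self._total / self._count

    class RunningMax:
        def __init__(self):
            self._max = None

        def push(self, x):
            if self._max is None or x > self._max:
                self._max = x

        def value(self):
            return self._max

    class RunningStats:
        """Fan each incoming value out to a set of named accumulators."""

        def __init__(self, stats):
            # stats maps names to accumulator factories (e.g. classes).
            self._stats = {name: factory() for name, factory in stats.items()}

        def push(self, x):
            for stat in self._stats.values():
                stat.push(x)

        def push_many(self, values):
            for x in values:
                self.push(x)

        def running_stats(self):
            return {name: stat.value() for name, stat in self._stats.items()}

With these definitions, the example above prints {'mean': 3.0, 'max': 5}.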

Hi Luca, I'm the original author of the statistics module, and I'm very interested in your idea for calculating running statistics. However, feature-freeze for 3.8 is not far away (about three weeks), so I think it would have to be deferred until 3.9. But I encourage you to give some thought (either privately, or publicly here in this thread) to the features you want to see. -- Steven

participants (5): Luca Baldini, Michael Selik, Oscar Benjamin, Serge Matveenko, Steven D'Aprano