[New-bugs-announce] [issue37905] Remove NormalDist.overlap() or improve documentation?

Wed Aug 21 08:38:54 EDT 2019

New submission from Christoph Deil <deil.christoph at googlemail.com>:

I saw that Python 3.8 will add a NormalDist class:
https://docs.python.org/3.8/library/statistics.html#normaldist-objects

Personally I don't see the value of adding this to the Python standard lib. The natural progression would be to extend and extend, but in the end only duplicate what already exists in scientific Python packages.
But Ok, I guess this is not up for debate any more?

I'd like to make a specific comment on NormalDist.overlap.
The rest of NormalDist is very standard, but that method is an oddball.
My suggestion is to remove it or to improve the documentation.

Current docstring: https://github.com/python/cpython/blob/44f2c096804e8e3adc09400a59ef9c9ae843f339/Lib/statistics.py#L959-L991

And this docs example:
https://github.com/python/cpython/commit/318d537daabf2bd5f781255c7e25bfce260cf227#diff-d436928bc44b5d7c40a8047840f55d35R620-R629

> What percentage of men and women will have the same height in `two normally
distributed populations with known means and standard deviations
<http://www.usablestats.com/lessons/normal>`_?

50.3%

This statement doesn't make sense to me. No two people have the exact same height, I think the answer to this question should be 0%.

Using

n = 100_000; sum(m > w for m, w in zip(men.samples(n), women.samples(n))) / n

I see that for 82% of random (men, women) matches the man will be larger. That's another measure, but still, stating that 50% of men and women have the same height is confusing.

Note that there is a multitude of PDF overlap measures different from this min(pdf1, pdf2) that I think are much more common in statistics and the physical sciences:
- https://en.wikipedia.org/wiki/Hellinger_distance
- https://arxiv.org/pdf/1407.7172.pdf

And note that the references that are given currently are weird (basic statistics textbooks would be appropriate references IMO, or open references like Wikipedia)
- slides: http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf
- implementation code comment points to http://dx.doi.org/10.1080/03610928908830127 which is behind a paywall

Why add this one overlap measure and expose it under the "overlap" method name?

My suggestion would be to be conservative and to remove that method again, before releasing it in 3.8. A reference in the docs could be added to other existing third-party codes (e.g. scipy or the uncertainties package) with further functionality, such as being able to handle correlations or multi-dimensional distributions. For this change I'd be happy to send a PR any time.

Raymond and others interested in this topic - thoughts?

(note: I wrote a MultiNorm class prototype last year at https://github.com/cdeil/multinorm/blob/master/multinorm.py and now wanted to rewrite it and try to find a good API and thus was interested in this NormalDist class and what functionality it offers)

----------
components: Library (Lib)
messages: 350076
nosy: Christoph.Deil, rhettinger
priority: normal
severity: normal
status: open
title: Remove NormalDist.overlap() or improve documentation?
type: enhancement
versions: Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37905>
_______________________________________