Python package statistics
Terry Reedy
tjreedy at udel.edu
Fri Oct 18 14:54:11 EDT 2013
On 10/18/2013 8:41 AM, Yaşar Arabacı wrote:
> Hi people,
>
> I collected some data on PyPI and published some statistics about
> packages on PyPI. I think you might find it an interesting read:
>
> http://ysar.net/python/python-package-statistics.html
"b2gpopulate (36MB)
...
Total sizes on packages in PyPI amounted to 4.2 GB. Average package size
is 161 KB and standard deviation is 1MB."
For such highly skewed data, the mean and especially the standard
deviation and confidence intervals are meaningless. They are
'parametric' statistics, which is to say, they were designed for
bell-shaped distributions. (I will not say 'normal' == Gaussian
distributions because they are *not* normal for much raw data.)
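
A quick check with the two figures quoted above shows the problem. This is only a sketch: it assumes 1 MB means 1024 KB, and the variable names are made up for illustration.

```python
# Figures quoted from the article; treating 1 MB as 1024 KB is an assumption.
mean_kb = 161
sd_kb = 1024

# For bell-shaped data, mean +/- sd brackets the bulk of the values.
lower = mean_kb - sd_kb
upper = mean_kb + sd_kb

# The lower bound comes out negative, which is impossible for a file size.
print(f"mean - sd = {lower} KB, mean + sd = {upper} KB")
```

A one-sd interval that reaches below zero says nothing useful about where package sizes actually fall.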
A better summary is obtained from either 'non-parametric' statistics
(median, inter-quartile range) or from 'normalizing' the data (if
possible). For the latter, try taking the square root or log of the
sizes and plotting the distribution. If either works, take the mean and
sd of the transformed values. Then report those and also the
back-transformed mean and mean +/- sd.
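
Both summaries can be sketched with the standard library alone. The sizes below are invented for illustration and are not PyPI data, and the function names are hypothetical. The sketch also assumes the log transform is the one that 'works'; it does not plot anything, so that still has to be checked by eye.

```python
import math
import statistics

# Hypothetical package sizes in KB (made up for illustration, not PyPI data).
sizes_kb = [4, 6, 9, 12, 15, 22, 30, 48, 75, 120, 400, 36000]


def nonparametric_summary(values):
    """Return (median, q1, q3); q3 - q1 is the inter-quartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), q1, q3


def log_summary(values):
    """Mean and sd on the log scale, plus the back-transformed values."""
    logs = [math.log(v) for v in values]
    m = statistics.mean(logs)
    s = statistics.stdev(logs)
    return {
        "log_mean": m,
        "log_sd": s,
        "back_mean": math.exp(m),      # this is the geometric mean
        "back_low": math.exp(m - s),   # back-transformed mean - sd
        "back_high": math.exp(m + s),  # back-transformed mean + sd
    }


median, q1, q3 = nonparametric_summary(sizes_kb)
summary = log_summary(sizes_kb)

print(f"median = {median} KB, IQR = {q1:.1f} to {q3:.1f} KB")
print(f"back-transformed mean = {summary['back_mean']:.1f} KB")
print(f"back-transformed mean +/- sd = "
      f"{summary['back_low']:.1f} to {summary['back_high']:.1f} KB")
```

Note that the back-transformed interval is not symmetric around the back-transformed mean, and it cannot go below zero, which suits sizes.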
--
Terry Jan Reedy