Python 3.x Adoption for PyPI and PyPI Download Numbers
Just thought I'd share this since it shows how what people are using to download things from PyPI has changed over the past year. Of particular interest to most people will be the final graphs showing what percentage of downloads from PyPI are for Python 3.x or 2.x.

As always it's good to keep in mind, "Lies, Damn Lies, and Statistics". I've tried not to bias the results too much, but some bias is unavoidable. Of particular note is that a lot of these numbers come from pip, and as of version 6.0, pip caches downloads by default. This means that older versions of pip are more likely to "inflate" the download counts than newer versions, since they don't cache by default. In addition, if a project has a file which is used for both 2.x and 3.x and someone does a ``pip install`` on 2.x first, then it will be counted under 2.x but not 3.x due to caching (and of course the inverse is true: if they install on 3.x first it won't show up under 2.x).

Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/

Anyways, I'll have access to the data set for another day or two before I shut down the (expensive) server that I have to use to crunch the numbers, so if there's anything anyone else wants to see before I shut it down, speak up soon.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Thanks for the detailed research!
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org
--
--Guido van Rossum (python.org/~guido)
On Tue, Apr 21, 2015 at 01:54:55PM -0400, Donald Stufft wrote:
Anyways, I'll have access to the data set for another day or two before I shut down the (expensive) server that I have to use to crunch the numbers so if there's anything anyone else wants to see before I shut it down, speak up soon.
Where are curl and wget getting categorized in the User Agent graphs? Just morbidly curious as to whether they're in with Browser and therefore mostly unused or Unknown and therefore only slightly less unused ;-)

-Toshio
On Apr 21, 2015, at 3:15 PM, Toshio Kuratomi <a.badger@gmail.com> wrote:
Where are curl and wget getting categorized in the User Agent graphs?
Just morbidly curious as to whether they're in with Browser and therefore mostly unused or Unknown and therefore only slightly less unused ;-)
They get classified as Unknown. Here’s the hacky script I use to parse the log files: https://bpaste.net/show/515017c78e32

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
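To illustrate why curl and wget end up as Unknown, here is a minimal sketch of a first-match User-Agent classifier. It is not the linked script: the rule table, the category names other than those mentioned in this thread, and the sample User-Agent strings are all assumptions made up for the example.

```python
import re

# Hypothetical category rules; these are NOT taken from the linked script.
# Each rule is (category, pattern searched for in the User-Agent string).
RULES = [
    ("pip", re.compile(r"^pip/")),
    ("setuptools", re.compile(r"setuptools/")),
    ("bandersnatch", re.compile(r"^bandersnatch/")),
    ("Browser", re.compile(r"^Mozilla/")),
]


def classify(user_agent):
    """Return the first matching category, or "Unknown" if no rule matches."""
    for category, pattern in RULES:
        if pattern.search(user_agent):
            return category
    return "Unknown"


# Illustrative User-Agent strings (made up for this example).
samples = [
    "pip/6.0.8 CPython/2.7.9 Linux/3.13.0",
    "Python-urllib/2.7 setuptools/12.0",
    "bandersnatch/1.8",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "curl/7.35.0",
    "Wget/1.15 (linux-gnu)",
]

for ua in samples:
    print(f"{classify(ua):<12} {ua}")
```

Because no rule names curl or wget, both fall through to the default bucket, which is the behaviour described above.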
Donald Stufft wrote on 21-04-15 at 19:54:
Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/

Nice, thanks!

I think it is safe to assume that buildout is grouped under "setuptools" (as it uses setuptools under the hood)?

I really like buildout, but I must say that the amount of traction behind pip is quite intimidating :-)

Reinout

--
Reinout van Rees http://reinout.vanrees.org/
reinout@vanrees.org http://www.nelen-schuurmans.nl/
"Learning history by destroying artifacts is a time-honored atrocity"
On Apr 21, 2015, at 6:18 PM, Reinout van Rees <reinout@vanrees.org> wrote:
Donald Stufft wrote on 21-04-15 at 19:54:
Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/

Nice, thanks!
I think it is safe to assume that buildout is grouped under "setuptools" (as it uses setuptools under the hood)?
I really like buildout, but I must say that the amount of traction behind pip is quite intimidating :-)
Assuming buildout doesn’t do anything that would cause the user-agent to be something other than what setuptools itself uses, yes, that’s a safe assumption.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Donald Stufft wrote on 21-04-15 at 19:54:
Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/

The last graph is very, very weird.

The 'requests' library is very popular. Why on earth has Python 2.6 gone from 10% to 25% market share, eating into Python 2.7's share? I haven't seen that in any of the other graphs.

Weird :-)

Reinout

--
Reinout van Rees http://reinout.vanrees.org/
reinout@vanrees.org http://www.nelen-schuurmans.nl/
"Learning history by destroying artifacts is a time-honored atrocity"
On Apr 21, 2015, at 6:26 PM, Reinout van Rees <reinout@vanrees.org> wrote:
Donald Stufft wrote on 21-04-15 at 19:54:
Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/
The last graph is very, very weird.
The 'requests' library is very popular. Why on earth has python 2.6 gone from 10% to 25% market share, eating into python 2.7's share? I haven't seen that in any of the other graphs.
Weird :-)
I have a few guesses:

* RHEL6 is still a major deployment target, but the versions of things packaged in it are getting older and older, which incentivizes people on that platform to start downloading things from PyPI instead of from their package managers. Doing this for pure Python libraries like requests is a lot easier than compiling a whole new Python for your RHEL6 box.
* People using Python 2.6 are more likely to also be using a version of pip prior to 6.0, the release in which caching became enabled by default. So you may actually have more people installing on 2.7, but PyPI never sees that because pip just serves from its cache, whereas Python 2.6 users are not getting cached files.
* People are using a newer version of pip that caches by default, and they are running their tests with something like ``tox`` and they have their envlist sorted like: py26,py27,py32,etc. In this case, if they don’t already have the file cached they’ll download it with Python 2.6, cache it, then re-use that cache for all the subsequent Pythons.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
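The envlist-ordering guess can be sketched as a toy simulation. This is only a model under stated assumptions: a single file usable by every interpreter, one cache shared by all environments, and a made-up file name. It does not reproduce how pip or tox are actually implemented.

```python
from collections import Counter


def simulate(envlist, cache_enabled):
    """Count the downloads the index would observe for one shared file.

    Every environment in envlist installs the same file. With a shared
    cache, only the first environment actually reaches the index.
    """
    seen_by_index = Counter()
    cache = set()
    filename = "example-1.0-py2.py3-none-any.whl"  # hypothetical file name
    for env in envlist:
        if cache_enabled and filename in cache:
            continue  # served locally; the index never sees this install
        seen_by_index[env] += 1
        cache.add(filename)
    return seen_by_index


envlist = ["py26", "py27", "py32"]
print(dict(simulate(envlist, cache_enabled=False)))
# {'py26': 1, 'py27': 1, 'py32': 1}
print(dict(simulate(envlist, cache_enabled=True)))
# {'py26': 1}
```

In this model the whole download is attributed to whichever interpreter comes first in the envlist, so sorting the list from oldest to newest would credit Python 2.6.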
Thanks!

I like your focus on particular packages of note such as django and requests.

How do CDNs influence these "lies"? I thought the download counts on PyPI were effectively meaningless due to CDN mirrors fetching and hosting things?

Do we have user-agent logs from all PyPI package CDN mirrors or just from the master?

-gps
On Apr 21, 2015, at 11:35 PM, Gregory P. Smith <greg@krypto.org> wrote:
Thanks!
I like your focus on particular packages of note such as django and requests.
How do CDNs influence these "lies"? I thought the download counts on PyPI were effectively meaningless due to CDN mirrors fetching and hosting things?
Do we have user-agent logs from all PyPI package CDN mirrors or just from the master?
-gps
We took the download counts offline for a while because of the CDN; however, within a month or two (now almost two years ago) they enabled logs on our account, which let us bring them back. So these numbers are from the CDN edge and they reflect the “true” traffic. I say “true” because although we have logs, logging isn’t considered an essential service, so in times of problems logging can be reduced or disabled completely (you can see in the data set that some weeks had a massive drop; this was due to missing a day or two of logs).

That being said, on top of the Fastly-provided CDN, there is also the ability to mirror PyPI (which shows up as bandersnatch or others in the logs), and if someone is installing from a mirror we don’t see that data at all. On top of that, all versions of pip prior to 6.0 had an opt-in download cache, which means that we wouldn’t see downloads for the people who opted in, and since 6.0 there is an opt-out cache.

As for the mirror network itself, it represents about 20% of the total traffic on PyPI. We can determine when a client was a mirror, and those downloads show up as “Unknown” in the other charts, since with a mirror client we don’t know what the final target environment will be.

This might mean that future snapshots will look at API accesses instead, or perhaps we try to implement some sort of optional popcon, or maybe we continue to look at package installs and just interpret the data with the knowledge that these things are at play.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
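As a rough worked example of how an “Unknown” bucket affects the percentages, the sketch below computes shares with and without it. The counts are made up, chosen only so that Unknown is 20% of the total; they are not taken from the data set, and the real Unknown bucket may contain more than mirror traffic.

```python
def shares(counts, exclude=()):
    """Return each category's percentage of the total, skipping excluded keys."""
    kept = {k: v for k, v in counts.items() if k not in exclude}
    total = sum(kept.values())
    return {k: 100.0 * v / total for k, v in kept.items()}


# Made-up download counts; "Unknown" is 20% of the total by construction.
counts = {"2.x": 680, "3.x": 120, "Unknown": 200}

print(shares(counts))
# {'2.x': 68.0, '3.x': 12.0, 'Unknown': 20.0}
print(shares(counts, exclude=("Unknown",)))
# {'2.x': 85.0, '3.x': 15.0}
```

Dropping the Unknown bucket rescales the known shares but says nothing about what the mirrored installs finally ran on, so either view has to be read with that gap in mind.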
participants (5)
- Donald Stufft
- Gregory P. Smith
- Guido van Rossum
- Reinout van Rees
- Toshio Kuratomi