Python 3.x Adoption for PyPI and PyPI Download Numbers

Just thought I'd share this, since it shows how what people use to download things from PyPI has changed over the past year. Of particular interest to most people will be the final graphs, which show what percentage of downloads from PyPI are for Python 3.x or 2.x.

As always, it's good to keep in mind "Lies, Damn Lies, and Statistics". I've tried not to bias the results too much, but some bias is unavoidable. Of particular note is that a lot of these numbers come from pip, and as of version 6.0, pip caches downloads by default. This means that older versions of pip are more likely to "inflate" the download counts than newer versions, since they don't cache by default. In addition, if a project has a file which is used for both 2.x and 3.x and someone does a ``pip install`` on 2.x first, it will be counted under 2.x but not 3.x due to caching (and of course the inverse is true: if they install on 3.x first, it won't show up under 2.x).

Here's the link: https://caremad.io/2015/04/a-year-of-pypi-downloads/

Anyways, I'll have access to the data set for another day or two before I shut down the (expensive) server that I use to crunch the numbers, so if there's anything anyone else wants to see before I shut it down, speak up soon.

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
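
To make the caching bias concrete, here is a minimal sketch of how a shared local cache would attribute a download to whichever interpreter fetched the file first. It is a simplified model rather than pip's actual cache logic: the function name `count_downloads`, the idea of keying the cache on the filename, and the example filename are all assumptions made for illustration.

```python
from collections import Counter


def count_downloads(installs, cache_enabled):
    """Count the downloads the server would see, keyed by Python series.

    `installs` is an iterable of (filename, python_series) pairs in install
    order, all coming from one machine that shares a single local cache.
    This is a toy model: the cache is keyed on the filename only.
    """
    cached = set()
    counts = Counter()
    for filename, python_series in installs:
        if cache_enabled and filename in cached:
            # Served from the local cache, so the server never logs it.
            continue
        cached.add(filename)
        counts[python_series] += 1
    return counts


# Illustrative input: one universal wheel installed three times.
wheel = "six-1.9.0-py2.py3-none-any.whl"
installs = [(wheel, "2.x"), (wheel, "3.x"), (wheel, "2.x")]

print(count_downloads(installs, cache_enabled=False))  # 2.x: 2, 3.x: 1
print(count_downloads(installs, cache_enabled=True))   # 2.x: 1 only
```

Without a cache every install is logged, which is the "inflation" described above. With a cache only the first fetch is logged, so the 3.x install disappears from the counts entirely.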

On Tue, Apr 21, 2015 at 01:54:55PM -0400, Donald Stufft wrote:
Where are curl and wget getting categorized in the User Agent graphs? Just morbidly curious as to whether they're in with Browser and therefore mostly unused or Unknown and therefore only slightly less unused ;-) -Toshio

They get classified as Unknown. Here’s the hacky script I use to parse the log files: https://bpaste.net/show/515017c78e32

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
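
For readers who cannot reach the paste, the kind of classification being described might look roughly like the sketch below. This is not the linked script: the function name `classify_user_agent`, the regular expressions, and the exact category set are assumptions. Only the outcomes mentioned in this thread (curl and wget landing in Unknown, a separate Browser bucket, and bandersnatch appearing in the logs) come from the discussion.

```python
import re

# Hypothetical patterns, chosen for illustration only.
_INSTALLER_RE = re.compile(r"^(pip|bandersnatch)/(\S+)")
_BROWSER_RE = re.compile(r"^(Mozilla|Opera)/")


def classify_user_agent(user_agent):
    """Return a (category, version) pair for a raw User-Agent string."""
    match = _INSTALLER_RE.match(user_agent)
    if match:
        return match.group(1), match.group(2)
    if "setuptools/" in user_agent:
        return "setuptools", None
    if _BROWSER_RE.match(user_agent):
        return "Browser", None
    # Anything unrecognised, including curl and wget, falls through here.
    return "Unknown", None


for ua in ("pip/1.5.6 CPython/2.7.6 Linux/3.13.0", "curl/7.35.0", "Wget/1.15 (linux-gnu)"):
    print(ua, "->", classify_user_agent(ua))
```

The catch-all at the end is the design point: a client has to be positively recognised to get its own bucket, so generic HTTP tools end up in Unknown rather than Browser.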

On Tue, Apr 21, 2015 at 10:55 AM Donald Stufft <donald@stufft.io> wrote:
Thanks! I like your focus on particular packages of note such as django and requests. How do CDNs influence these "lies"? I thought the download counts on PyPI were effectively meaningless due to CDN mirrors fetching and hosting things? Do we have user-agent logs from all PyPI package CDN mirrors or just from the master? -gps

We took the download counts offline for a while because of the CDN; however, within a month or two (now almost two years ago) they enabled logs on our account, which let us bring them back. So these numbers are from the CDN edge and they reflect the “true” traffic. I say “true” because, although we have logs, logging isn’t considered an essential service, so in times of problems logging can be reduced or disabled completely (you can see in the data set that some weeks had a massive drop; this was due to missing a day or two of logs).

That being said, on top of the Fastly-provided CDN there is also the ability to mirror PyPI (which shows up as bandersnatch or others in the logs), and if someone is installing from a mirror we don’t see that data at all. On top of that, all versions of pip prior to 6.0 had an opt-in download cache, which means that, on an opt-in basis, we wouldn’t see downloads for those people, and since 6.0 there is an opt-out cache.

Specifically for the mirror network itself, that represents about 20% of the total traffic on PyPI. We can determine when a client was a mirror, though, and those downloads show up as “Unknown” in the other charts, since with a mirror client we don’t know what the final target environment will be.

This might mean that future snapshots will look at API accesses instead, or perhaps we try to implement some sort of optional popcon, or maybe we continue to look at package installs and just interpret the data with the knowledge that these things are at play.

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
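
Since a missing day or two of logs shows up as a sharp weekly drop, anyone reusing numbers like these may want to flag the gaps before reading anything into a dip. A minimal sketch, assuming the dates that have at least one log entry are already known; the function name `missing_log_days` and the dates in the example are made up for illustration and are not taken from the real data set.

```python
from datetime import date, timedelta


def missing_log_days(days_with_logs, start, end):
    """Return every date in [start, end] that has no log entries."""
    present = set(days_with_logs)
    missing = []
    day = start
    while day <= end:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Made-up example: logs exist for three of the five days in the range.
logged = [date(2015, 4, 1), date(2015, 4, 2), date(2015, 4, 4)]
print(missing_log_days(logged, date(2015, 4, 1), date(2015, 4, 5)))
```

A week containing any of the returned dates would then be annotated or excluded rather than read as a real drop in traffic.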

participants (4): Donald Stufft, Gregory P. Smith, Guido van Rossum, Toshio Kuratomi