Re: [Python-ideas] [Python-Dev] Python 2.x and 3.x use survey, 2014 edition

On Thu Dec 11 2014 at 11:48:22 AM Giampaolo Rodola' <g.rodola@gmail.com> wrote:
It would be really nice to complement this survey with information gathered from PyPI. I think the Python version of the client is not being recorded now, right? I think this could be very useful not only for the Python devs and the community, but particularly for package maintainers. In this way we could find out not only the number of times a given package was downloaded, but also for which Python version. I understand that this information might not be entirely accurate (e.g. people downloading the file from the browser or from GitHub, or using other repositories like the ones provided by Anaconda), but still I think it could be useful. And with pip becoming the standard tool, it might be a good moment to do it. cheers, Hernán

This information is a few months old, mostly because I’m lazy and creating the information is a pain in the ass.

Total Downloads (for reference): http://d.stufft.io/image/2N293l3v2S1c
% Downloads for Python Version: http://d.stufft.io/image/2g1T2U140h1O
% Downloads for Python Version (zoomed to Py3): http://d.stufft.io/image/0B233A151k1k
Total Downloads for Python Version: http://d.stufft.io/image/3f3f3g3P181M
Bonus - OS Downloads: http://d.stufft.io/image/021v383I0O2c

All of the above filter out anything that has an extremely small number of downloads, so as not to overwhelm the graphs with a ton of small, barely used things.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
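A minimal sketch of the kind of tally behind the "% Downloads for Python Version" graphs, for readers who want to reproduce the idea on their own data. The record layout, the function name version_shares, and the min_downloads threshold are all assumptions made for illustration; the thread does not say how the real graphs were computed, nor whether percentages were taken before or after the small buckets were dropped (here they are taken after).

```python
from collections import Counter


def version_shares(records, min_downloads=2):
    """Return {python_version: percent of downloads}, dropping rare versions.

    `records` is an iterable of (python_version, package) pairs, one per
    download. Both the layout and the threshold are hypothetical.
    """
    counts = Counter(version for version, _package in records)
    # Drop versions with very few downloads so they do not clutter a graph.
    kept = {v: n for v, n in counts.items() if n >= min_downloads}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {v: 100.0 * n / total for v, n in sorted(kept.items())}


# Made-up sample data, not real PyPI numbers.
sample = [("2.7", "requests")] * 6 + [("3.4", "requests")] * 3 + [("2.4", "pip")]
shares = version_shares(sample)
for version, percent in shares.items():
    print("%s: %.1f%%" % (version, percent))
```

With the made-up sample, the single 2.4 download falls below the threshold and is left out of both the output and the denominator.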

On Thu Dec 11 2014 at 12:34:41 PM Donald Stufft <donald@stufft.io> wrote:
I’m one of the PyPI administrators and I compile it from the raw logs.
It would be nice to expose an API for this. I imagine having a tool like vanity[0] but also providing per Python version/OS information. Any way we could help? cheers, Hernan

[0] https://pypi.python.org/pypi/vanity/

On 11 December 2014 at 15:41, Hernan Grecco <hernan.grecco@gmail.com> wrote:
It would be awesome to have these stats for CPython too, possible? Very happy to put up a website with nice visualizations if I can access the raw data. In any case, useful survey, and for me the only blocker for ditching py2 is Google App Engine. Regards Luca -- http://lucasbardella.com

On 12/11/2014 5:38 PM, Luca Sbardella wrote:
It would be awesome to have these stats for CPython too, possible?
In the last report anyone posted (pydev, maybe a year ago) with stats for CPython downloads from python.org, I think 2.x and 3.x were about equal and Windows dominated. I am also curious about the data since then. -- Terry Jan Reedy

On 12/11/2014 10:14 AM, Donald Stufft wrote:
My interpretation:
1. Dropping 2.4, 2.5 support to support 3.x can easily be justified.
2. Starting with 3.3 (for instance, to use u'abc') can easily be justified.
-- Terry Jan Reedy

On 11 Dec 2014 15:14, "Donald Stufft" <donald@stufft.io> wrote:
This information is a few months old mostly because I’m lazy and creating
the information is a pain in the ass.
http://d.stufft.io/image/0B233A151k1k
Neat data, thanks for sharing! I do wonder how meaningful it is, though, because my impression is that PyPI download numbers are overwhelmingly driven by automated test and deployment systems (e.g. Travis-CI) that end up downloading the same dependencies dozens of times a day. Among other things this would explain how it could be that Linux downloads appear to outnumber Windows downloads by an unbelievable factor of ~30x (!). This doesn't invalidate the numbers, of course, but it does mean they may only represent one specific slice of Python's userbase.

Another way to get a sense of py2 versus py3 usage is to look at download counts for version-specific wheels on non-Linux systems. Some quick playing with vanity suggests that lxml Windows downloads are about 10% py3 (even though the only py3 builds they offer are for 32-bit py3.2!), and numpy OS X downloads are about 19% py3. I don't know how representative these numbers are either, but they're dramatically higher than what you found. If someone's curious it might be worth trying this approach more systematically. -n
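One way the wheel-based approach could be tried more systematically is sketched below. It relies on the wheel file naming convention from PEP 427 (name-version[-build]-pythontag-abitag-platformtag.whl) and counts only wheels whose python tag names a single major version. The helper names, the package name examplepkg, and the download counts are invented for illustration; how per-file counts would actually be obtained is left open.

```python
from collections import Counter


def python_major(wheel_filename):
    """Classify a wheel as "py2", "py3", or "other" from its python tag."""
    if not wheel_filename.endswith(".whl"):
        raise ValueError("not a wheel: %r" % wheel_filename)
    parts = wheel_filename[:-len(".whl")].split("-")
    # name-version-pythontag-abitag-platformtag, with an optional build tag.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: %r" % wheel_filename)
    python_tag = parts[-3]
    majors = set()
    # A compressed tag such as "py2.py3" lists several interpreters.
    for tag in python_tag.split("."):
        digits = "".join(ch for ch in tag if ch.isdigit())
        if digits:
            majors.add(digits[0])
    if majors == {"2"}:
        return "py2"
    if majors == {"3"}:
        return "py3"
    return "other"  # universal wheels or tags without a version


def py3_share(downloads):
    """Percent of version-specific wheel downloads that are py3, or None."""
    totals = Counter()
    for filename, count in downloads:
        totals[python_major(filename)] += count
    specific = totals["py2"] + totals["py3"]
    if specific == 0:
        return None
    return 100.0 * totals["py3"] / specific


# Invented file names and counts, purely to exercise the functions.
sample = [
    ("examplepkg-1.0-cp27-none-win32.whl", 30),
    ("examplepkg-1.0-cp34-none-win32.whl", 10),
    ("examplepkg-1.0-py2.py3-none-any.whl", 500),
]
share = py3_share(sample)
print(share)
```

The universal wheel is deliberately ignored, since its download count says nothing about which interpreter requested it.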

Certainly. Also, people who use the pip download cache (currently off by default, will be on by default in the next version) will only trigger a download count for the first download of a particular file. So for something that is not Python specific, if someone ran pip install twice in a row, with 2.7 first and 3.4 second, _with_ the download cache on, it would only register as a single 2.7 download.

I have some other numbers that are specific to certain packages too. I don’t know what conclusion can be drawn from them, but here are those too:

Django: http://d.stufft.io/image/0Q3M2q1M070z
cryptography: http://d.stufft.io/image/2p0f1F1O3D3P
requests: http://d.stufft.io/image/2c2R2f043W10
pip: http://d.stufft.io/image/3l3M2d2U343C
Twisted: http://d.stufft.io/image/031u0x2d1A3v
cffi: http://d.stufft.io/image/2H0I2p1A0M2a

It might be possible to draw some conclusions about different “slices” by looking at these.

Unfortunately, computing this is currently a fairly intensive task. I have to load months' worth of raw logs into a PostgreSQL server, which needs to have a ton of RAM and some SSDs to make the process not take _forever_. That takes a few days normally when I do it (that’s not being super efficient, but it’s a fairly steady pace). The server it resides on is something like $1600 a month when I spin it up (though that doesn’t cost anything since it’s on an OSS Rackspace account), and querying takes 10-15 minutes or so.

Luckily, there is an effort under way to move this into Google’s BigQuery system and to do it on a daily basis. This will hopefully make it possible both to run whatever random one-off queries people like and to integrate with PyPI itself to put graphs like these on PyPI.

Bonus, because I have them lying around:

What kind of distribution is being downloaded: http://d.stufft.io/image/2F0N0f3V3E0V
What is actually downloading things: http://d.stufft.io/image/1o0D2g2D2N3D

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
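The download-cache effect described at the top of this message can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the cache is modelled as a set of file names shared by every interpreter on one machine, and the function name logged_downloads and the file name examplepkg-1.0.tar.gz are made up. It is not how pip implements its cache.

```python
def logged_downloads(installs, cache_enabled=True):
    """Return the (python_version, filename) pairs the index would log.

    `installs` lists the installs a single user performs, in order.
    With the cache on, a file already fetched is not requested again.
    """
    cache = set()
    log = []
    for python_version, filename in installs:
        if cache_enabled and filename in cache:
            continue  # served locally, so the index never sees this install
        log.append((python_version, filename))
        cache.add(filename)
    return log


# The same sdist installed under 2.7 first, then under 3.4.
installs = [
    ("2.7", "examplepkg-1.0.tar.gz"),
    ("3.4", "examplepkg-1.0.tar.gz"),
]
with_cache = logged_downloads(installs, cache_enabled=True)
without_cache = logged_downloads(installs, cache_enabled=False)
print(with_cache)
print(without_cache)
```

In this model the 3.4 install disappears from the log whenever the cache is on, which is the undercount for the second interpreter that the message warns about.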

On Dec 12, 2014 6:41 AM, "Nathaniel Smith" <njs@pobox.com> wrote:
On 11 Dec 2014 15:14, "Donald Stufft" <donald@stufft.io> wrote:
This information is a few months old mostly because I’m lazy and
creating the information is a pain in the ass.
Translating these numbers to actual usage in general is hard. The way in which Python and packages are distributed on different platforms is very different. How comfortable the average user is with using the command line vs. just relying on pre-built packages or installers is different.

Would it be possible to add an API, flag, or argument to PyPI that lets automated services like Travis and py2pack identify themselves as not being ordinary downloads? Of course this would depend on the services making use of it, but they seem to be trying to be good members of the ecosystem, so I would like to think they would.

On Fri, 12 Dec 2014 10:48:05 +0100 Todd <toddrjen@gmail.com> wrote:
Those services (or at least Travis) usually invoke hand-written scripts, so this would rely on every developer being a "good citizen" and updating their scripts to use that option. Perhaps filtering by source IP would work better, if you could identify the IPs used by Travis VMs and the like. Regards Antoine.
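A rough sketch of the source-IP filtering idea, using the standard library's ipaddress module. The record layout and the function names are assumptions, and the networks shown are reserved documentation ranges rather than real Travis addresses; whether CI providers publish stable address ranges is not something this thread establishes.

```python
import ipaddress


def is_ci_address(ip, ci_networks):
    """True if `ip` falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ci_networks)


def drop_ci_downloads(records, ci_cidrs):
    """Keep only the (source_ip, package) records not coming from CI ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in ci_cidrs]
    return [rec for rec in records if not is_ci_address(rec[0], networks)]


# Placeholder range (TEST-NET-1) standing in for a CI provider's network.
ci_cidrs = ["192.0.2.0/24"]
records = [
    ("192.0.2.10", "requests"),   # inside the placeholder CI range
    ("203.0.113.5", "requests"),  # outside it
]
kept = drop_ci_downloads(records, ci_cidrs)
print(kept)
```

Compared with a self-identification flag, this needs no cooperation from script authors, but it is only as good as the list of networks it is given.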

Participants (7):
- Antoine Pitrou
- Donald Stufft
- Hernan Grecco
- Luca Sbardella
- Nathaniel Smith
- Terry Reedy
- Todd