[Python-ideas] [Python-Dev] Python 2.x and 3.x use survey, 2014 edition

Donald Stufft donald at stufft.io
Fri Dec 12 06:54:38 CET 2014

> On Dec 12, 2014, at 12:41 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On 11 Dec 2014 15:14, "Donald Stufft" <donald at stufft.io> wrote:
> >
> >
> > This information is a few months old mostly because I’m lazy and creating the information is a pain in the ass.
> >
> > Total Downloads (For reference): http://d.stufft.io/image/2N293l3v2S1c
> > % Downloads for Python Version: http://d.stufft.io/image/2g1T2U140h1O
> > % Downloads for Python Version (Zoomed to Py3): http://d.stufft.io/image/0B233A151k1k
> > Total Downloads for Python Version: http://d.stufft.io/image/3f3f3g3P181M
> > Bonus - OS Downloads: http://d.stufft.io/image/021v383I0O2c
> >
> > All of the above filter out anything that has an extremely small number of downloads so as not to overwhelm the graphs with a ton of small barely used things.
> Neat data, thanks for sharing!
> I do wonder how meaningful it is, though, because my impression is that PyPI download numbers are overwhelmingly driven by automated test and deployment systems (e.g. Travis-CI) that end up downloading the same dependencies dozens of times a day. Among other things this would explain how it could be that Linux downloads appear to outnumber Windows downloads by an unbelievable factor of ~30x (!). This doesn't invalidate the numbers, of course, but it does mean they may only represent one specific slice of Python's userbase.
> Another way to get a sense of py2 versus py3 usage is to look at download counts for version-specific wheels on non-Linux systems. Some quick playing with vanity suggests that lxml Windows downloads are about 10% py3 (even though the only py3 builds they offer are for 32-bit py3.2!), and numpy OS X downloads are about 19% py3. I don't know how representative these numbers are either, but they're dramatically higher than what you found. If someone's curious it might be worth trying this approach more systematically.
> -n
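
A rough sketch of that wheel-based approach, for anyone who wants to try it. It reads the Python tag out of each wheel filename (the third field from the end) and computes the py3 share over version-specific wheels only. The filenames and counts are made up for illustration, and py3_share and python_majors are hypothetical helper names, not part of vanity or any other tool.

```python
from collections import Counter

def python_majors(wheel_filename):
    """Return the set of Python major versions named in a wheel's Python tag."""
    stem = wheel_filename[: -len(".whl")]
    python_tag = stem.split("-")[-3]      # e.g. "cp27" or "py2.py3"
    # Each tag is a two-letter implementation code followed by the version.
    return {tag[2] for tag in python_tag.split(".")}

def py3_share(download_counts):
    """Fraction of downloads of version-specific wheels that target Python 3."""
    totals = Counter()
    for filename, count in download_counts.items():
        majors = python_majors(filename)
        if len(majors) == 1:              # skip universal (py2.py3) wheels
            totals[majors.pop()] += count
    versioned = totals["2"] + totals["3"]
    return totals["3"] / versioned if versioned else 0.0

# Made-up counts, purely illustrative.
sample = {
    "example-1.0-cp27-none-win32.whl": 90,
    "example-1.0-cp32-none-win32.whl": 10,
    "example-1.0-py2.py3-none-any.whl": 500,
}
share = py3_share(sample)
```

Universal wheels are skipped because a single download of one says nothing about which interpreter asked for it.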


Also, caching skews the counts. People who use the pip download cache (currently off by default, but on by default in the next version) only trigger a download count the first time a particular file is fetched. So for a file that is not specific to a Python version, if someone ran pip install twice in a row, with 2.7 first and 3.4 second, _with_ the download cache on, it would only register as a single 2.7 download.
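
To make the cache effect concrete, here is a toy model of what the index would see. It is only a sketch of the counting logic, not how pip or PyPI actually record anything; count_downloads and the filename are hypothetical.

```python
from collections import Counter

def count_downloads(installs, cache_enabled):
    """Count the downloads the index would see, per Python version.

    installs is an ordered list of (python_version, filename) pairs,
    all run by the same user on the same machine.
    """
    cached = set()
    counts = Counter()
    for python_version, filename in installs:
        if cache_enabled and filename in cached:
            continue                      # served locally, the index sees nothing
        counts[python_version] += 1
        cached.add(filename)
    return counts

# The same version-independent file installed under 2.7, then under 3.4.
installs = [("2.7", "example-1.0.tar.gz"), ("3.4", "example-1.0.tar.gz")]
without_cache = count_downloads(installs, cache_enabled=False)
with_cache = count_downloads(installs, cache_enabled=True)
```

With the cache off, both versions get one download each; with it on, only the 2.7 download is counted.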

I have some other numbers that are specific to certain packages too. I don’t know what conclusions can be drawn from them, but here they are:

Django: http://d.stufft.io/image/0Q3M2q1M070z
cryptography: http://d.stufft.io/image/2p0f1F1O3D3P
requests: http://d.stufft.io/image/2c2R2f043W10
pip: http://d.stufft.io/image/3l3M2d2U343C
Twisted: http://d.stufft.io/image/031u0x2d1A3v
cffi: http://d.stufft.io/image/2H0I2p1A0M2a

It might be possible to draw some conclusions about different “slices” by looking at these.

Unfortunately, computing this is currently a fairly intensive task. I have to load months' worth of raw logs into a PostgreSQL server, which needs a ton of RAM and some SSDs so that the process doesn’t take _forever_. Loading normally takes a few days when I do it (I’m not being super efficient about it, but it’s a fairly steady pace). The server it resides on costs something like $1600 a month when I spin it up (though I don’t actually pay anything, since it’s on an OSS Rackspace account), and querying takes 10-15 minutes or so.
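
For a feel of the kind of query involved, here is a toy version that uses an in-memory SQLite database as a stand-in for the PostgreSQL server. The table layout, column names, sample rows, and MIN_DOWNLOADS threshold are all assumptions for illustration, not the real schema or the real cutoff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per download parsed out of the raw logs.
conn.execute("CREATE TABLE downloads (package TEXT, python_version TEXT)")
rows = (
    [("example", "2.7")] * 6
    + [("example", "3.4")] * 3
    + [("example", "2.5")] * 1
)
conn.executemany("INSERT INTO downloads VALUES (?, ?)", rows)

MIN_DOWNLOADS = 2  # drop barely used versions so they don't clutter a graph

query = """
    SELECT python_version,
           COUNT(*) AS n,
           100.0 * COUNT(*) / (SELECT COUNT(*) FROM downloads) AS pct
    FROM downloads
    GROUP BY python_version
    HAVING COUNT(*) >= ?
    ORDER BY n DESC
"""
shares = conn.execute(query, (MIN_DOWNLOADS,)).fetchall()
```

The percentages are taken over all downloads, so the filtered-out versions still count toward the denominator.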

Luckily, there is an effort under way to move this into Google’s BigQuery system and to do it on a daily basis. This will hopefully make it possible both to run whatever random one-off queries people like and to integrate with PyPI itself to put graphs like these on PyPI.

Bonus, because I have them lying around:

What kind of distribution is being downloaded: http://d.stufft.io/image/2F0N0f3V3E0V
What is actually downloading things: http://d.stufft.io/image/1o0D2g2D2N3D

Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
