[Distutils] Publicly Queryable Statistics

Donald Stufft donald at stufft.io
Sat May 21 14:21:04 EDT 2016


One thing I’ve been working on as part of Warehouse, is a subproject that I call “Linehaul”. This is essentially a little statistics daemon that will take specially formatted syslog messages coming off of Fastly and shove them inside of a BigQuery database. I’m happy to report that I’ve just finished the production deployment of this and we’re now sending every download event that hits Fastly into BigQuery.

First off, I’d like to thank Felipe Hoffa, Will Curran, and Preston Holmes over at Google for helping me get credits for this sorted out so that we can actually get this going! They’ve been a big help.

So onto what this means.

Basically, BigQuery gives us the ability to relatively quickly (typically < 60s) query very large datasets using something that is very similar to SQL. Unlike typical time-series databases, we don’t have to know ahead of time what we want to query on; we can just insert rows into a table (our tables are sharded by day) and then, using the SQL-like query language, run any sort of query we like.
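Because the tables are sharded by day, a query over a date range can use BigQuery’s legacy-SQL `TABLE_DATE_RANGE` function to select the daily shards. As a rough sketch (the date range is illustrative, and I’m assuming a `timestamp` column here), counting downloads per day across a week would look something like:

```sql
-- Count downloads per day across a week of daily shards.
-- TABLE_DATE_RANGE expands to downloads20160521 .. downloads20160527.
SELECT
  DATE(timestamp) AS day,
  COUNT(*) AS downloads
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    TIMESTAMP('2016-05-21'),
    TIMESTAMP('2016-05-27')
  )
GROUP BY day
ORDER BY day
```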

On top of all of this, BigQuery gives me the ability to share the dataset publicly with anyone who is logged into a Google account, which means that *anyone* can query this data and look for any sort of interesting information they can find in it. The cost of any queries you run will be associated with your own account (though I believe the first 1TB of data you query each month is free, and you aren’t charged for queries that error out or return cached results).

Anyways, you can query these BigQuery tables whose names match the pattern `the-psf:pypi.downloadsYYYYMMDD` and you can see whatever data you want.
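For instance, a minimal query against a single day’s shard (assuming the `file.project` column from the schema linked below) might look like:

```sql
-- Top 20 most-downloaded projects for a single day's shard.
SELECT
  file.project,
  COUNT(*) AS downloads
FROM [the-psf:pypi.downloads20160521]
GROUP BY file.project
ORDER BY downloads DESC
LIMIT 20
```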

We’ve only just started recording the data in this spot, so right now there isn’t a whole lot of data available (but it’s constantly streaming in). Once the 22nd rolls over, I’m going to delete the data we have available for the 21st, and then start backfilling historical data, starting with the 21st and going backwards. You should be able to run queries in a Web UI by navigating to https://bigquery.cloud.google.com/dataset/the-psf:pypi (you might have to accept a Terms of Service).

The table schema looks like: https://s.caremad.io/lPpTF6rxWZ/ but it should also be visible on the BigQuery page. Some example queries you might want to run are located at https://gist.github.com/alex/4f100a9592b05e9b4d63 (but note that those currently use the *old*, not publicly available table name; you’ll need to replace [long-stack-762:pypi.downloads] with [the-psf:pypi.downloads]).
If you want to write your own queries, you should be able to find the syntax here: https://cloud.google.com/bigquery/query-reference
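As one more sketch of that syntax, a query breaking downloads down by installer (assuming the `details.installer.name` field from the schema above) might look like:

```sql
-- Downloads per installer client (pip, etc.) for one day's shard.
SELECT
  details.installer.name AS installer,
  COUNT(*) AS downloads
FROM [the-psf:pypi.downloads20160521]
GROUP BY installer
ORDER BY downloads DESC
```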

Anyways, new data should constantly be streaming in, and I should be able to backfill data all the way to Jan of 2014 or so. Hopefully this is useful to folks, and if you find any interesting queries or numbers, please share them!

Donald Stufft
