[Distutils] Publicly Queryable Statistics

Wes Turner wes.turner at gmail.com
Sun May 22 03:39:51 EDT 2016


- to query, say, a month's worth of data, what would need to be done?
- "sharded by day" ... UTC?
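
Since the tables described below are sharded by (presumably UTC) day, querying a month's worth of data means referencing one table per day, or using BigQuery legacy SQL's TABLE_DATE_RANGE function. A minimal, purely illustrative Python sketch of the naming scheme (`shard_tables` is a hypothetical helper, not part of any PyPI tooling):

```python
from datetime import date, timedelta

def shard_tables(start, end):
    """Yield one day-sharded table name per UTC day in [start, end].

    Follows the public naming pattern the-psf:pypi.downloadsYYYYMMDD
    described in the message below; illustrative only.
    """
    day = start
    while day <= end:
        yield "the-psf:pypi.downloads{:%Y%m%d}".format(day)
        day += timedelta(days=1)

# Querying all of May 2016 would touch 31 day-sharded tables:
may_tables = list(shard_tables(date(2016, 5, 1), date(2016, 5, 31)))
```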

On Saturday, May 21, 2016, Donald Stufft <donald at stufft.io> wrote:

> Hey,
>
> One thing I’ve been working on as part of Warehouse, is a subproject that
> I call “Linehaul”. This is essentially a little statistics daemon that will
> take specially formatted syslog messages coming off of Fastly and shove
> them inside of a BigQuery database. I’m happy to report that I’ve just
> finished the production deployment of this and we’re now sending every
> download event that hits Fastly into BigQuery.
>
> First off, I’d like to thank Felipe Hoffa, Will Curran, and Preston Holmes
> over at Google for helping me get credits for this sorted out so that we
> can actually get this going! They’ve been a big help.
>
> So onto what this means.
>
> Basically, BigQuery gives us the ability to relatively quickly (typically
> < 60s) query very large datasets using something that is very similar to
> SQL. Unlike typical time-series databases, we don’t have to know ahead of
> time what we want to query on; we can just insert rows into a table (our
> tables are sharded by day) and then use the SQL-like query language to
> run any sort of query we like.
>
> On top of all of this, BigQuery gives me the ability to share the dataset
> publicly with anyone who is logged into a Google account, which means that
> *anyone* can query this data and look for any sort of interesting
> information they can find in it. The cost of any queries you run will be
> billed to your own account (though I believe the first 1 TB of data you
> query each month is free, and you aren’t charged for queries that error
> out or return cached results).
>
> Anyways, you can query these BigQuery tables whose names match the pattern
> `the-psf:pypi.downloadsYYYYMMDD` and you can see whatever data you want.
>
> We’ve only just started recording the data in this spot, so right now there
> isn’t a whole lot of data available (but it’s constantly streaming in).
> Once the 22nd rolls over, I’m going to delete the data we currently have
> for the 21st and then start backfilling historical data, beginning with
> the 21st and working backwards. You should be able to run queries in a
> web UI by navigating to https://bigquery.cloud.google.com/dataset/the-psf:pypi
> (you might have to accept a Terms of Service).
>
> The table schema looks like this: https://s.caremad.io/lPpTF6rxWZ/ but it
> should also be visible on the BigQuery page. Some example queries you
> might want to run are located at
> https://gist.github.com/alex/4f100a9592b05e9b4d63 (but note that those
> currently use the *old*, not publicly available table name; you’ll need to
> replace [long-stack-762:pypi.downloads] with [the-psf:pypi.downloads]).
> If you want to write your own queries, you should be able to find the
> syntax here: https://cloud.google.com/bigquery/query-reference
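
As a hedged illustration only, a month-long query over the public tables might be assembled like this. It assumes the schema's `file.project` field and legacy SQL's TABLE_DATE_RANGE over the `the-psf:pypi.downloads` shard prefix; it is a sketch, not a tested recipe:

```python
# Build a legacy-SQL query string over the public day-sharded tables.
# The file.project column is an assumption based on the linked schema.
def monthly_top_downloads(first_day, last_day, limit=20):
    return """
SELECT file.project, COUNT(*) AS downloads
FROM TABLE_DATE_RANGE([the-psf:pypi.downloads],
                      TIMESTAMP('{0}'), TIMESTAMP('{1}'))
GROUP BY file.project
ORDER BY downloads DESC
LIMIT {2}
""".format(first_day, last_day, limit)

query = monthly_top_downloads('2016-05-01', '2016-05-31')
```

Pasting such a query into the web UI (with legacy SQL enabled) should count downloads per project for the chosen date range, billed against your own account as described above.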
>
> Anyways, new data should constantly be streaming in, and I should be able
> to backfill data all the way to Jan of 2014 or so. Hopefully this is useful
> to folks, and if you find any interesting queries or numbers, please share
> them!
>
> Donald Stufft
>
>
>
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> https://mail.python.org/mailman/listinfo/distutils-sig
>

