Publicly Queryable Statistics
Hey,

One thing I’ve been working on as part of Warehouse is a subproject that I call “Linehaul”. This is essentially a little statistics daemon that takes specially formatted syslog messages coming off of Fastly and shoves them into a BigQuery database. I’m happy to report that I’ve just finished the production deployment of this and we’re now sending every download event that hits Fastly into BigQuery.

First off, I’d like to thank Felipe Hoffa, Will Curran, and Preston Holmes over at Google for helping me get credits for this sorted out so that we could actually get this going! They’ve been a big help.

So onto what this means.

Basically, BigQuery gives us the ability to relatively quickly (typically < 60s) query very large datasets using something that is very similar to SQL. Unlike typical time series databases, we don’t have to know ahead of time what we want to query on; we can just insert data into rows in a table (and our tables are sharded by day), and then, using the SQL-like query language, you can run any sort of query you like.

On top of all of this, BigQuery gives me the ability to share the dataset publicly with anyone who is logged into a Google account, which means that *anyone* can query this data and look for any sort of interesting information they can find in it. The cost of any queries you run will be associated with your own account (but the first 1 TB of data you query each month is free, I believe, and you aren’t charged for queries that error out or return cached results).

Anyways, you can query the BigQuery tables whose names match the pattern `the-psf:pypi.downloadsYYYYMMDD` and see whatever data you want.

We’ve only just started recording the data in this spot, so right now there isn’t a whole lot of data available (but it’s constantly streaming in). Once the 22nd rolls over I’m going to delete what data we have available for the 21st, and then start backfilling historical data, starting with the 21st and going backwards. You should be able to run queries in a web UI by navigating to https://bigquery.cloud.google.com/dataset/the-psf:pypi (you might have to accept a Terms of Service).

The table schema looks like this: https://s.caremad.io/lPpTF6rxWZ/ but it should also be visible on the BigQuery page. Some example queries you might want to run are located at https://gist.github.com/alex/4f100a9592b05e9b4d63 (but note, that gist currently uses the *old*, not publicly available table name; you’ll need to replace [long-stack-762:pypi.downloads] with [the-psf:pypi.downloads]). If you want to write your own queries, you should be able to find the syntax here: https://cloud.google.com/bigquery/query-reference

Anyways, new data should constantly be streaming in, and I should be able to backfill data all the way to January 2014 or so. Hopefully this is useful to folks, and if you find any interesting queries or numbers, please share them!

— Donald Stufft
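To make that concrete, here is a minimal example query. This is a sketch rather than anything from the announcement itself: the table name follows the `the-psf:pypi.downloadsYYYYMMDD` pattern above (using the 21st, the first day recorded), and the field name file.project is assumed from the linked schema.

    -- Top 10 most-downloaded projects for a single day's shard.
    -- NOTE: file.project is assumed from the schema linked above.
    SELECT
      file.project AS project,
      COUNT(*) AS downloads
    FROM [the-psf:pypi.downloads20160521]
    GROUP BY project
    ORDER BY downloads DESC
    LIMIT 10

Since a query like this scans only a single day shard, it should stay well inside the free 1 TB monthly quota mentioned above.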
On May 21, 2016, at 2:21 PM, Donald Stufft wrote:

So onto what this means.
Oh, one additional tidbit of information: most of the data comes from parsing user agents, which means that the ability to get this information depends a lot on what is actually downloading the file. Generally most of the information is only available from pip (and newer pip versions added progressively more information). In cases where we didn’t know a value, we inserted (null) (such as bandersnatch mirroring, where we can’t know what the “python version” is). This means that you’ll often see (null) values in fields, which just represent downloads made with a client where we couldn’t determine that information. — Donald Stufft
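As a concrete illustration, a breakdown of downloads by installer should surface those (null) rows directly. This is a sketch under the same assumptions as the earlier example: details.installer.name is taken from the linked schema, not from this message.

    -- Downloads per installer for one day's shard. NULL rows are
    -- clients (e.g. bandersnatch or plain browsers) whose user
    -- agent didn't carry that information.
    SELECT
      details.installer.name AS installer,
      COUNT(*) AS downloads
    FROM [the-psf:pypi.downloads20160521]
    GROUP BY installer
    ORDER BY downloads DESC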
- to query, say, a month's worth of data, what would need to be done?
- "sharded by day" ... UTC?
On May 22, 2016, at 3:39 AM, Wes Turner wrote:
- to query, say, a month's worth of data, what would need to be done?
- "sharded by day" ... UTC?

You use the TABLE_DATE_RANGE() function, like this:

TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP("20160114"), TIMESTAMP("20160214"))

Or, if you wanted to get fancier, you could do something like this for the “last 30 days”:

TABLE_DATE_RANGE([the-psf:pypi.downloads], DATE_ADD(CURRENT_TIMESTAMP(), -1, "month"), CURRENT_TIMESTAMP())

You can see examples of it in use at https://gist.github.com/alex/4f100a9592b05e9b4d63 or see the query docs at https://cloud.google.com/bigquery/query-reference.

— Donald Stufft
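For completeness, TABLE_DATE_RANGE() goes in the FROM clause of an ordinary query. Here is a minimal sketch combining it with the same schema assumptions as the earlier examples (file.project comes from the linked schema; the dashed date strings are the canonical legacy SQL TIMESTAMP format):

    -- Downloads per project across a month of day-sharded tables.
    SELECT
      file.project AS project,
      COUNT(*) AS downloads
    FROM TABLE_DATE_RANGE([the-psf:pypi.downloads],
                          TIMESTAMP("2016-01-14"),
                          TIMESTAMP("2016-02-14"))
    GROUP BY project
    ORDER BY downloads DESC
    LIMIT 20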
thanks!
On 22 May 2016 at 04:21, Donald Stufft wrote:
First off, I’d like to thank Felipe Hoffa, Will Curran, and Preston Holmes over at Google for helping me get credits for this sorted out so that we can actually get this going! They’ve been a big help.
Great work, folks! It will be interesting to see what kinds of metrics and dashboards folks build on this over time :)

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (3)
- Donald Stufft
- Nick Coghlan
- Wes Turner