
Hi,
Making download stats available through BigQuery seems like a good idea, but as it currently stands it is rather expensive on the user's end. For example, I just looked at stats for a package of mine, and 840GB were reported as processed (and therefore billed) for a single query.
Is table clustering used when building these tables? If not, could we have it at least on the project name? (See the following about table clustering: https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f... .)
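For illustration, here is roughly what declaring clustering looks like at table-creation time. This is only a sketch: the destination project/dataset/table names below are placeholders, not the actual PyPI setup.

-- Sketch: build a copy of the download data, partitioned by day and
-- clustered on the project column, so per-project filters scan less data.
-- `my-project.pypi_stats.downloads_clustered` is a placeholder name.
CREATE TABLE `my-project.pypi_stats.downloads_clustered`
PARTITION BY DATE(timestamp)
CLUSTER BY project
AS
SELECT
  timestamp,
  file.project AS project,
  details
FROM
  `the-psf.pypi.downloads2017*`;

With a layout like that, a WHERE filter on project should only read the blocks that contain the requested project, rather than the whole table.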
Best,
Laurent

Hi Laurent, and thanks for asking.
I re-clustered the tables - find my work-log and notes here:
- https://medium.com/@hoffa/python-pypi-stats-in-bigquery-reclustered-d80e583e...
If you use my tables, a query that used to process 200.88GB now scans only 9.65GB when filtering for a particular package - a 95% reduction!
For example:
SELECT
  TIMESTAMP_TRUNC(timestamp, WEEK) week,
  REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python,
  COUNT(*) downloads
FROM `the-psf.pypi.downloads2017*`
WHERE file.project = 'pyspark'
GROUP BY week, python
HAVING python != '3.6' AND week < '2017-12-30'
ORDER BY week
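By the way, if you want to see how many bytes a query will scan before running it (and being billed), you can do a dry run - for example with the bq command-line tool, assuming you have it set up:

# Dry run: reports the bytes the query would process, without executing it
bq query --use_legacy_sql=false --dry_run \
'SELECT COUNT(*) FROM `the-psf.pypi.downloads2017*`
 WHERE file.project = "pyspark"'

Running the same dry run against the clustered and unclustered tables is an easy way to verify the reduction for yourself.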

Bump.
Is this the right list for this kind of question (and implicit RFC)?
On Sun, Oct 21, 2018 at 10:55, Laurent Gautier <lgautier@gmail.com> wrote:
> Hi,
> Making download stats available through BigQuery seems like a good idea, but as it currently stands it is rather expensive on the user's end. For example, I just looked at stats for a package of mine, and 840GB were reported as processed (and therefore billed) for a single query.
> Is table clustering used when building these tables? If not, could we have it at least on the project name? (See the following about table clustering: https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f... .)
> Best,
> Laurent