Hi,

Making download stats available through BigQuery seems like a good idea, but as it currently stands this seems a bit expensive on a user's end. For example, I just looked at stats for a package of mine and 840GB are reported to be processed (and therefore billed) for one query.

Is table clustering used when building these tables? If not, could we have it at least on the project name? (See the following about table clustering: https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f... )

Best,
Laurent
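For context, clustering is declared when a table is created. A minimal sketch of what that could look like, assuming a hypothetical destination table and a simplified schema (the real PyPI download tables have more columns, and BigQuery at the time required a partitioned table in order to cluster):

    -- Hypothetical example: rebuilding a download table clustered by project.
    -- Partition on the download date, then cluster within each partition on
    -- the project name so per-package filters scan far less data.
    CREATE TABLE `my-project.pypi.downloads_clustered`
    PARTITION BY DATE(timestamp)
    CLUSTER BY project
    AS
    SELECT timestamp, file.project AS project, details
    FROM `the-psf.pypi.downloads2017*`;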
Hi Laurent, and thanks for asking.

I re-clustered the tables; find my work-log and notes here:

- https://medium.com/@hoffa/python-pypi-stats-in-bigquery-reclustered-d80e583e...

If you use my tables, a query that used to process 200.88GB now scans only 9.65GB when filtering for a particular package: a 95% reduction! For example:

    SELECT
      TIMESTAMP_TRUNC(timestamp, WEEK) week
      , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
      , COUNT(*) downloads
    FROM `the-psf.pypi.downloads2017*`
    WHERE file.project = 'pyspark'
    GROUP BY week, python
    HAVING python != '3.6'
      AND week < '2017-12-30'
    ORDER BY week
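As a side note, you can check which columns a table is clustered on without paying for a full scan, by querying INFORMATION_SCHEMA. A sketch under assumed names (the table name 'downloads20171230' is hypothetical; clustering_ordinal_position is NULL for columns that are not part of the clustering key):

    -- Hypothetical example: list the clustering columns of one download table.
    SELECT column_name, clustering_ordinal_position
    FROM `the-psf.pypi.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'downloads20171230'
      AND clustering_ordinal_position IS NOT NULL
    ORDER BY clustering_ordinal_position;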
Bump. Is this the right list to post this kind of question (and implicit RFC)?

On Sun, Oct 21, 2018 at 10:55 AM, Laurent Gautier <lgautier@gmail.com> wrote:
> Hi,
>
> Making download stats available through BigQuery seems like a good idea, but as it currently stands this seems a bit expensive on a user's end. For example, I just looked at stats for a package of mine and 840GB are reported to be processed (and therefore billed) for one query.
>
> Is table clustering used when building these tables? If not, could we have it at least on the project name? (See the following about table clustering: https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f... )
>
> Best,
> Laurent