Do take a look in the fairly recent archives of this list for a big
discussion of groupby -- it kind of petered out but there were a couple
options on the table.
-CHB
On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski
On 2019-02-10 18:30, Steven D'Aprano wrote:
Can you post a simplified example of how you would do it in SQL, compared to what you would have to do in standard Python?
Can I do the same in standard Python? If I did, then I would use Pandas: it has groupby, and some primitive joining, and window functions may come naturally because of its imperative nature, but I have not tried it. If I can not use Pandas, then I would write the groupby and window functions and call them in sequence. This is similar to what you see in my code now: a number of properties who's values get dispatched to Python functions. My code is more complicated only because those structures can be dispatched to translators for databases too.
I am certain there are many variations of groupby out in the wild, and it would be nice to have the concept standardized when/if Python has vector operations. Join would be nice to have too, but I do not use it much; dictionary lookup seems to fill that need. Window functions (which are like mini queries) are powerful, but like Pandas, may end up end up being free because Python is imperative.
My code I pointed to has two parts. Here is the first part in SQL (well, an approximation of SQL since I did not test this, and now I am rusty). A detailed description is below
| WITH time_range AS ( | SELECT | num | FROM | all_integers | WHERE | num % 60 =0 AND | num >= floor(<<now>>/60/60)*60*60-<
> AND | num < floor(<<now>>/60/60) + 60*60 | ) | SELECT | availability_zone, | instance_type, | time_range.num AS time | MAX(price) as PRICE, | COUNT(1) AS `COUNT`, | LAST(current_price) OVER ( | PARTITION BY | availability_zone, | instance_type | ORDER BY | timestamp | ) AS current_price | FROM | ( | SELECT | *, | COALESCE(LAG(timestampvalue, 1), < >) OVER ( | PARTITION BY | availability_zone, | instance_type | ORDER BY | timestamp | ) AS expire, | timestamp-< > AS effective | FROM | prices | ) temp | RIGHT JOIN | time_range ON time_range.num BETWEEN temp.effective AND temp.expire | GROUP BY | availability_zone, | instance_type, | time_range.num AS time | WHERE | expire > floor(<<now>>/60/60)*60*60 - < > Now, for the same, with description:
This WITH clause is not real SQL; it is meant to stand in for a temporary table that contains all hours of the time range I am interested. Definitely easier to do in Python. All time is assumed to be in seconds since epoch.
| WITH time_range AS ( | SELECT | num | FROM | all_integers | WHERE | num % 60 =0 AND | num >= floor(<<now>>/60/60)*60*60-<
> AND | num < floor(<<now>>/60/60) + 60*60 | ) We will select the three dimensions we are interested in (see GROUP BY below), along with the MAX price we have seen in the given hour, and the current_price for any (availability_zone, instance_type) pair.
| SELECT | availability_zone, | instance_type, | time_range.num AS time | MAX(price) as PRICE, | COUNT(1) AS `COUNT`, | LAST(current_price) OVER ( | PARTITION BY | availability_zone, | instance_type | ORDER BY | timestamp | ) AS current_price | FROM
The prices coming from Amazon only have a timestamp for when that price is effective; so this sub-query adds an `effective` start time, and an `expire` time so the rest of the query need only deal with ranges. The timestamp-<
> is putting the start time back further into the past so the past can "see" future pricing. | ( | SELECT | *, | COALESCE(LAG(timestamp, 1), <
>) OVER ( | PARTITION BY | availability_zone, | instance_type | ORDER BY | timestamp | ) AS expire, | timestamp-< > AS effective | FROM | prices | ) temp This is the point where we use the time_range from above and find every hour a price is effective. This could have been a sub-query, but I am rusty at SQL
| RIGHT JOIN | time_range ON time_range.num BETWEEN temp.effective AND temp.expire
These are the three dimensions we are interested in
| GROUP BY | availability_zone, | instance_type, | time_range.num AS time
and we are only interested in calculating back to a certain point
| WHERE | expire > floor(<<now>>/60/60)*60*60 - <
> _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython