<div dir="ltr"><div>Do take a look in the fairly recent archives of this list for a big discussion of groupby -- it kind of petered out but there were a couple options on the table.</div><div><br></div><div>-CHB</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski <<a href="mailto:klahnakoski@mozilla.com">klahnakoski@mozilla.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
On 2019-02-10 18:30, Steven D'Aprano wrote:<br>
><br>
> Can you post a simplified example of how you would do it in SQL, <br>
> compared to what you would have to do in standard Python?<br>
<br>
Can I do the same in standard Python? If I did, then I would use Pandas:<br>
it has groupby, and some primitive joining, and window functions may<br>
come naturally because of its imperative nature, but I have not tried<br>
it. If I can not use Pandas, then I would write the groupby and window<br>
functions and call them in sequence. This is similar to what you see in<br>
my code now: a number of properties who's values get dispatched to<br>
Python functions. My code is more complicated only because those<br>
structures can be dispatched to translators for databases too.<br>
<br>
I am certain there are many variations of groupby out in the wild, and<br>
it would be nice to have the concept standardized when/if Python has<br>
vector operations. Join would be nice to have too, but I do not use it<br>
much; dictionary lookup seems to fill that need. Window functions<br>
(which are like mini queries) are powerful, but like Pandas, may end up<br>
end up being free because Python is imperative.<br>
<br>
My code I pointed to has two parts. Here is the first part in SQL (well,<br>
an approximation of SQL since I did not test this, and now I am rusty). <br>
A detailed description is below<br>
<br>
| WITH time_range AS (<br>
| SELECT<br>
| num<br>
| FROM<br>
| all_integers<br>
| WHERE<br>
| num % 60 =0 AND<br>
| num >= floor(<<now>>/60/60)*60*60-<<start_Of_history>> AND<br>
| num < floor(<<now>>/60/60) + 60*60<br>
| )<br>
| SELECT <br>
| availability_zone,<br>
| instance_type,<br>
| time_range.num AS time<br>
| MAX(price) as PRICE,<br>
| COUNT(1) AS `COUNT`,<br>
| LAST(current_price) OVER (<br>
| PARTITION BY<br>
| availability_zone,<br>
| instance_type<br>
| ORDER BY<br>
| timestamp<br>
| ) AS current_price<br>
| FROM<br>
| (<br>
| SELECT<br>
| *,<br>
| COALESCE(LAG(timestampvalue, 1), <<end_of_day>>) OVER (<br>
| PARTITION BY<br>
| availability_zone,<br>
| instance_type<br>
| ORDER BY<br>
| timestamp<br>
| ) AS expire,<br>
| timestamp-<<expected_uptime>> AS effective<br>
| FROM<br>
| prices<br>
| ) temp<br>
| RIGHT JOIN<br>
| time_range ON time_range.num BETWEEN temp.effective AND temp.expire<br>
| GROUP BY<br>
| availability_zone,<br>
| instance_type,<br>
| time_range.num AS time<br>
| WHERE<br>
| expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>><br>
<br>
<br>
Now, for the same, with description:<br>
<br>
This WITH clause is not real SQL; it is meant to stand in for a<br>
temporary table that contains all hours of the time range I am<br>
interested. Definitely easier to do in Python. All time is assumed to be<br>
in seconds since epoch.<br>
<br>
| WITH time_range AS (<br>
| SELECT<br>
| num<br>
| FROM<br>
| all_integers<br>
| WHERE<br>
| num % 60 =0 AND<br>
| num >= floor(<<now>>/60/60)*60*60-<<start_of_history>> AND<br>
| num < floor(<<now>>/60/60) + 60*60<br>
| )<br>
<br>
We will select the three dimensions we are interested in (see GROUP BY<br>
below), along with the MAX price we have seen in the given hour, and the<br>
current_price for any (availability_zone, instance_type) pair.<br>
<br>
| SELECT <br>
| availability_zone,<br>
| instance_type,<br>
| time_range.num AS time<br>
| MAX(price) as PRICE,<br>
| COUNT(1) AS `COUNT`,<br>
| LAST(current_price) OVER (<br>
| PARTITION BY<br>
| availability_zone,<br>
| instance_type<br>
| ORDER BY<br>
| timestamp<br>
| ) AS current_price<br>
| FROM<br>
<br>
The prices coming from Amazon only have a timestamp for when that price<br>
is effective; so this sub-query adds an `effective` start time, and an<br>
`expire` time so the rest of the query need only deal with ranges. The <br>
timestamp-<<expected_uptime>> is putting the start time back further<br>
into the past so the past can "see" future pricing. <br>
<br>
| (<br>
| SELECT<br>
| *,<br>
| COALESCE(LAG(timestamp, 1), <<end_of_day>>) OVER (<br>
| PARTITION BY<br>
| availability_zone,<br>
| instance_type<br>
| ORDER BY<br>
| timestamp<br>
| ) AS expire,<br>
| timestamp-<<expected_uptime>> AS effective<br>
| FROM<br>
| prices<br>
| ) temp<br>
<br>
This is the point where we use the time_range from above and find every<br>
hour a price is effective. This could have been a sub-query, but I am<br>
rusty at SQL<br>
<br>
| RIGHT JOIN<br>
| time_range ON time_range.num BETWEEN temp.effective AND temp.expire<br>
<br>
These are the three dimensions we are interested in<br>
<br>
| GROUP BY<br>
| availability_zone,<br>
| instance_type,<br>
| time_range.num AS time<br>
<br>
and we are only interested in calculating back to a certain point<br>
<br>
| WHERE<br>
| expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>><br>
<br>
<br>
<br>
<br>
<br>
<br>
_______________________________________________<br>
Python-ideas mailing list<br>
<a href="mailto:Python-ideas@python.org" target="_blank">Python-ideas@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>
Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofconduct/</a><br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature">Christopher Barker, PhD<br><br> Python Language Consulting<br> - Teaching<br> - Scientific Software Development<br> - Desktop GUI and Web Development<br> - wxPython, numpy, scipy, Cython<br></div>