<div dir="ltr"><div>Do take a look in the fairly recent archives of this list for a big discussion of groupby -- it kind of petered out but there were a couple options on the table.</div><div><br></div><div>-CHB</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski <<a href="mailto:klahnakoski@mozilla.com">klahnakoski@mozilla.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>


On 2019-02-10 18:30, Steven D'Aprano wrote:<br>


><br>


> Can you post a simplified example of how you would do it in SQL, <br>


> compared to what you would have to do in standard Python?<br>


<br>


Can I do the same in standard Python? If I did, then I would use Pandas:<br>


it has groupby, and some primitive joining, and window functions may<br>


come naturally because of its imperative nature, but I have not tried<br>


it.  If I can not use Pandas, then I would write the groupby and window<br>


functions and call them in sequence. This is similar to what you see in<br>


my code now: a number of properties who's values get dispatched to<br>


Python functions.  My code is more complicated only because those<br>


structures can be dispatched to translators for databases too.<br>


<br>


I am certain there are many variations of groupby out in the wild, and<br>


it would be nice to have the concept standardized when/if Python has<br>


vector operations. Join would be nice to have too, but I do not use it<br>


much; dictionary lookup seems to fill that need.  Window functions<br>


(which are like mini queries) are powerful, but like Pandas, may end up<br>


end up being free because Python is imperative.<br>


<br>


My code I pointed to has two parts. Here is the first part in SQL (well,<br>


an approximation of SQL since I did not test this, and now I am rusty). <br>


A detailed description is below<br>


<br>


|   WITH time_range AS (<br>


|       SELECT<br>


|           num<br>


|       FROM<br>


|           all_integers<br>


|       WHERE<br>


|           num % 60 =0 AND<br>


|           num >= floor(<<now>>/60/60)*60*60-<<start_Of_history>> AND<br>


|           num < floor(<<now>>/60/60) + 60*60<br>


|   )<br>


|   SELECT <br>


|       availability_zone,<br>


|       instance_type,<br>


|       time_range.num AS time<br>


|       MAX(price) as PRICE,<br>


|       COUNT(1) AS `COUNT`,<br>


|       LAST(current_price) OVER (<br>


|           PARTITION BY<br>


|               availability_zone,<br>


|               instance_type<br>


|           ORDER BY<br>


|               timestamp<br>


|       ) AS current_price<br>


|   FROM<br>


|       (<br>


|           SELECT<br>


|               *,<br>


|               COALESCE(LAG(timestampvalue, 1), <<end_of_day>>) OVER (<br>


|                   PARTITION BY<br>


|                       availability_zone,<br>


|                       instance_type<br>


|                   ORDER BY<br>


|                       timestamp<br>


|               ) AS expire,<br>


|               timestamp-<<expected_uptime>> AS effective<br>


|           FROM<br>


|               prices<br>


|       ) temp<br>


|   RIGHT JOIN<br>


|       time_range ON time_range.num BETWEEN temp.effective AND temp.expire<br>


|   GROUP BY<br>


|       availability_zone,<br>


|       instance_type,<br>


|       time_range.num AS time<br>


|   WHERE<br>


|       expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>><br>


<br>


<br>


Now, for the same, with description:<br>


<br>


This WITH clause is not real SQL; it is meant to stand in for a<br>


temporary table that contains all hours of the time range I am<br>


interested. Definitely easier to do in Python. All time is assumed to be<br>


in seconds since epoch.<br>


<br>


|   WITH time_range AS (<br>


|       SELECT<br>


|           num<br>


|       FROM<br>


|           all_integers<br>


|       WHERE<br>


|           num % 60 =0 AND<br>


|           num >= floor(<<now>>/60/60)*60*60-<<start_of_history>> AND<br>


|           num < floor(<<now>>/60/60) + 60*60<br>


|   )<br>


<br>


We will select the three dimensions we are interested in (see GROUP BY<br>


below), along with the MAX price we have seen in the given hour, and the<br>


current_price for any (availability_zone, instance_type) pair.<br>


<br>


|   SELECT <br>


|       availability_zone,<br>


|       instance_type,<br>


|       time_range.num AS time<br>


|       MAX(price) as PRICE,<br>


|       COUNT(1) AS `COUNT`,<br>


|       LAST(current_price) OVER (<br>


|           PARTITION BY<br>


|               availability_zone,<br>


|               instance_type<br>


|           ORDER BY<br>


|               timestamp<br>


|       ) AS current_price<br>


|   FROM<br>


<br>


The prices coming from Amazon only have a timestamp for when that price<br>


is effective; so this sub-query adds an `effective` start time, and an<br>


`expire` time so the rest of the query need only deal with ranges. The <br>


timestamp-<<expected_uptime>> is putting the start time back further<br>


into the past so the past can "see" future pricing. <br>


<br>


|       (<br>


|           SELECT<br>


|               *,<br>


|               COALESCE(LAG(timestamp, 1), <<end_of_day>>) OVER (<br>


|                   PARTITION BY<br>


|                       availability_zone,<br>


|                       instance_type<br>


|                   ORDER BY<br>


|                       timestamp<br>


|               ) AS expire,<br>


|               timestamp-<<expected_uptime>> AS effective<br>


|           FROM<br>


|               prices<br>


|       ) temp<br>


<br>


This is the point where we use the time_range from above and find every<br>


hour a price is effective.  This could have been a sub-query, but I am<br>


rusty at SQL<br>


<br>


|   RIGHT JOIN<br>


|       time_range ON time_range.num BETWEEN temp.effective AND temp.expire<br>


<br>


These are the three dimensions we are interested in<br>


<br>


|   GROUP BY<br>


|       availability_zone,<br>


|       instance_type,<br>


|       time_range.num AS time<br>


<br>


and we are only interested in calculating back to a certain point<br>


<br>


|   WHERE<br>


|       expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>><br>


<br>


<br>


<br>


<br>


<br>


<br>


_______________________________________________<br>


Python-ideas mailing list<br>


<a href="mailto:Python-ideas@python.org" target="_blank">Python-ideas@python.org</a><br>


<a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>


Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofconduct/</a><br>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature">Christopher Barker, PhD<br><br> Python Language Consulting<br>  - Teaching<br>  - Scientific Software Development<br>  - Desktop GUI and Web Development<br>  - wxPython, numpy, scipy, Cython<br></div>