Do take a look in the fairly recent archives of this list for a big discussion of groupby -- it kind of petered out but there were a couple options on the table.

-CHB



On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski <klahnakoski@mozilla.com> wrote:

On 2019-02-10 18:30, Steven D'Aprano wrote:
>
> Can you post a simplified example of how you would do it in SQL,
> compared to what you would have to do in standard Python?

Can I do the same in standard Python? If I did, then I would use Pandas:
it has groupby, and some primitive joining, and window functions may
come naturally because of its imperative nature, but I have not tried
it.  If I can not use Pandas, then I would write the groupby and window
functions and call them in sequence. This is similar to what you see in
my code now: a number of properties who's values get dispatched to
Python functions.  My code is more complicated only because those
structures can be dispatched to translators for databases too.

I am certain there are many variations of groupby out in the wild, and
it would be nice to have the concept standardized when/if Python has
vector operations. Join would be nice to have too, but I do not use it
much; dictionary lookup seems to fill that need.  Window functions
(which are like mini queries) are powerful, but like Pandas, may end up
end up being free because Python is imperative.

My code I pointed to has two parts. Here is the first part in SQL (well,
an approximation of SQL since I did not test this, and now I am rusty). 
A detailed description is below

|   WITH time_range AS (
|       SELECT
|           num
|       FROM
|           all_integers
|       WHERE
|           num % 60 =0 AND
|           num >= floor(<<now>>/60/60)*60*60-<<start_Of_history>> AND
|           num < floor(<<now>>/60/60) + 60*60
|   )
|   SELECT 
|       availability_zone,
|       instance_type,
|       time_range.num AS time
|       MAX(price) as PRICE,
|       COUNT(1) AS `COUNT`,
|       LAST(current_price) OVER (
|           PARTITION BY
|               availability_zone,
|               instance_type
|           ORDER BY
|               timestamp
|       ) AS current_price
|   FROM
|       (
|           SELECT
|               *,
|               COALESCE(LAG(timestampvalue, 1), <<end_of_day>>) OVER (
|                   PARTITION BY
|                       availability_zone,
|                       instance_type
|                   ORDER BY
|                       timestamp
|               ) AS expire,
|               timestamp-<<expected_uptime>> AS effective
|           FROM
|               prices
|       ) temp
|   RIGHT JOIN
|       time_range ON time_range.num BETWEEN temp.effective AND temp.expire
|   GROUP BY
|       availability_zone,
|       instance_type,
|       time_range.num AS time
|   WHERE
|       expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>>


Now, for the same, with description:

This WITH clause is not real SQL; it is meant to stand in for a
temporary table that contains all hours of the time range I am
interested. Definitely easier to do in Python. All time is assumed to be
in seconds since epoch.

|   WITH time_range AS (
|       SELECT
|           num
|       FROM
|           all_integers
|       WHERE
|           num % 60 =0 AND
|           num >= floor(<<now>>/60/60)*60*60-<<start_of_history>> AND
|           num < floor(<<now>>/60/60) + 60*60
|   )

We will select the three dimensions we are interested in (see GROUP BY
below), along with the MAX price we have seen in the given hour, and the
current_price for any (availability_zone, instance_type) pair.

|   SELECT 
|       availability_zone,
|       instance_type,
|       time_range.num AS time
|       MAX(price) as PRICE,
|       COUNT(1) AS `COUNT`,
|       LAST(current_price) OVER (
|           PARTITION BY
|               availability_zone,
|               instance_type
|           ORDER BY
|               timestamp
|       ) AS current_price
|   FROM

The prices coming from Amazon only have a timestamp for when that price
is effective; so this sub-query adds an `effective` start time, and an
`expire` time so the rest of the query need only deal with ranges. The 
timestamp-<<expected_uptime>> is putting the start time back further
into the past so the past can "see" future pricing. 

|       (
|           SELECT
|               *,
|               COALESCE(LAG(timestamp, 1), <<end_of_day>>) OVER (
|                   PARTITION BY
|                       availability_zone,
|                       instance_type
|                   ORDER BY
|                       timestamp
|               ) AS expire,
|               timestamp-<<expected_uptime>> AS effective
|           FROM
|               prices
|       ) temp

This is the point where we use the time_range from above and find every
hour a price is effective.  This could have been a sub-query, but I am
rusty at SQL

|   RIGHT JOIN
|       time_range ON time_range.num BETWEEN temp.effective AND temp.expire

These are the three dimensions we are interested in

|   GROUP BY
|       availability_zone,
|       instance_type,
|       time_range.num AS time

and we are only interested in calculating back to a certain point

|   WHERE
|       expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>>






_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


--
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython