[Python-ideas] Vectorization [was Re: Add list.join() please]

Mon Feb 11 10:43:12 EST 2019

Do take a look in the fairly recent archives of this list for a big
discussion of groupby -- it kind of petered out but there were a couple
options on the table.

-CHB

On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski <klahnakoski at mozilla.com>
wrote:

>
> On 2019-02-10 18:30, Steven D'Aprano wrote:
> >
> > Can you post a simplified example of how you would do it in SQL,
> > compared to what you would have to do in standard Python?
>
> Can I do the same in standard Python? If I did, then I would use Pandas:
> it has groupby, and some primitive joining, and window functions may
> come naturally because of its imperative nature, but I have not tried
> it.  If I can not use Pandas, then I would write the groupby and window
> functions and call them in sequence. This is similar to what you see in
> my code now: a number of properties who's values get dispatched to
> Python functions.  My code is more complicated only because those
> structures can be dispatched to translators for databases too.
>
> I am certain there are many variations of groupby out in the wild, and
> it would be nice to have the concept standardized when/if Python has
> vector operations. Join would be nice to have too, but I do not use it
> much; dictionary lookup seems to fill that need.  Window functions
> (which are like mini queries) are powerful, but like Pandas, may end up
> end up being free because Python is imperative.
>
> My code I pointed to has two parts. Here is the first part in SQL (well,
> an approximation of SQL since I did not test this, and now I am rusty).
> A detailed description is below
>
> |   WITH time_range AS (
> |       SELECT
> |           num
> |       FROM
> |           all_integers
> |       WHERE
> |           num % 60 =0 AND
> |           num >= floor(<<now>>/60/60)*60*60-<<start_Of_history>> AND
> |           num < floor(<<now>>/60/60) + 60*60
> |   )
> |   SELECT
> |       availability_zone,
> |       instance_type,
> |       time_range.num AS time
> |       MAX(price) as PRICE,
> |       COUNT(1) AS `COUNT`,
> |       LAST(current_price) OVER (
> |           PARTITION BY
> |               availability_zone,
> |               instance_type
> |           ORDER BY
> |               timestamp
> |       ) AS current_price
> |   FROM
> |       (
> |           SELECT
> |               *,
> |               COALESCE(LAG(timestampvalue, 1), <<end_of_day>>) OVER (
> |                   PARTITION BY
> |                       availability_zone,
> |                       instance_type
> |                   ORDER BY
> |                       timestamp
> |               ) AS expire,
> |               timestamp-<<expected_uptime>> AS effective
> |           FROM
> |               prices
> |       ) temp
> |   RIGHT JOIN
> |       time_range ON time_range.num BETWEEN temp.effective AND temp.expire
> |   GROUP BY
> |       availability_zone,
> |       instance_type,
> |       time_range.num AS time
> |   WHERE
> |       expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>>
>
>
> Now, for the same, with description:
>
> This WITH clause is not real SQL; it is meant to stand in for a
> temporary table that contains all hours of the time range I am
> interested. Definitely easier to do in Python. All time is assumed to be
> in seconds since epoch.
>
> |   WITH time_range AS (
> |       SELECT
> |           num
> |       FROM
> |           all_integers
> |       WHERE
> |           num % 60 =0 AND
> |           num >= floor(<<now>>/60/60)*60*60-<<start_of_history>> AND
> |           num < floor(<<now>>/60/60) + 60*60
> |   )
>
> We will select the three dimensions we are interested in (see GROUP BY
> below), along with the MAX price we have seen in the given hour, and the
> current_price for any (availability_zone, instance_type) pair.
>
> |   SELECT
> |       availability_zone,
> |       instance_type,
> |       time_range.num AS time
> |       MAX(price) as PRICE,
> |       COUNT(1) AS `COUNT`,
> |       LAST(current_price) OVER (
> |           PARTITION BY
> |               availability_zone,
> |               instance_type
> |           ORDER BY
> |               timestamp
> |       ) AS current_price
> |   FROM
>
> The prices coming from Amazon only have a timestamp for when that price
> is effective; so this sub-query adds an `effective` start time, and an
> `expire` time so the rest of the query need only deal with ranges. The
> timestamp-<<expected_uptime>> is putting the start time back further
> into the past so the past can "see" future pricing.
>
> |       (
> |           SELECT
> |               *,
> |               COALESCE(LAG(timestamp, 1), <<end_of_day>>) OVER (
> |                   PARTITION BY
> |                       availability_zone,
> |                       instance_type
> |                   ORDER BY
> |                       timestamp
> |               ) AS expire,
> |               timestamp-<<expected_uptime>> AS effective
> |           FROM
> |               prices
> |       ) temp
>
> This is the point where we use the time_range from above and find every
> hour a price is effective.  This could have been a sub-query, but I am
> rusty at SQL
>
> |   RIGHT JOIN
> |       time_range ON time_range.num BETWEEN temp.effective AND temp.expire
>
> These are the three dimensions we are interested in
>
> |   GROUP BY
> |       availability_zone,
> |       instance_type,
> |       time_range.num AS time
>
> and we are only interested in calculating back to a certain point
>
> |   WHERE
> |       expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>>
>
>
>
>
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>

-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20190211/ec3d2184/attachment-0001.html>