[Python-ideas] Vectorization [was Re: Add list.join() please]
Christopher Barker
pythonchb at gmail.com
Mon Feb 11 10:43:12 EST 2019
Do take a look in the fairly recent archives of this list for a big
discussion of groupby -- it kind of petered out but there were a couple
options on the table.
-CHB
On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski <klahnakoski at mozilla.com>
wrote:
>
> On 2019-02-10 18:30, Steven D'Aprano wrote:
> >
> > Can you post a simplified example of how you would do it in SQL,
> > compared to what you would have to do in standard Python?
>
> Can I do the same in standard Python? If I did, then I would use Pandas:
> it has groupby, and some primitive joining, and window functions may
> come naturally because of its imperative nature, but I have not tried
> it. If I can not use Pandas, then I would write the groupby and window
> functions and call them in sequence. This is similar to what you see in
> my code now: a number of properties who's values get dispatched to
> Python functions. My code is more complicated only because those
> structures can be dispatched to translators for databases too.
>
> I am certain there are many variations of groupby out in the wild, and
> it would be nice to have the concept standardized when/if Python has
> vector operations. Join would be nice to have too, but I do not use it
> much; dictionary lookup seems to fill that need. Window functions
> (which are like mini queries) are powerful, but like Pandas, may end up
> end up being free because Python is imperative.
>
> My code I pointed to has two parts. Here is the first part in SQL (well,
> an approximation of SQL since I did not test this, and now I am rusty).
> A detailed description is below
>
> | WITH time_range AS (
> | SELECT
> | num
> | FROM
> | all_integers
> | WHERE
> | num % 60 =0 AND
> | num >= floor(<<now>>/60/60)*60*60-<<start_Of_history>> AND
> | num < floor(<<now>>/60/60) + 60*60
> | )
> | SELECT
> | availability_zone,
> | instance_type,
> | time_range.num AS time
> | MAX(price) as PRICE,
> | COUNT(1) AS `COUNT`,
> | LAST(current_price) OVER (
> | PARTITION BY
> | availability_zone,
> | instance_type
> | ORDER BY
> | timestamp
> | ) AS current_price
> | FROM
> | (
> | SELECT
> | *,
> | COALESCE(LAG(timestampvalue, 1), <<end_of_day>>) OVER (
> | PARTITION BY
> | availability_zone,
> | instance_type
> | ORDER BY
> | timestamp
> | ) AS expire,
> | timestamp-<<expected_uptime>> AS effective
> | FROM
> | prices
> | ) temp
> | RIGHT JOIN
> | time_range ON time_range.num BETWEEN temp.effective AND temp.expire
> | GROUP BY
> | availability_zone,
> | instance_type,
> | time_range.num AS time
> | WHERE
> | expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>>
>
>
> Now, for the same, with description:
>
> This WITH clause is not real SQL; it is meant to stand in for a
> temporary table that contains all hours of the time range I am
> interested. Definitely easier to do in Python. All time is assumed to be
> in seconds since epoch.
>
> | WITH time_range AS (
> | SELECT
> | num
> | FROM
> | all_integers
> | WHERE
> | num % 60 =0 AND
> | num >= floor(<<now>>/60/60)*60*60-<<start_of_history>> AND
> | num < floor(<<now>>/60/60) + 60*60
> | )
>
> We will select the three dimensions we are interested in (see GROUP BY
> below), along with the MAX price we have seen in the given hour, and the
> current_price for any (availability_zone, instance_type) pair.
>
> | SELECT
> | availability_zone,
> | instance_type,
> | time_range.num AS time
> | MAX(price) as PRICE,
> | COUNT(1) AS `COUNT`,
> | LAST(current_price) OVER (
> | PARTITION BY
> | availability_zone,
> | instance_type
> | ORDER BY
> | timestamp
> | ) AS current_price
> | FROM
>
> The prices coming from Amazon only have a timestamp for when that price
> is effective; so this sub-query adds an `effective` start time, and an
> `expire` time so the rest of the query need only deal with ranges. The
> timestamp-<<expected_uptime>> is putting the start time back further
> into the past so the past can "see" future pricing.
>
> | (
> | SELECT
> | *,
> | COALESCE(LAG(timestamp, 1), <<end_of_day>>) OVER (
> | PARTITION BY
> | availability_zone,
> | instance_type
> | ORDER BY
> | timestamp
> | ) AS expire,
> | timestamp-<<expected_uptime>> AS effective
> | FROM
> | prices
> | ) temp
>
> This is the point where we use the time_range from above and find every
> hour a price is effective. This could have been a sub-query, but I am
> rusty at SQL
>
> | RIGHT JOIN
> | time_range ON time_range.num BETWEEN temp.effective AND temp.expire
>
> These are the three dimensions we are interested in
>
> | GROUP BY
> | availability_zone,
> | instance_type,
> | time_range.num AS time
>
> and we are only interested in calculating back to a certain point
>
> | WHERE
> | expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>>
>
>
>
>
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
--
Christopher Barker, PhD
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20190211/ec3d2184/attachment-0001.html>
More information about the Python-ideas
mailing list