[Pandas-dev] Pandas Deferred Expressions

Tue May 30 18:51:35 EDT 2017

*(My apologies for chiming in here without intending to do any of the
actual work.)*

I wonder if there is a half-solution where a small subset of operations are
lazy much in the same way that the current groupby operations are lazy in
Pandas 0.x.  If this laziness were extended to a small set of mostly linear
operations (element-wise, filters, aggregations, column projections,
groupbys) then that might hit a few of the bigger optimizations that people
care about without going down the full lazy-relational-algebra-in-python
path.  Once you do an operation that is not one of these, we collapse the
lazy dataframe and replace it with a concrete one.  Slowly extending a
small set of operations may also be doable in an incremental fashion as
needed, which might be an easier transition for a community of users.
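
To make the idea a bit more concrete, here is a rough sketch of what I have
in mind. It is pure Python and does not touch pandas at all: `LazyTable`,
`select`, `filter`, `collect`, and `sort_by` are hypothetical names made up
for illustration, and a dict of column lists stands in for a DataFrame.
Projections and filters are only recorded; any operation outside that small
lazy subset (here, a sort) collapses the lazy object into a concrete table
first.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stand-in for a concrete DataFrame: column name -> list of values.
Table = Dict[str, List]


class LazyTable:
    """Hypothetical sketch: records projections and filters, runs them later."""

    def __init__(
        self,
        source: Table,
        columns: Optional[List[str]] = None,
        predicates: Tuple[Tuple[str, Callable], ...] = (),
    ) -> None:
        self._source = source
        self._columns = columns        # None means "all columns"
        self._predicates = predicates  # (column, test) pairs, applied together

    def select(self, *names: str) -> "LazyTable":
        # Lazy: only remember which columns the caller wants.
        return LazyTable(self._source, list(names), self._predicates)

    def filter(self, column: str, test: Callable) -> "LazyTable":
        # Lazy: only remember the row test.
        if self._columns is not None and column not in self._columns:
            raise KeyError(column)
        return LazyTable(
            self._source, self._columns, self._predicates + ((column, test),)
        )

    def collect(self) -> Table:
        # Collapse: one pass over the rows, copying only the selected columns.
        n_rows = len(next(iter(self._source.values()), []))
        keep = [
            i
            for i in range(n_rows)
            if all(test(self._source[col][i]) for col, test in self._predicates)
        ]
        names = self._columns if self._columns is not None else list(self._source)
        return {name: [self._source[name][i] for i in keep] for name in names}

    def sort_by(self, column: str) -> Table:
        # Not in the lazy subset: collapse first, then work on concrete data.
        table = self.collect()
        order = sorted(range(len(table[column])), key=table[column].__getitem__)
        return {name: [values[i] for i in order] for name, values in table.items()}


data = {
    "city": ["A", "B", "C", "D"],
    "temp": [21, 35, 28, 40],
    "humidity": [60, 20, 45, 10],
}
hot = LazyTable(data).filter("temp", lambda t: t > 25).select("city", "temp")
print(hot.collect())        # {'city': ['B', 'C', 'D'], 'temp': [35, 28, 40]}
print(hot.sort_by("temp"))  # {'city': ['C', 'B', 'D'], 'temp': [28, 35, 40]}
```

Note that the "humidity" column is never copied, which is the kind of cheap
win I mean. Supporting one more lazy operation would mean adding one more
recorded field, so the set could grow incrementally.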

Of course, half-measures can also cause more maintenance costs long term
and may lack optimizations that Pandas devs find valuable.  I'm unqualified
to judge the merits of any of these solutions, just thought I'd bring this
up.  Feel free to ignore.

On Tue, May 30, 2017 at 6:28 PM, Phillip Cloud <cpcloud at gmail.com> wrote:

> On Tue, May 30, 2017 at 5:19 PM Phillip Cloud <cpcloud at gmail.com> wrote:
>
> Hi all,
>>
>> I'd like to fork part of the thread from Wes's original email about the
>> future of pandas and discuss all things deferred expressions. To start,
>> here are Wes's original thoughts, and a response from Chris Bartak that was
>> in a different thread. After I send this email I'm going to follow up with
>> my own thoughts in a different email so I can address any specific concerns
>> as well as offer up a list of advantages and disadvantages to this approach
>> and lessons learned about building DSLs in Python.
>>
>> *Wes's post:*
>>
>> *TOPIC THREE:* I think we should start developing a "deferred pandas
>> API" that is designed and directly developed by the pandas developer
>> community. From our respective experiences creating expression DSLs and
>> other computation frameworks on top of pandas, I believe this is something
>> where we can build something reasonable and useful. As one concrete problem
>> this would help with: addressing some of the awkwardness around complex
>> groupby-aggregate expressions (custom aggregations would simply be named
>> expressions).
>>
>> The idea of the deferred expression API would be similar to dplyr in R:
>>
>>
>> * "True" schemas (we'll have to work around pandas 0.x warts with
>> implicit casts, etc.)
>>
>> * Immutable data structures / no mutation outside "amend" operations that
>> change values by returning new objects
>>
>> * Less index-related stuff in this API (perhaps this is controversial, we
>> shall see)
>>
>> We can create an in-memory backend for "pandas expressions" on pandas
>> 0.x/1.0 and separately create an alternative backend using libpandas (once
>> that is more fully baked / functional) -- this will also help provide a
>> forcing function for implementing analytics that are required for
>> implementing the backend.
>>
>> Distributed execution for us is almost certainly out of scope, and even
>> if it were in scope we would probably want to offload onto prior art in
>> Dask or elsewhere. So if the dask.dataframe API and the pandas expression
>> API look different in ways that are unpleasant, we could either compile
>> from pandas -> dask under the hood, or make API changes to make the
>> semantics more conforming.
>>
>> When libpandas / pandas 2.0 is more mature we can consider building
>> stronger out-of-core execution (plenty of prior art we can learn from here,
>> e.g. SFrame).
>>
>> As far as tools to implement the deferred expression API -- I will leave
>> this to discussion. I spent a considerable amount of time making a
>> pandas-like expression API for SQL in Ibis (see
>> https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at
>> Cloudera, so there are some ideas there (like separating the "internal"
>> AST from the "external" user expressions) that we can learn from, or we
>> could fork or use some of that expression code in some way. I don't have
>> a strong opinion as long as the expressions are as strongly typed as
>> possible (i.e. tables have schemas, operations have checked input and
>> output types) and catch user errors as soon as feasible.
>>
>> *Chris B's response:*
>>
>> Deferred API
>>
>> Mixed thoughts about this.  On the one hand, it's obviously a good thing:
>> it enables smarter execution, typing/schemas could result in code that is
>> much easier/safer to write, etc.
>>
>>
>> On the other hand, the pandas API is already massive and reasonably
>> difficult to master, and it's a big ask to learn a new one.  Dask is a good
>> example of how NOT having a new API can be very valuable.  All this to say
>> I think adoption might be pretty low?  Could be my own biases - coming from
>> a "smallish data" user of pandas, I've never found the "write once, execute
>> on different backends" argument especially compelling because I've never
>> had the need.
>>
> I agree with the underlying sentiment in Chris’s post. If we are going to
> build something new, there need to be very compelling reasons to switch so
> that there’s some offset to the switching costs.
>
> Benefits I see from using expressions that individual users may find
> convincing:
>
>    1. Code correctness guarantees and API clarity using schemas and types.
>       1. Operations fail very early and tab completion shows you exactly
>       what operations are valid on a particular object.
>    2. Optimizations through expression rewriting (column pruning,
>    predicate pushdown).
>       1. We don’t need to read every column to select just one. Last time
>       I checked, nearly all of our IO APIs require reading in all columns
>       to do an operation on just a few.
>    3. Somewhat ironically, a much smaller API to learn.
>       1. No indexes, extremely complex slicing or functions that have
>       many different ways to do the same thing (like our old friend
>       replace).
>
> Reasons that I think individual users will not find convincing:
>
>    1. The ability to run on multiple backends. Many people do not have
>    this problem. I suspect the majority of pandas users do *not* have
>    this problem. We shouldn’t try to convince our users that this is why they
>    should switch, nor should we prioritize this aspect of pandas2.
>
> Potential pitfalls to adoption with using expressions to build pandas2:
>
>    1. Too dissimilar from current pandas.
>    2. Development getting bogged down in lowest common denominator
>    problems (i.e., requiring that every backend implement every operation)
>    resulting in an extremely limited API.
>    3. More abstract execution model, and therefore more difficult to
>    understand and debug errors.
>
> I personally think we should do the following:
>
>    1. Draft a list of “must-have” operations on DataFrames
>    2. Use ibis as a base for building experimental pandas deferred
>    expressions.
>    3. Forget about supporting “all the backends” and focus on SQL and
>    pandas. Make sure that most of our users don’t have to care about this
>    aspect of pandas. The fact that operations are delayed should be almost
>    invisible unless desired. For example, even though we are delaying
>    operations internally, the result should appear to be eagerly evaluated.
>    The model would be: “write once, execute on pandas only by default, nearly
>    invisible to the user”
>    4. Go deep on pandas expressions and add non-SQL-compatible ones if
>    necessary to preserve as much of the spec’d-out API as we can.
>    5. Try not to break backwards compatibility with SQL backends, but
>    don’t treat it as a hard requirement if pandas2 needs to break it.
>    Alternatively, we build the pandas backend on top of ibis instead of
>    inside it so that we have even more freedom.
>
> I’ve got a patch up that implements some of the pandas API in ibis here
> <https://github.com/pandas-dev/ibis/pull/981>, if anyone would like to
> follow along.
>
> -Phillip