[Pandas-dev] Developing libpandas as a separate codebase + Componentization + Deferred "pandas expressions"

Wes McKinney wesmckinn at gmail.com
Wed May 3 19:09:43 EDT 2017


hi folks,

Bit of a multi-tiered discussion, but it's all somewhat related so putting
it all in one e-mail.

*TOPIC ONE:* I have been thinking about how to proceed with pandas 2.0
development in a sane way with the following goals:

* Delivering some incrementally valuable functionality to production pandas
users (e.g. a faster CSV reader, some faster algorithms). There might be
faster multithreaded code we can make available via a memory layout
conversion layer (NaNs to bitmaps, etc. -- sketched after this list)

* Being able to install subcomponents of pandas 2.0 (like libpandas)
alongside production pandas to get feedback from users, particularly around
low-level data semantics (copy-on-write, etc.)

* Migrating compiled code and utility algorithms out of current pandas
codebase
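
On the NaN-to-bitmap point, here is a minimal sketch using only numpy of
what such a conversion could look like (the function name and the
Arrow-style LSB-first bit order are my assumptions, not settled API):

    import numpy as np

    def nans_to_bitmap(values):
        # pandas 0.x uses NaN as the null sentinel for floats; Arrow-style
        # memory uses a separate validity bitmap, one bit per value.
        valid = ~np.isnan(values)
        # Arrow lays bitmaps out LSB-first within each byte.
        bitmap = np.packbits(valid, bitorder="little")
        return values, bitmap

    values, bitmap = nans_to_bitmap(np.array([1.0, np.nan, 3.0]))
    # bitmap[0] == 0b101 -> slots 0 and 2 valid, slot 1 null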

Changing the internals of Series and DataFrame is going to be a difficult
process (and frankly, it would be easier to build a brand new project, but
I am not going to advocate for that). But I think one way we can make
things easier is by developing "libpandas" and its Python bindings as a
separate codebase.

What goes in libpandas? In my view:

* The semantic contents of pandas._lib
* New "guts" of Series and DataFrame, what I've been colloquially calling
pandas.Array (Series with no Index) and pandas.Table (DataFrame with no
index)
* New implementations of Index, based on libpandas.Array
* Computational libraries that presume a particular memory layout:
pandas.core.algorithms, pandas.core.ops, pandas.core.nanops, etc.
* Low-level IO code (moving data from other formats into new pandas data
structures)
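
To make the Array/Table split concrete, here's a minimal pure-Python
sketch (names and details are hypothetical, just to illustrate "Series
with no Index"):

    import numpy as np

    class Array:
        """A typed 1-D column: values plus a validity mask, no Index."""
        def __init__(self, values, valid=None):
            self.values = np.asarray(values)
            self.valid = (np.ones(len(self.values), dtype=bool)
                          if valid is None else np.asarray(valid))

    class Table:
        """An ordered mapping of column names to equal-length Arrays."""
        def __init__(self, columns):
            self.columns = dict(columns)
        def __getitem__(self, name):
            return self.columns[name]

    tbl = Table({'x': Array([1.0, 2.0, 3.0]),
                 'y': Array(['a', 'b', 'c'])})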

The idea is that libpandas would (someday) be a hard dependency of pandas,
and contain most or all of the compiled code in pandas. To simplify things
for most contributors, we could publish nightly dev wheels or conda
packages so that you can update libpandas in your dev environment and
proceed with developing pure Python code.

Let me know what you think. I've spent the majority of my net development
time over the last year hardening Apache Arrow as a C++ library we can use
in libpandas for physical columnar in-memory management, so I'm ready (now
with the Arrow 0.3 release about ready to drop) to start making some more
progress on this.

*TOPIC TWO:* We discussed this on the last dev meeting call, but I wanted
to see what others think and whether there are any action items. To help with
more frequent pandas releases, particularly of subcomponents which are pure
Python, I wonder if we could move toward a release model of "pandas" as a
metapackage for a series of subcomponents which are packaged independently.
As an example:

pandas depends on
  pandas_display (Display for humans)
  pandas_io
  pandas_plotting
  pandas_core

and so forth. I think it would be better to keep a single codebase for
this; I don't have a strong opinion about separate release cycles. The
point is more to establish cleaner boundaries around the use of private
and public APIs. The codebase is effectively organized like this already,
so I'm not sure what concrete steps this would require.
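
For instance, the metapackage's setup.py could be little more than a pin
list (the subpackage names above are only strawmen):

    from setuptools import setup

    setup(
        name='pandas',
        version='2.0.0',
        # The metapackage ships no code of its own; it just pins
        # compatible versions of the independently released pieces.
        install_requires=[
            'pandas_core==2.0.0',
            'pandas_display==2.0.0',
            'pandas_io==2.0.0',
            'pandas_plotting==2.0.0',
        ],
    )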

*TOPIC THREE:* I think we should start developing a "deferred pandas API"
that is designed and directly developed by the pandas developer community.
From our respective experiences creating expression DSLs and other
computation frameworks on top of pandas, I believe we can build something
reasonable and useful here. As one concrete problem this
would help with: addressing some of the awkwardness around complex
groupby-aggregate expressions (custom aggregations would simply be named
expressions).
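
To illustrate the groupby-aggregate point, compare what we do today with a
sketch of the deferred form (the deferred API shown is hypothetical):

    import pandas as pd

    df = pd.DataFrame({'ticker': ['A', 'A', 'B'],
                       'price': [10.0, 11.0, 20.0],
                       'volume': [100, 200, 50]})

    # Today: a volume-weighted average price needs apply plus a closure.
    vwap = df.groupby('ticker').apply(
        lambda g: (g.price * g.volume).sum() / g.volume.sum())

    # Deferred sketch (hypothetical API): the custom aggregation is just
    # a named expression, and nothing executes until asked to:
    #   expr = t.group_by(t.ticker).aggregate(
    #       vwap=(t.price * t.volume).sum() / t.volume.sum())
    #   result = expr.execute()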

The idea of the deferred expression API would be similar to dplyr in R:

* "True" schemas (we'll have to work around pandas 0.x warts with implicit
casts, etc.)

* Immutable data structures / no mutation outside "amend" operations that
change values by returning new objects (see the sketch after this list)

* Less index-related stuff in this API (perhaps this is controversial, we
shall see)
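
On the mutation point, a minimal sketch of an "amend" operation (the
method name set_value is hypothetical):

    import numpy as np

    class Array:
        """Immutable column sketch: "amend" ops return new objects."""
        def __init__(self, values):
            self._values = np.asarray(values)
            self._values.flags.writeable = False   # enforce immutability

        def set_value(self, i, value):
            # Copy, modify, wrap -- never mutate self in place.
            new = self._values.copy()
            new[i] = value
            return Array(new)

    a = Array([1.0, 2.0, 3.0])
    b = a.set_value(1, 99.0)    # a is untouched; b is a new Array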

We can create an in-memory backend for "pandas expressions" on pandas
0.x/1.0 and separately create an alternative backend using libpandas (once
that is more fully baked / functional) -- this will also provide a forcing
function for building out the analytics the backend requires.

Distributed execution is almost certainly out of scope for us, and even if
it weren't, we would probably want to lean on prior art in Dask or elsewhere.
So if the dask.dataframe API and the pandas expression API look different
in ways that are unpleasant, we could either compile from pandas -> dask
under the hood, or make API changes to make the semantics more conforming.

When libpandas / pandas 2.0 is more mature we can consider building
stronger out-of-core execution (plenty of prior art we can learn from here,
e.g. SFrame).

As far as tools to implement the deferred expression API -- I will leave
this to discussion. I spent a considerable amount of time making a
pandas-like expression API for SQL in Ibis (see
https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at
Cloudera, so there are some ideas there (like separating the "internal" AST
from the "external" user expressions) that we can learn from; we could also
fork or reuse some of that expression code. I don't have a strong opinion
as long as the expressions are as strongly-typed as possible (i.e. tables
have schemas, operations have checked input and output types) and catch
user errors as soon as feasible.
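
A minimal sketch of the internal/external split and the eager type
checking (all names hypothetical; Ibis structures things similarly):

    # Internal AST nodes: they validate inputs and derive output types.
    class Column:
        def __init__(self, name, type):
            self.name, self.type = name, type

    class Mean:
        def __init__(self, arg):
            # Catch user errors at construction time, not execution time.
            if arg.type not in ('int64', 'float64'):
                raise TypeError('mean() needs a numeric column, got ' + arg.type)
            self.arg, self.type = arg, 'float64'

    # External user expression: a thin, stable wrapper over the AST, so
    # the internals can be refactored without breaking user code.
    class ColumnExpr:
        def __init__(self, op):
            self.op = op
        def mean(self):
            return ColumnExpr(Mean(self.op))

    ColumnExpr(Column('price', 'float64')).mean()   # ok
    ColumnExpr(Column('sym', 'string')).mean()      # TypeError, immediately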

This ended up being more text than I planned. If we want to discuss these
things independently, feel free to send a reply with an altered subject
line. Looking forward to seeing what everyone thinks.

Thanks!
Wes