[Pandas-dev] How Far do we take ExtensionArrays?

Mon Feb 4 11:20:00 EST 2019

Hello Tom,

Overall I really like the concept of ExtensionArrays, but for more advanced
usage I think there is still a lot to do. At the moment, an implementer is
quite well off when the ExtensionArray can be coerced into a NumPy array.
Once you have data that is not well represented by a NumPy array, you need
to develop many more algorithms yourself. For fletcher this has been a major
hurdle for me (and the reason I have not implemented much so far). This might
also just be because my backing library (Apache Arrow) is still missing a lot
of numerical operations. I hope to have some time in the coming months to
work more on this, and then we can see how many issues pop up. In the end,
though, I would like to avoid coercing to NumPy arrays as much as possible,
as converting arrays with nulls adds some computational overhead.
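
To make the overhead concrete, here is a toy sketch in pure Python. It is
not Arrow's or fletcher's actual code; `to_dense` and the variable names are
made up for illustration, and it assumes a column stored as a data buffer
plus a validity bitmap with one bit per slot. Without nulls the data buffer
can be handed out as-is; with nulls every slot has to be checked and copied
into a new float buffer that uses NaN as the missing marker.

```python
from array import array
import math


def to_dense(values, validity):
    """Expose a nullable int64 column as one dense buffer (hypothetical helper)."""
    if validity is None:
        # No nulls: share the existing data buffer, nothing is copied.
        return memoryview(values)
    # Nulls present: allocate a float buffer and visit every slot.
    out = array("d", [0.0]) * len(values)
    for i, v in enumerate(values):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        out[i] = float(v) if is_valid else math.nan
    return out


values = array("q", [1, 2, 0, 4])
validity = bytearray([0b00001011])  # bit 2 is unset, so slot 2 is null

dense = to_dense(values, validity)  # copied and converted to float
shared = to_dense(values, None)     # view on the original buffer
```

The second call returns a view, so later writes to `values` are visible
through `shared`, while `dense` is an independent copy with a different
element type.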

More comments inline.

On Wed, 16 Jan 2019 at 18:16, Tom Augspurger <
tom.augspurger88 at gmail.com> wrote:

> This is something I've been mulling over the past few days: how much do we
> want ExtensionArrays to change pandas?
>
> They've been great so far at addressing some of the shortcomings of NumPy's
> type system, but I imagine that users will be interested in pushing things
> even further. For example, users have been asking for proper support for
> nested data. Now that we have ExtensionArrays, things are essentially solved
> at the memory level (by e.g. Apache Arrow). But, I imagine that the set of
> APIs typically used for nested data is quite different from those used for
> flat, tabular data pandas handles thus far. If we want to properly support
> nested data, what tolerance do we have for it "cluttering" the existing API?
>

Do we already have an example use case for nested data? For me it is hard
to imagine intuitive APIs for nested data without really good example use
cases.

>
> Finally (and this may be a topic for another day) have people thought about
> how 3rd-party EAs fit in with the potential block manager rewrite? IIUC, one
> of the goals there was a stable C API to the memory inside a DataFrame. Does
> anyone know how that would work with an array that doesn't (or can't)
> implement the buffer protocol?
>

As an example, no Arrow array can implement the buffer protocol, as each
array has at least two buffers (the validity bitmap and the actual data). In
fletcher I have also used ChunkedArray as the backing object of a Series.
This allows us to do operations like concat in constant time, but it also
comes with the cognitive overhead that the data may not be stored in a single
contiguous block of memory.
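
A rough sketch of the chunked idea, again in pure Python and not Arrow's
ChunkedArray itself: `ChunkedColumn` is a hypothetical class that keeps a
list of separately allocated chunks. Concatenation only copies references to
chunks, so its cost depends on the number of chunks rather than the number
of elements, but every element lookup first has to find the right chunk.

```python
from array import array
from bisect import bisect_right
from itertools import accumulate


class ChunkedColumn:
    """Toy column stored as a list of separately allocated chunks."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # Cumulative end offsets, used to find the chunk holding an index.
        self._ends = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def concat(self, other):
        # Only chunk references are copied; no element data is moved.
        return ChunkedColumn(self.chunks + other.chunks)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self._ends, i)
        start = self._ends[k - 1] if k else 0
        return self.chunks[k][i - start]


a = ChunkedColumn([array("q", [1, 2, 3])])
b = ChunkedColumn([array("q", [4, 5])])
c = a.concat(b)  # two chunks, no contiguous buffer of length 5 exists
```

Any consumer that expects one contiguous buffer (for example through the
buffer protocol) would first have to copy the chunks of `c` into a single
allocation.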
