[Numpy-discussion] Setting custom dtypes and 1.14
josef.pktd at gmail.com
Mon Jan 29 17:59:29 EST 2018
On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd at gmail.com> wrote:
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane at gmail.com> wrote:
On 01/29/2018 04:02 PM, josef.pktd at gmail.com wrote:
On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root at gmail.com> wrote:
>> > I <3 structured arrays. I love the fact that I can access data by
>> > row and then by fieldname, or vice versa. There are times when I
>> > need to pass just a column into a function, and there are times when
>> > I need to process things row by row. Yes, pandas is nice if you want
>> > the specialized indexing features, but it becomes a bear to deal
>> > with if all you want is normal indexing, or even the ability to
>> > easily loop over the dataset.
>> > I don't think there is a doubt that structured arrays, arrays with
>> > structured dtypes, are a useful container. The question is whether they
>> > should be more or the foundation for more.
>> > For example, computing a mean, or reduce operation, over numeric elements
>> > ("columns"). Before padded views it was possible to index by selecting
>> > the relevant "columns" and view them as standard array. With padded
>> > views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute
>> > a mean of some "columns". (I don't have numpy 1.14 to try or find a
>> > workaround, like maybe looping over all relevant columns.)
>> >
>> > Josef
>> Just to clarify, structured types have always had padding bytes, that
>> isn't new.
>>
>> What *is* new (which we are pushing to 1.15, I think) is that it may be
>> somewhat more common to end up with padding than before, and only if you
>> are specifically using multi-field indexing, which is a fairly
>> specialized case.
>>
>> I think recfunctions already account properly for padding bytes. Except
>> for the bug in #8100, which we will fix, padding-bytes in recarrays are
>> more or less invisible to a non-expert who only cares about
>> dataframe-like behavior.
>>
>> In other words, padding is no obstacle at all to computing a mean over a
>> column, and single-field indexes in 1.15 behave identically as before.
>> The only thing that will change in 1.15 is multi-field indexing, and it
>> has never been possible to compute a mean (or any binary operation) on
>> multiple fields.
>>
> from the example in the other thread
> a[['b', 'c']].view(('f8', 2)).mean(0)
>
>
> (from the statsmodels use case:
> read csv with genfromtxt to get recarray or structured array
> select/index the numeric columns
> view them as standard array
> do whatever we can do with standard numpy arrays
> )
>
Or, to phrase it as a question:
How do we get a standard array with homogeneous dtype from the
corresponding elements of a structured dtype in numpy 1.14.0?
Josef
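
A minimal sketch of one way to approach the question above, assuming the goal is a plain 2-D float array built from a few numeric fields. The array `a` and its field names are hypothetical, modelled on the `a[['b', 'c']]` example quoted above. It relies only on single-field indexing, so it does not depend on how multi-field indexing or padding behaves in a given numpy version. Unlike the `.view(('f8', 2))` idiom, it makes a copy instead of returning a view.

```python
import numpy as np

# Hypothetical data: one integer field and two float fields.
a = np.array(
    [(1, 2.0, 3.0), (4, 5.0, 6.0), (7, 8.0, 9.0)],
    dtype=[("a", "i4"), ("b", "f8"), ("c", "f8")],
)

names = ["b", "c"]  # the numeric "columns" we want as a standard array

# Single-field indexing returns an ordinary 1-D array for each field.
# Stacking those arrays copies the data into a new homogeneous 2-D array
# with one column per selected field.
cols = np.column_stack([a[name] for name in names])

# Any standard array operation now works, e.g. a mean per column.
col_means = cols.mean(axis=0)
print(cols.shape, cols.dtype, col_means)
```

The copy costs memory for large datasets, but the result is an ordinary array whose layout is independent of the structured dtype.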
> Josef
On Mon, Jan 29, 2018 at 3:24 PM, <josef.pktd at gmail.com> wrote:
>> >
On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <stefanv at berkeley.edu> wrote:
>> >
On Mon, 29 Jan 2018 14:10:56 -0500, josef.pktd at gmail.com wrote:
>> >
>> > Given that there is pandas, xarray, dask and more, numpy
>> > could as well drop
>> > any pretense of supporting dataframe_likes. Or, adjust
>> > the recfunctions so
>> > we can still work dataframe_like with structured
>> > dtypes/recarrays/recfunctions.
>> >
>> >
>> > I haven't been following the duckarray discussion carefully,
>> > but could
>> > this be an opportunity for a dataframe protocol, so that we
>> > can have
>> > libraries ingest structured arrays, record arrays, pandas
>> > dataframes,
>> > etc. without too much specialized code?
>> >
>> >
>> > AFAIU while not being in the data handling area, pandas defines
>> > the interface and other libraries provide pandas compatible
>> > interfaces or implementations.
>> >
>> > statsmodels currently still has recarray support and usage. In
>> > some interfaces we support pandas, recarrays and plain arrays,
>> > or anything where asarray works correctly.
>> >
>> > But recarrays became messy to support; one rewrite of some
>> > functions last year converts recarrays to pandas, does the
>> > manipulation and then converts back to recarrays.
>> > Also we need to adjust our recarray usage with new numpy
>> > versions. But there is no real benefit because I doubt that
>> > statsmodels still has any recarray/structured dtype users. So,
>> > we only have to remove our own uses in the datasets and unit
>> > tests.
>> >
>> > Josef
>> >
