On 01/29/2018 05:59 PM, josef.pktd@gmail.com wrote:
On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
>
>
> On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
>
> I <3 structured arrays. I love the fact that I can access data by
> row and then by fieldname, or vice versa. There are times when I
> need to pass just a column into a function, and there are times when
> I need to process things row by row. Yes, pandas is nice if you want
> the specialized indexing features, but it becomes a bear to deal
> with if all you want is normal indexing, or even the ability to
> easily loop over the dataset.
>
>
> I don't think there is a doubt that structured arrays, arrays with
> structured dtypes, are a useful container. The question is whether they
> should be more or the foundation for more.
>
> For example, computing a mean, or any reduce operation, over numeric
> elements ("columns"). Before padded views it was possible to index by
> selecting the relevant "columns" and view them as a standard array. With
> padded views that breaks and AFAICS, there is no way in numpy 1.14.0 to
> compute a mean of some "columns". (I don't have numpy 1.14 to try or find
> a workaround, like maybe looping over all relevant columns.)
>
> Josef
Just to clarify, structured types have always had padding bytes, that isn't new.
What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case.
I think recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding-bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior.
In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields.
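As a minimal sketch of the point about padding: the field names `a` and `b`, and the use of `align=True` to force padding, are illustrative assumptions and not taken from the thread. Single-field access reads the right bytes whether or not the dtype contains padding:

```python
import numpy as np

# align=True asks numpy for a C-struct-like layout, so padding bytes are
# inserted after the 1-byte field 'a' to align the 8-byte field 'b'.
dt = np.dtype([('a', 'i1'), ('b', 'f8')], align=True)
arr = np.zeros(4, dtype=dt)
arr['b'] = [1.0, 2.0, 3.0, 4.0]

# Bytes in each record that belong to no field.
padding = dt.itemsize - sum(dt.fields[name][0].itemsize for name in dt.names)

# Single-field indexing uses the field offset, so the padding is skipped.
col_mean = arr['b'].mean()

print(padding, col_mean)
```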
from the example in the other thread
a[['b', 'c']].view(('f8', 2)).mean(0)
(from the statsmodels usecase:
read csv with genfromtxt to get recarray or structured array
select/index the numeric columns
view them as standard array
do whatever we can do with standard numpy arrays
)
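One way to sketch this workflow without a view is the "looping over all relevant columns" workaround mentioned earlier in the thread. The CSV contents and the names `csv_text`, `numeric_names` and `plain` below are made up for illustration; this is not the statsmodels code:

```python
import io
import numpy as np

# Hypothetical CSV contents standing in for a real data file.
csv_text = "a,b,c\n1,2.0,3.0\n4,5.0,6.0\n"

# names=True reads the header row and returns a structured array.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

# Copy the selected fields into a regular 2-D float array, one column each.
numeric_names = ["b", "c"]
plain = np.column_stack([data[name] for name in numeric_names])

# From here on it is an ordinary ndarray.
col_means = plain.mean(0)
print(plain.shape, col_means)
```

Unlike the view, this copies the data, but it does not depend on how the fields are laid out in memory.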
Oh ok, I misunderstood. I see your point: a mean over fields is more difficult than before.
Or, to phrase it as a question:
How do we get a standard array with homogeneous dtype from the corresponding elements of a structured dtype in numpy 1.14.0?
Josef
The answer may be that "numpy has never had a way to do that",
even if in a few special cases you might hack a workaround using views.
That's what your example seems like to me. It uses an explicit view, which is an "expert" feature since views depend on the exact memory layout and binary representation of the array. Your example only works if the two fields have exactly the same dtype as each other and as the final dtype, and evidently breaks if there is byte padding for any reason.
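A small sketch of the dtype problem described above, using the same `'i8,f8'` dtype as the pandas example in this message (the variable names are illustrative). Reinterpreting the records as `f8` leaves the float field intact but misreads the integer field's bits:

```python
import numpy as np

# Fields are auto-named 'f0' (int64) and 'f1' (float64); no padding here.
a = np.ones(3, dtype='i8,f8')

# Reinterpret the raw 16-byte records as pairs of float64 values.
viewed = a.view('f8').reshape(len(a), 2)

# Column 1 is the real float field; column 0 holds the bit pattern of the
# integer 1 read as a float, which is not 1.0.
print(viewed[0])
```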
Pandas can do row means without these problems:
>>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)
Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.
I suggest that if we want to allow either means over fields, or conversion of an n-D structured array to an n+1-D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions
which does not depend on the binary representation of the array.
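A rough sketch of what such a function might look like follows. The name `fields_to_array` and its parameters are hypothetical, chosen only for illustration; this is not an existing recfunctions API:

```python
import numpy as np

def fields_to_array(arr, names=None, dtype=None):
    """Hypothetical helper: stack fields of a structured array on a new last axis."""
    if names is None:
        names = arr.dtype.names
    if dtype is None:
        # Common dtype that can hold every selected field.
        dtype = np.result_type(*[arr.dtype[name] for name in names])
    # Fields are accessed by name and cast by value, so offsets, padding
    # and differing field dtypes do not matter.
    return np.stack([arr[name].astype(dtype) for name in names], axis=-1)

a = np.ones((2, 3), dtype='i8,f8')
out = fields_to_array(a)
print(out.shape, out.dtype, out.mean(axis=-1))
```

Because each field is cast by value, mixed integer and float fields convert correctly, at the cost of a copy. NumPy 1.16 and later provide `numpy.lib.recfunctions.structured_to_unstructured` for this kind of conversion.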
Allan
Josef
Allan
>
> Cheers!
> Ben Root
>
> On Mon, Jan 29, 2018 at 3:24 PM, <josef.pktd@gmail.com> wrote:
>
>
>
> On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
>
> On Mon, 29 Jan 2018 14:10:56 -0500, josef.pktd@gmail.com wrote:
>
> Given that there is pandas, xarray, dask and more, numpy could as well
> drop any pretense of supporting dataframe_likes. Or, adjust the
> recfunctions so we can still work dataframe_like with structured
> dtypes/recarrays/recfunctions.
>
>
> I haven't been following the duckarray discussion carefully, but could
> this be an opportunity for a dataframe protocol, so that we can have
> libraries ingest structured arrays, record arrays, pandas dataframes,
> etc. without too much specialized code?
>
>
> AFAIU while not being in the data handling area, pandas defines the
> interface and other libraries provide pandas compatible interfaces or
> implementations.
>
> statsmodels currently still has recarray support and usage. In some
> interfaces we support pandas, recarrays and plain arrays, or anything
> where asarray works correctly.
>
> But recarrays became messy to support, one rewrite of some functions
> last year converts recarrays to pandas, does the manipulation and then
> converts back to recarrays.
> Also we need to adjust our recarray usage with new numpy versions. But
> there is no real benefit because I doubt that statsmodels still has any
> recarray/structured dtype users. So, we only have to remove our own uses
> in the datasets and unit tests.
>
> Josef
>
>
>
>
> Stéfan
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion