[Numpy-discussion] Setting custom dtypes and 1.14

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Jan 30 13:33:01 EST 2018


On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <allanhaldane at gmail.com>
wrote:

> On 01/29/2018 11:50 PM, josef.pktd at gmail.com wrote:
>
>>
>>
>> On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhaldane at gmail.com>
>> wrote:
>>
>>     On 01/29/2018 05:59 PM, josef.pktd at gmail.com wrote:
>>
>>
>>
>>         On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd at gmail.com> wrote:
>>
>>
>>
>>              On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane
>>              <allanhaldane at gmail.com> wrote:
>>
>>                  On 01/29/2018 04:02 PM, josef.pktd at gmail.com wrote:
>>                  >
>>                  >
>>                  > On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root
>>                  > <ben.v.root at gmail.com> wrote:
>>                  >
>>                  >     I <3 structured arrays. I love the fact that I can access data by
>>                  >     row and then by fieldname, or vice versa. There are times when I
>>                  >     need to pass just a column into a function, and there are times when
>>                  >     I need to process things row by row. Yes, pandas is nice if you want
>>                  >     the specialized indexing features, but it becomes a bear to deal
>>                  >     with if all you want is normal indexing, or even the ability to
>>                  >     easily loop over the dataset.
>>                  >
>>                  >
>>                  > I don't think there is a doubt that structured arrays, arrays with
>>                  > structured dtypes, are a useful container. The question is whether they
>>                  > should be more or the foundation for more.
>>                  >
>>                  > For example, computing a mean, or reduce operation, over numeric element
>>                  > ("columns"). Before padded views it was possible to index by selecting
>>                  > the relevant "columns" and view them as standard array. With padded
>>                  > views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute
>>                  > a mean of some "columns". (I don't have numpy 1.14 to try or find a
>>                  > workaround, like maybe looping over all relevant columns.)
>>                  >
>>                  > Josef
>>
>>                  Just to clarify, structured types have always had padding bytes, that
>>                  isn't new.
>>
>>                  What *is* new (which we are pushing to 1.15, I think) is that it may be
>>                  somewhat more common to end up with padding than before, and only if you
>>                  are specifically using multi-field indexing, which is a fairly
>>                  specialized case.
>>
>>                  I think recfunctions already account properly for padding bytes. Except
>>                  for the bug in #8100, which we will fix, padding-bytes in recarrays are
>>                  more or less invisible to a non-expert who only cares about
>>                  dataframe-like behavior.
>>
>>                  In other words, padding is no obstacle at all to computing a mean over a
>>                  column, and single-field indexes in 1.15 behave identically as before.
>>                  The only thing that will change in 1.15 is multi-field indexing, and it
>>                  has never been possible to compute a mean (or any binary operation) on
>>                  multiple fields.
>>
>>
>>              from the example in the other thread
>>              a[['b', 'c']].view(('f8', 2)).mean(0)
>>
>>
>>              (from the statsmodels usecase:
>>              read csv with genfromtxt to get recarray or structured array
>>              select/index the numeric columns
>>              view them as standard array
>>              do whatever we can do with standard numpy  arrays
>>              )
>>
>>
>>     Oh ok, I misunderstood. I see your point: a mean over fields is more
>>     difficult than before.
>>
>>         Or, to phrase it as a question:
>>
>>         How do we get a standard array with homogeneous dtype from the
>>         corresponding elements of a structured dtype in numpy 1.14.0?
>>
>>         Josef
>>
>>
>>     The answer may be that "numpy has never had a way to do that",
>>     even if in a few special cases you might hack a workaround using views.
>>
>>     That's what your example seems like to me. It uses an explicit view,
>>     which is an "expert" feature since views depend on the exact memory
>>     layout and binary representation of the array. Your example only
>>     works if the two fields have exactly the same dtype as each other
>>     and as the final dtype, and evidently breaks if there is byte
>>     padding for any reason.
>>
>>     Pandas can do row means without these problems:
>>
>>          >>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)
>>
>>     Numpy is missing this functionality, so you or whoever wrote that
>>     example figured out a fragile workaround using views.
>>
>>
>> Once upon a time (*) this wasn't fragile but the only and recommended
>> way. Because dtypes were low level with clear memory layout and stayed that
>> way, it was easy to check item size or whatever and get different views on
>> it.
>> e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html
>>
>> (*) pre-pandas, pre-stackoverflow on the mailing lists which was for me
>> roughly 2008 to 2012
>> but a late thread https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html
>> "What is now the recommended way of converting structured
>> dtypes/recarrays to ndarrays?"
>>
>>
>>
>>
>>     I suggest that if we want to allow either means over fields, or
>>     conversion of a n-D structured array to an n+1-D regular ndarray, we
>>     should add a dedicated function to do so in numpy.lib.recfunctions
>>     which does not depend on the binary representation of the array.
>>
>>
>> I don't really want to defend an obsolete (?) usecase of structured
>> dtypes.
>>
>> However, I think there should be a decision about the future plans for
>> whether dataframe-like usages of structured dtypes, directly or through
>> higher level classes or functions, are still supported, instead of slowly and
>> silently (*) removing the foundation for this use case. Either support this
>> usage or say you will be dropping it.
>>
>> (*) I didn't read the details of the release notes
>>
>>
>> And another footnote about obsolete:
>> Given that I'm the only one arguing about the dataframe_like usecase of
>> recarrays and structured dtypes, I think they are dead for this specific
>> usecase and only my inertia and conservativeness kept them alive in
>> statsmodels.
>>
>>
>> Josef
>>
>
> It's a bit of a stretch to say that we are "silently" dropping support for
> dataframe-like use of structured arrays.
>
> First, we still allow pretty much all dataframe-like use we have supported
> since numpy 1.7, limited as it may be. We are really only dropping one very
> specialized, expert use involving an explicit view, which I still have
> doubts was ever more than a hack. That 2008 mailing list message didn't
> involve multi-field indexing, which didn't exist then (only introduced in
> 2009), and we have wanted to make them views (not copies) since their
> inception.
>

The 2008 mailing list thread introduced me to working with views on
structured arrays as the ONLY way to switch between structured and
homogeneous dtypes (if the underlying item size was homogeneous).
The new statsmodels started in 2009.
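
For reference, a minimal sketch of that view idiom. It assumes every field
is float64 and the records are packed without padding; the array and the
field names are made up for illustration.

```python
import numpy as np

# Three float64 fields, packed with no padding: itemsize is 3 * 8 bytes.
a = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
             dtype=[('x', 'f8'), ('y', 'f8'), ('z', 'f8')])

n_fields = len(a.dtype.names)

# The checks that make the view safe: every field is float64 and the
# record holds nothing but those fields.
assert all(a.dtype[name] == np.float64 for name in a.dtype.names)
assert a.dtype.itemsize == n_fields * np.dtype(np.float64).itemsize

# Reinterpret the same memory as a plain (N, n_fields) float array.
plain = a.view(np.float64).reshape(len(a), n_fields)
print(plain.mean(axis=0))   # [2.5 3.5 4.5]

# It is a view, so writing to it changes the structured array too.
plain[0, 0] = 10.0
print(a['x'][0])            # 10.0
```

If either check fails (mixed field types, or extra bytes in the record),
the reinterpretation is no longer meaningful.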


>
> Second, I don't think we are doing so silently: We have warned about this
> in release notes since numpy 1.7 in 2012/2013, and it has been mentioned in
> most releases since then. We have also raised FutureWarnings about it since
> 1.7. Unfortunately we missed warning in your specific case for a while, but we
> corrected this in 1.12 so you should have seen FutureWarnings since then.
>

If I see warnings in the test suite about getting a view instead of a copy
from numpy, then the only/main consequence I think about is whether I need to
watch out for in-place modification.
I didn't expect that the follow-up computation would change, and that it's a
padded view and not a view on the selected memory. However, I just checked
and padding is mentioned in the 1.12 release notes (which I had never read
before).

AFAICS, one problem is that the padded view didn't come with the matching
downstream usage support: the pack function as mentioned, an alternative
way to convert to a standard ndarray, the fact that copy doesn't get rid of
the padding, and so on.
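
To make the padding point concrete, here is a small sketch. The dtype is
built by hand with explicit offsets, so the layout is an assumption chosen
for illustration and not what any particular numpy version returns from
multi-field indexing.

```python
import numpy as np

# Hypothetical padded layout: two float64 fields at byte offsets 8 and 16
# inside a 24-byte record. The first 8 bytes of each record are unused.
padded = np.dtype({
    'names': ['b', 'c'],
    'formats': ['f8', 'f8'],
    'offsets': [8, 16],
    'itemsize': 24,
})

arr = np.zeros(3, dtype=padded)
arr['b'] = [1.0, 2.0, 3.0]
arr['c'] = [4.0, 5.0, 6.0]

# The record is larger than the sum of its fields.
field_bytes = sum(padded.fields[name][0].itemsize for name in padded.names)
print(padded.itemsize, field_bytes)      # 24 16

# copy() keeps the dtype, padding included.
copied = arr.copy()
print(copied.dtype.itemsize)             # 24

# Single-field access is unaffected by the padding.
print(arr['b'].mean(), arr['c'].mean())  # 2.0 5.0
```

With a record of 24 bytes holding only 16 bytes of floats, viewing the
memory as a two-column float64 array no longer lines up with the fields.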

e.g. another mailing list thread I just found with the same problem:
http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html

quoting Ralf:
Question: is that really the recommended way to get an (N, 2) size float
array from two columns of a larger record array? If so, why isn't there a
better way? If you'd want to write to that (N, 2) array you have to append
a copy, making it even uglier. Also, then there really should be tests for
views in test_records.py.


This "better way" never showed up, AFAIK. And it looks like we come back to
this problem every few years.
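
In the meantime, a conversion that does not depend on the memory layout can
be sketched by looping over the fields. This always makes a copy, and
fields_to_array is a hypothetical helper name, not a numpy function.

```python
import numpy as np

def fields_to_array(arr, names):
    """Copy the named fields of a structured array into a plain ndarray.

    Hypothetical helper for illustration. Each field is read on its own,
    so padding and field offsets do not matter. The result has one extra
    trailing axis of length len(names), and np.stack picks a common dtype
    if the fields differ.
    """
    return np.stack([arr[name] for name in names], axis=-1)

a = np.array([(1, 1.0, 2.0), (2, 3.0, 4.0)],
             dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

cols = fields_to_array(a, ['b', 'c'])
print(cols.shape)          # (2, 2)
print(cols.mean(axis=0))   # [2. 3.]
```

The price is the copy: writing to the result does not change the original
structured array.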

Josef


>
> I don't feel the need to officially declare that we are dropping support
> for dataframe-like use of structured arrays. It's unclear where that use
> ends and other uses of structured arrays begin. I think updating the docs
> to warn that pandas/dask may be a better choice is enough, as I've been
> doing, and then users can decide for themselves.


> There is still the question about whether we should make
> numpy.lib.recfunctions more official. I don't have a strong opinion. I
> suppose it would be good to add a section to the structured array docs
> which lists those methods and says something like
>
> "the submodule numpy.lib.recfunctions provides minimal functionality to
> split, combine, and manipulate structured datatypes and arrays. In most
> cases, we strongly recommend users use a dedicated module such as
> pandas/xarray/dask instead of these methods, but they are provided for
> occasional convenience."
>
> Allan
>
>
>
>     Allan
>>
>>
>>              Josef
>>
>>
>>                  Allan
>>
>>                  >
>>                  >     Cheers!
>>                  >     Ben Root
>>                  >
>>                  >     On Mon, Jan 29, 2018 at 3:24 PM,
>>                  >     <josef.pktd at gmail.com> wrote:
>>                  >
>>                  >
>>                  >
>>                  >         On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt
>>                  >         <stefanv at berkeley.edu> wrote:
>>                  >
>>                  >             On Mon, 29 Jan 2018 14:10:56 -0500,
>>                  >             josef.pktd at gmail.com wrote:
>>                   >
>>                   >                 Given that there is pandas, xarray, dask and more, numpy
>>                   >                 could as well drop any pretense of supporting
>>                   >                 dataframe_likes. Or, adjust the recfunctions so we can
>>                   >                 still work dataframe_like with structured
>>                   >                 dtypes/recarrays/recfunctions.
>>                   >
>>                   >
>>                   >             I haven't been following the duckarray discussion carefully,
>>                   >             but could this be an opportunity for a dataframe protocol,
>>                   >             so that we can have libraries ingest structured arrays,
>>                   >             record arrays, pandas dataframes, etc. without too much
>>                   >             specialized code?
>>                   >
>>                   >
>>                   >         AFAIU while not being in the data handling area, pandas defines
>>                   >         the interface and other libraries provide pandas compatible
>>                   >         interfaces or implementations.
>>                   >
>>                   >         statsmodels currently still has recarray support and usage. In
>>                   >         some interfaces we support pandas, recarrays and plain arrays,
>>                   >         or anything where asarray works correctly.
>>                   >
>>                   >         But recarrays became messy to support, one rewrite of some
>>                   >         functions last year converts recarrays to pandas, does the
>>                   >         manipulation and then converts back to recarrays.
>>                   >         Also we need to adjust our recarray usage with new numpy
>>                   >         versions. But there is no real benefit because I doubt that
>>                   >         statsmodels still has any recarray/structured dtype users. So,
>>                   >         we only have to remove our own uses in the datasets and unit tests.
>>                   >
>>                   >         Josef
>>                   >
>>                   >
>>                   >
>>                   >
>>                   >             Stéfan
>>                   >
>>                   >             _______________________________________________
>>                   >             NumPy-Discussion mailing list
>>                   >             NumPy-Discussion at python.org
>>                   >             https://mail.python.org/mailman/listinfo/numpy-discussion