[Numpy-discussion] Setting custom dtypes and 1.14

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Jan 30 22:09:11 EST 2018


On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <allanhaldane at gmail.com>
wrote:

> On 01/30/2018 04:54 PM, josef.pktd at gmail.com wrote:
> >
> >
> > On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane at gmail.com>
> > wrote:
> >
> >     On 01/30/2018 01:33 PM, josef.pktd at gmail.com wrote:
> >     > AFAICS, one problem is that the padded view didn't come with the
> >     > matching downstream usage support: there is no pack function, as
> >     > mentioned, no alternative way to convert to a standard ndarray,
> >     > copy doesn't get rid of the padding, and so on.
> >     >
> >     > e.g. another mailing list thread I just found with the same problem:
> >     > http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
> >     >
> >     > quoting Ralf:
> >     > Question: is that really the recommended way to get an (N, 2) size
> >     > float array from two columns of a larger record array? If so, why
> >     > isn't there a better way? If you'd want to write to that (N, 2) array
> >     > you have to append a copy, making it even uglier. Also, then there
> >     > really should be tests for views in test_records.py.
> >     >
> >     >
> >     > This "better way" never showed up, AFAIK. And it looks like we come
> >     > back to this problem every few years.
> >     >
> >     > Josef
> >
> >     Since we are at least pushing off this change to a later release
> >     (1.15?), we have some time to prepare/catch up.
> >
> >     What can we add to numpy.lib.recfunctions to make the multi-field
> >     copy->view change smoother? We have discussed at least two functions:
> >
> >      * repack_fields - rearrange the memory layout of a structured array
> >     to add/remove padding between fields
> >
> >      * structured_to_unstructured - turns an n-D structured array into an
> >     (n+1)-D unstructured ndarray, whose dtype is the highest common type
> >     of all the fields. May want the inverse function too.
> >
> >
> > The only sticking point with statsmodels is to have an equivalent of
> > a[['b', 'c']].view(('f8', 2)).
> >
> > The highest common dtype might be object. The main use case for this is
> > to select some elements of a specific dtype and then use them as a
> > standard, homogeneous ndarray. In our case and other cases that I have
> > seen, it is mainly to select a subset of the floating point numbers.
> > Another case of this might be to combine two strings into one,
> > a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I
> > have used this in serious code.
>
> I implemented and put up a draft of these functions in
> https://github.com/numpy/numpy/pull/10411


Comments based on reading the last commit


>
>
> I think they satisfy all your cases: code like
>
>     >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
>     >>> a[['b', 'c']].view(('f8', 2))
>
> becomes:
>
>     >>> import numpy.lib.recfunctions as rf
>     >>> rf.structured_to_unstructured(a[['b', 'c']])
>     array([[1., 1.],
>            [1., 1.],
>            [1., 1.]])
>
> The highest common dtype is usually not "Object", since I use
> `np.result_type` to determine the output type. So two fields of 'S5' and
> 'S3' result in an 'S5' array.
>
>
structured_to_unstructured  looks good to me
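
For the statsmodels case, a minimal sketch of how I would expect to use it,
assuming the function keeps the name and location from the draft PR
(numpy.lib.recfunctions.structured_to_unstructured). The array and field
names are the ones from the example above:

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Select two float fields and convert them to a plain (N, 2) float array.
# This stands in for the old a[['b', 'c']].view(('f8', 2)) idiom.
sub = rf.structured_to_unstructured(a[['b', 'c']])

print(sub.shape, sub.dtype)
```

The point for us is that the result is a standard homogeneous ndarray, so
downstream code does not need to know about padding in the selection.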



>
> >
> > for inverse function: I guess it is still possible to view any standard
> > homogeneous ndarray with a structured dtype as long as the itemsize
> > matches.
>
> The inverse is implemented too. And it even supports varied field
> dtypes, nested fields, and subarrays, as you can see in the docstring
> examples.
>
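
A small sketch of the round trip in the other direction, assuming the inverse
is exposed as rf.unstructured_to_structured and takes the target dtype as its
second argument (that name and call form are my assumption, they are not
spelled out in this thread):

```python
import numpy as np
import numpy.lib.recfunctions as rf

# A plain (N, 2) float array, e.g. the result of some computation.
plain = np.arange(6.0).reshape(3, 2)

# Target structured dtype: one field per column of the last axis.
dt = np.dtype([('b', 'f8'), ('c', 'f8')])

# Assumed inverse of structured_to_unstructured.
s = rf.unstructured_to_structured(plain, dt)

print(s.dtype.names, s['b'], s['c'])
```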
>
> > Browsing through old mailing list threads, I saw that adding multiple
> > fields or concatenating two arrays with structured dtypes into an array
> > with a single combined dtype was missing and I guess still is. (IIRC
> > this is the use case where we now take the pandas detour in statsmodels.)
> >
> >     We might also consider
> >
> >      * apply_along_fields(arr, method) - applies the method along the
> >     "field" axis, equivalent to something like
> >     method(structured_to_unstructured(arr), axis=-1)
> >
> >
> > If this works on a padded view of an existing array, then this would be
> > an improvement over the current version of having to extract and copy
> > the relevant fields of an existing structured dtype or loop over
> > different numeric dtypes, ints, floats.
> >
> > In general there will need to be a way to apply `method` only to
> > selected columns, or columns of a matching dtype. (e.g. We don't want
> > the sum or mean of a string.)
> > (e.g. we use ptp() on numeric fields to check if there is already a
> > constant column in the array or dataframe)
>
> Means over selected columns are accounted for using multi-field
> indexing. For example:
>
>     >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
>     ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
>
>     >>> rf.apply_along_fields(np.mean, b)
>     array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])
>
>     >>> rf.apply_along_fields(np.mean, b[['x', 'z']])
>     array([ 3. ,  5.5,  9. , 11. ])
>

Actually, I would have expected apply_along_columns, i.e. a reduction over
all observations for each field.
This might need an axis argument.
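
To illustrate what I mean by reducing over observations, here is a rough
sketch. apply_along_columns is a hypothetical helper that does not exist in
the PR or in numpy; it simply loops over the field names and returns one
result per field:

```python
import numpy as np

def apply_along_columns(func, arr):
    """Hypothetical helper: apply func to each field over all observations."""
    # arr[name] is a view of a single field, so no copy of the array is made.
    return {name: func(arr[name]) for name in arr.dtype.names}

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

col_means = apply_along_columns(np.mean, b)
col_ptp = apply_along_columns(np.ptp, b)   # e.g. to detect a constant column

print(col_means)
print(col_ptp)
```

Returning a dict keeps each field's own dtype. A real implementation would
probably also need a way to skip non-numeric fields.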

However, in its current form it is less practical than doing it ourselves
with structured_to_unstructured, because each call makes a copy of all
elements.

e.g.
 rf.apply_along_fields(np.mean, b[['x', 'z']])
 rf.apply_along_fields(np.std, b[['x', 'z']])

would do the same structured_to_unstructured copy of all array elements
twice.
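
What I would do instead is convert once and reuse the plain array. A minimal
sketch, again assuming rf.structured_to_unstructured as in the draft PR, with
b being the example array from above:

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# Convert the selected fields once ...
xz = rf.structured_to_unstructured(b[['x', 'z']])

# ... and reuse the same plain array for every statistic.
row_mean = xz.mean(axis=-1)
row_std = xz.std(axis=-1)

print(row_mean)
print(row_std)
```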

Josef



>
>
> This is unaffected by the 1.14 to 1.15 changes.
>
> Allan
>
> >
> >
> >
> >
> >
> >     I think these are pretty minimal and shouldn't be too hard to
> >     implement.
> >
> >
> > AFAICS, it would cover the statsmodels usage.
> >
> >
> > Josef
> >
> >
> >
> >
> >     Allan
> >     _______________________________________________
> >     NumPy-Discussion mailing list
> >     NumPy-Discussion at python.org
> >     https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> >
> >
> >