Re: [Numpy-discussion] Setting custom dtypes and 1.14

31 Jan 2018

On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane wrote:
On 01/30/2018 04:54 PM, josef.pktd@gmail.com wrote:
On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/30/2018 01:33 PM, josef.pktd@gmail.com wrote:
> AFAICS, one problem is that the padded view didn't come with the
> matching down stream usage support, the pack function as mentioned, an
> alternative way to convert to a standard ndarray, copy doesn't get rid
> of the padding and so on.
>
> eg. another mailing list thread I just found with the same problem
> http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
>
> quoting Ralf:
> Question: is that really the recommended way to get an (N, 2) size float
> array from two columns of a larger record array? If so, why isn't there
> a better way? If you'd want to write to that (N, 2) array you have to
> append a copy, making it even uglier. Also, then there really should be
> tests for views in test_records.py.
>
> This "better way" never showed up, AFAIK. And it looks like we came back
> to this problem every few years.
>
> Josef
Since we are at least pushing off this change to a later release
(1.15?), we have some time to prepare/catch up.
What can we add to numpy.lib.recfunctions to make the multi-field
copy->view change smoother? We have discussed at least two functions:
* repack_fields - rearrange the memory layout of a structured array to
  add/remove padding between fields
* structured_to_unstructured - turns an n-D structured array into an
  (n+1)-D unstructured ndarray, whose dtype is the highest common type of
  all the fields. May want the inverse function too.
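
To make the padding issue behind repack_fields concrete, here is a minimal sketch. It assumes NumPy 1.16 or later, where repack_fields is available in numpy.lib.recfunctions and multi-field indexing returns a padded view; the variable names are illustrative only.

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Multi-field indexing returns a view that keeps the original offsets,
# so the bytes of the skipped field 'a' remain as padding.
sub = a[['b', 'c']]

# repack_fields copies the selected fields into a layout without padding.
packed = rf.repack_fields(sub)

# Itemsize of the padded view vs. the packed copy
# (24 vs. 16 bytes on NumPy >= 1.16).
print(sub.dtype.itemsize, packed.dtype.itemsize)

# Once packed, the itemsize matches two float64 values, so the old
# view idiom applies again.
plain = packed.view(('f8', 2))
print(plain.shape)
```

The packed copy is what makes the (N, 2) view legal: a view requires the itemsize of the structured dtype to equal the itemsize of the target dtype.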
The only sticky point with statsmodels is to have an equivalent of
a[['b', 'c']].view(('f8', 2)).
The highest common dtype might be object; the main use case for this is to
select some elements of a specific dtype and then use them as a
standard, homogeneous ndarray. In our case and other cases that I have
seen it is mainly to select a subset of the floating point numbers.
Another case of this might be to combine two strings into one,
a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used
this in serious code.
I implemented and put up a draft of these functions in
https://github.com/numpy/numpy/pull/10411
Comments based on reading the last commit
I think they satisfy all your cases: code like
    >>> import numpy as np
    >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
    >>> a[['b', 'c']].view(('f8', 2))
becomes:
    >>> import numpy.lib.recfunctions as rf
    >>> rf.structured_to_unstructured(a[['b', 'c']])
    array([[1., 1.],
           [1., 1.],
           [1., 1.]])
The highest common dtype is usually not "Object", since I use
`np.result_type` to determine the output type. So two fields of 'S5' and
'S3' result in an 'S5' array.
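
The promotion rule mentioned above can be checked directly with np.result_type. This is only a small sketch of the type promotion itself, not of the draft function's internals; the variable names are made up for illustration.

```python
import numpy as np

# Two fixed-width byte-string dtypes promote to the wider one, not to object.
string_common = np.result_type(np.dtype('S5'), np.dtype('S3'))

# Mixed integer and float fields promote to the widest float involved.
numeric_common = np.result_type(np.dtype('i4'), np.dtype('f4'), np.dtype('f8'))

print(string_common, numeric_common)
```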
structured_to_unstructured looks good to me.
For the inverse function: I guess it is still possible to view any standard
homogeneous ndarray with a structured dtype as long as the itemsize
matches.
The inverse is implemented too. And it even supports varied field
dtypes, nested fields, and subarrays, as you can see in the docstring
examples.
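
A minimal sketch of the round trip, assuming the inverse is available as unstructured_to_structured in numpy.lib.recfunctions (the name it has in NumPy 1.16 and later). The arrays and dtypes below are illustrative, not taken from the PR's docstrings.

```python
import numpy as np
import numpy.lib.recfunctions as rf

plain = np.arange(6, dtype='f8').reshape(3, 2)

# Same field dtypes: the last axis of length 2 becomes the fields 'b' and 'c'.
dt = np.dtype([('b', 'f8'), ('c', 'f8')])
s = rf.unstructured_to_structured(plain, dtype=dt)

# Converting back recovers the original (3, 2) float array.
back = rf.structured_to_unstructured(s)

# Varied field dtypes: each column is cast to the dtype of its field
# (the default casting is 'unsafe', so float64 -> int32 is allowed).
mixed_dt = np.dtype([('x', 'i4'), ('y', 'f8')])
mixed = rf.unstructured_to_structured(plain, dtype=mixed_dt)

print(s.dtype, s.shape)
print(mixed['x'], mixed['y'])
```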
Browsing through old mailing list threads, I saw that adding multiple
fields or concatenating two arrays with structured dtypes into an array
with a single combined dtype was missing and I guess still is. (IIRC
this is the use case where we now take the pandas detour in statsmodels.)
We might also consider
* apply_along_fields(arr, method) - applies the method along the
  "field" axis, equivalent to something like
  method(structured_to_unstructured(arr), axis=-1)
If this works on a padded view of an existing array, then this would be
an improvement over the current version of having to extract and copy
the relevant fields of an existing structured dtype or loop over
different numeric dtypes, ints, floats.
In general there will need to be a way to apply `method` only to
selected columns, or columns of a matching dtype. (e.g. We don't want
the sum or mean of a string.)
(e.g. we use ptp() on numeric fields to check if there is already a
constant column in the array or dataframe)
Means over selected columns are accounted for using multi-field
indexing. For example:
    >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
    ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
    >>> rf.apply_along_fields(np.mean, b)
    array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])
    >>> rf.apply_along_fields(np.mean, b[['x', 'z']])
    array([ 3. ,  5.5,  9. , 11. ])
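
For the concern above about applying a method only to numeric fields, and the ptp() check for a constant column, one possible sketch is to pick the fields by dtype kind before converting. It assumes NumPy 1.16 or later; the field names 'label' and 'const' and the sample data are hypothetical.

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2.0, b'u', 1.0), (4, 5.0, b'v', 1.0), (7, 8.0, b'w', 1.0)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('label', 'S1'), ('const', 'f8')])

# Keep only signed/unsigned integer and floating-point fields,
# using the one-letter dtype.kind codes.
numeric = [name for name in b.dtype.names if b.dtype[name].kind in 'iuf']

# One homogeneous 2-D array with a column per selected field.
values = rf.structured_to_unstructured(b[numeric])

# Peak-to-peak range over the observations of each field;
# a range of zero flags a constant column.
ranges = np.ptp(values, axis=0)
constant_fields = [name for name, r in zip(numeric, ranges) if r == 0]

print(numeric, constant_fields)
```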
Actually, I would have expected apply_along_columns, i.e. reduce over all
observations for each field.
This might need an axis argument.

However, in the current form it is less practical than doing it ourselves
with structured_to_unstructured because it makes a copy each time of all
elements.

e.g.
 rf.apply_along_fields(np.mean, b[['x', 'z']])
 rf.apply_along_fields(np.std, b[['x', 'z']])

would do the same structured_to_unstructured copy of all array elements
twice.
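
A sketch of the do-it-ourselves variant described above: convert the selected fields once, then reuse the resulting array for several reductions. It assumes structured_to_unstructured as available in NumPy 1.16 or later; the variable names are illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# Convert the selected fields a single time.
xz = rf.structured_to_unstructured(b[['x', 'z']])

# Reduce along the field axis (one value per observation), reusing xz.
row_mean = xz.mean(axis=-1)
row_std = xz.std(axis=-1)

# Reduce over the observations instead (one value per field), which is
# the "apply along columns" direction mentioned above.
col_mean = xz.mean(axis=0)

print(row_mean, row_std, col_mean)
```

Working on the plain ndarray also gives the axis argument for free, since any NumPy reduction accepts it.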

Josef
This is unaffected by the 1.14 to 1.15 changes.
Allan
I think these are pretty minimal and shouldn't be too hard to
implement.
AFAICS, it would cover the statsmodels usage.
Josef
Allan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

josef.pktd＠gmail.com