On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/30/2018 01:33 PM, josef.pktd@gmail.com wrote:
> AFAICS, one problem is that the padded view didn't come with the
> matching down stream usage support, the pack function as mentioned, an
> alternative way to convert to a standard ndarray, copy doesn't get rid
> of the padding and so on.
> eg. another mailing list thread I just found with the same problem
> http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
> quoting Ralf:
> Question: is that really the recommended way to get an (N, 2) size float
> array from two columns of a larger record array? If so, why isn't there
> a better way? If you'd want to write to that (N, 2) array you have to
> append a copy, making it even uglier. Also, then there really should be
> tests for views in test_records.py.
> This "better way" never showed up, AFAIK. And it looks like we came back
> to this problem every few years.
> Josef

Since we are at least pushing off this change to a later release
(1.15?), we have some time to prepare/catch up.

What can we add to numpy.lib.recfunctions to make the multi-field
copy->view change smoother? We have discussed at least two functions:

 * repack_fields - rearrange the memory layout of a structured array to
add/remove padding between fields

 * structured_to_unstructured - turns an n-D structured array into an
(n+1)-D unstructured ndarray, whose dtype is the highest common type of
all the fields. May want the inverse function too.
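
A rough sketch of what these two helpers might do, to make the discussion concrete. The names ending in _sketch are hypothetical stand-ins and not an existing or final API, the padded dtype is built by hand to mimic a multi-field view, and only scalar (non-nested, non-subarray) fields are assumed:

```python
import numpy as np


def repack_fields_sketch(arr):
    """Copy `arr` into a packed layout with no padding between fields."""
    # A dtype built from a plain list of (name, dtype) pairs is packed.
    packed = np.dtype([(name, arr.dtype[name]) for name in arr.dtype.names])
    out = np.empty(arr.shape, dtype=packed)
    for name in arr.dtype.names:
        out[name] = arr[name]
    return out


def structured_to_unstructured_sketch(arr):
    """Return an (n+1)-D ndarray whose last axis runs over the fields."""
    names = arr.dtype.names
    # Highest common type of all the fields.
    common = np.result_type(*[arr.dtype[name] for name in names])
    return np.stack([arr[name].astype(common) for name in names], axis=-1)


# Hand-built padded dtype: two f8 fields inside a 20-byte record,
# similar to what a view of fields 'b' and 'c' of an (i4, f8, f8) array has.
padded = np.dtype({"names": ["b", "c"], "formats": ["f8", "f8"],
                   "offsets": [4, 12], "itemsize": 20})
view_like = np.zeros(3, dtype=padded)
view_like["b"] = [1.0, 2.0, 3.0]
view_like["c"] = [4.0, 5.0, 6.0]

packed = repack_fields_sketch(view_like)
plain = structured_to_unstructured_sketch(view_like)
```

Both helpers copy the data. Whether the real functions could return views in some cases is left open here.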

The only sticky point with statsmodels is to have an equivalent of
a[['b', 'c']].view(('f8', 2)).

The highest common dtype might be object. The main use case for this is to select some elements of a specific dtype and then use them as a standard, homogeneous ndarray. In our case, and in other cases that I have seen, it is mainly to select a subset of the floating point numbers.
Another case might be to combine two strings into one, a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I have used this in serious code.
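
For illustration, the float-subset use case could be handled without relying on .view() by picking fields by dtype kind. This is only a sketch: float_fields_as_ndarray is a made-up helper name, the array is assumed to be 1-D with scalar fields, and the result is a copy, not a view:

```python
import numpy as np


def float_fields_as_ndarray(arr):
    """Return (names, 2-D float array) for the floating point fields of `arr`."""
    # dtype.kind == "f" selects floating point fields only.
    names = [n for n in arr.dtype.names if arr.dtype[n].kind == "f"]
    # column_stack copies each selected field into one column.
    return names, np.column_stack([arr[n] for n in names])


a = np.array([(1, 1.5, 2.5), (2, 3.5, 4.5)],
             dtype=[("a", "i4"), ("b", "f8"), ("c", "f8")])
names, floats = float_fields_as_ndarray(a)
```

Because the result is a copy, writing to `floats` does not change `a`, which is the same limitation Ralf pointed out in the quoted thread.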

For the inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.
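
A small example of that inverse direction, assuming a C-contiguous float64 array whose last axis has exactly as many bytes as the structured dtype's itemsize:

```python
import numpy as np

x = np.arange(6, dtype="f8").reshape(3, 2)   # rows are (b, c) pairs
pair = np.dtype([("b", "f8"), ("c", "f8")])  # packed, itemsize 16 = 2 * 8

# The last axis (2 * 8 bytes) collapses into one 16-byte record per row,
# giving shape (3, 1); reshape drops the leftover length-1 axis.
s = x.view(pair).reshape(-1)

# It is a view: writing through the structured array changes x.
s["b"][0] = 10.0
```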

Browsing through old mailing list threads, I saw that adding multiple fields, or concatenating two arrays with structured dtypes into an array with a single combined dtype, was missing, and I guess it still is. (IIRC this is the use case where we currently take the pandas detour in statsmodels.)
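
Something like the following is what I have in mind for combining two structured arrays field-wise. combine_fields_sketch is a hypothetical name, and the sketch assumes equal shapes, distinct field names, and scalar fields:

```python
import numpy as np


def combine_fields_sketch(left, right):
    """Return a new structured array holding the fields of both inputs."""
    if left.shape != right.shape:
        raise ValueError("arrays must have the same shape")
    overlap = set(left.dtype.names) & set(right.dtype.names)
    if overlap:
        raise ValueError("duplicate field names: %s" % sorted(overlap))
    # Packed dtype listing the fields of `left` followed by those of `right`.
    combined = np.dtype(
        [(n, left.dtype[n]) for n in left.dtype.names]
        + [(n, right.dtype[n]) for n in right.dtype.names]
    )
    out = np.empty(left.shape, dtype=combined)
    for src in (left, right):
        for n in src.dtype.names:
            out[n] = src[n]
    return out


left = np.array([(1, 1.5), (2, 2.5)], dtype=[("a", "i4"), ("b", "f8")])
right = np.array([(b"x", 3.0), (b"y", 4.0)], dtype=[("c", "S3"), ("d", "f8")])
both = combine_fields_sketch(left, right)
```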


We might also consider

 * apply_along_fields(arr, method) - applies the method along the
"field" axis, equivalent to something like
method(structured_to_unstructured(arr), axis=-1)

If this works on a padded view of an existing array, then this would be an improvement over the current approach of having to extract and copy the relevant fields of an existing structured dtype, or loop over the different numeric dtypes (ints, floats).

In general there will need to be a way to apply `method` only to selected columns, or to columns of a matching dtype (e.g. we don't want the sum or mean of a string).
(E.g. we use ptp() on numeric fields to check whether there is already a constant column in the array or dataframe.)
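
A sketch of how the dtype-based selection could look. The function name and the `kinds` and `axis` parameters are assumptions for illustration, not a proposed signature, and the selected fields are copied into a common dtype first:

```python
import numpy as np


def apply_along_numeric_fields(arr, method, axis=-1, kinds="iuf"):
    """Apply `method` to the fields of `arr` whose dtype kind is in `kinds`."""
    # "i", "u", "f" are the signed int, unsigned int and float kinds.
    names = [n for n in arr.dtype.names if arr.dtype[n].kind in kinds]
    common = np.result_type(*[arr.dtype[n] for n in names])
    # The last axis of `block` runs over the selected fields.
    block = np.stack([arr[n].astype(common) for n in names], axis=-1)
    return names, method(block, axis=axis)


arr = np.array([(b"a", 1.0, 1), (b"b", 2.0, 1), (b"c", 4.0, 1)],
               dtype=[("name", "S5"), ("x", "f8"), ("const", "i4")])

# Mean across the numeric fields of each record (the "field" axis).
names, row_means = apply_along_numeric_fields(arr, np.mean)

# ptp down each numeric column; a range of 0 marks a constant column.
_, ranges = apply_along_numeric_fields(arr, np.ptp, axis=0)
constant = [n for n, r in zip(names, ranges) if r == 0]
```

The string field is skipped by the kind filter, so neither call tries to take the mean or range of a string.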


I think these are pretty minimal and shouldn't be too hard to implement.

AFAICS, it would cover the statsmodels usage.



NumPy-Discussion mailing list