<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <span dir="ltr"><<a href="mailto:allanhaldane@gmail.com" target="_blank">allanhaldane@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">On 01/30/2018 04:54 PM, <a href="mailto:josef.pktd@gmail.com">josef.pktd@gmail.com</a> wrote:<br>

><br>

><br>

> On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <<a href="mailto:allanhaldane@gmail.com">allanhaldane@gmail.com</a><br>

</span><span class="gmail-">> <mailto:<a href="mailto:allanhaldane@gmail.com">allanhaldane@gmail.com</a><wbr>>> wrote:<br>

><br>

>     On 01/30/2018 01:33 PM, <a href="mailto:josef.pktd@gmail.com">josef.pktd@gmail.com</a><br>

</span><div><div class="gmail-h5">>     <mailto:<a href="mailto:josef.pktd@gmail.com">josef.pktd@gmail.com</a>> wrote:<br>

>     > AFAICS, one problem is that the padded view didn't come with the<br>

>     > matching downstream usage support, the pack function as mentioned, an<br>

>     > alternative way to convert to a standard ndarray, copy doesn't get rid<br>

>     > of the padding and so on.<br>

>     ><br>

>     > eg. another mailing list thread I just found with the same problem<br>

>     > <a href="http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html" rel="noreferrer" target="_blank">http://numpy-discussion.10968.<wbr>n7.nabble.com/view-of-<wbr>recarray-issue-td32001.html</a><br>


>     ><br>

>     > quoting Ralf:<br>

>     > Question: is that really the recommended way to get an (N, 2) size float<br>

>     > array from two columns of a larger record array? If so, why isn't there<br>

>     > a better way? If you'd want to write to that (N, 2) array you have to<br>

>     > append a copy, making it even uglier. Also, then there really should be<br>

>     > tests for views in test_records.py.<br>

>     ><br>

>     ><br>

>     > This "better way" never showed up, AFAIK. And it looks like we came back<br>

>     > to this problem every few years.<br>

>     ><br>

>     > Josef<br>

><br>

>     Since we are at least pushing off this change to a later release<br>

>     (1.15?), we have some time to prepare/catch up.<br>

><br>

>     What can we add to numpy.lib.recfunctions to make the multi-field<br>

>     copy->view change smoother? We have discussed at least two functions:<br>

><br>

>      * repack_fields - rearrange the memory layout of a structured array to<br>

>     add/remove padding between fields<br>

><br>

>      * structured_to_unstructured - turns an n-D structured array into an<br>

>     (n+1)-D unstructured ndarray, whose dtype is the highest common type of<br>

>     all the fields. May want the inverse function too.<br>

><br>

><br>

> The only sticky point with statsmodels is to have an equivalent of<br>

> a[['b', 'c']].view(('f8', 2)).<br>

><br>

> Highest common dtype might be object; the main use case for this is to<br>

> select some elements of a specific dtype and then use them as a<br>

> standard, homogeneous ndarray. In our case and other cases that I have<br>

> seen, it is mainly to select a subset of the floating point numbers.<br>

> Another case of this might be to combine two strings into one, a[['b',<br>

> 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used<br>

> this in serious code.<br>

<br>

</div></div>I implemented and put up a draft of these functions in<br>

<a href="https://github.com/numpy/numpy/pull/10411" rel="noreferrer" target="_blank">https://github.com/numpy/<wbr>numpy/pull/10411</a></blockquote><div><br></div><div>Comments based on reading the last commit</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

<br>

I think they satisfy all your cases: code like<br>

<br>

    >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])<br>

    >>> a[['b', 'c']].view(('f8', 2))<br>

<br>

becomes:<br>

<br>

    >>> import numpy.lib.recfunctions as rf<br>

    >>> rf.structured_to_unstructured(<wbr>a[['b', 'c']])<br>

<span class="gmail-">    array([[1., 1.],<br>

           [1., 1.],<br>

</span>           [1., 1.]])<br>
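<div>A minimal sketch of how the padded selection relates to the two helpers discussed here. It assumes a NumPy release that ships `repack_fields` and `structured_to_unstructured` in `numpy.lib.recfunctions` under those names (they are still drafts in this thread) and in which multi-field indexing returns a padded view. The variable names are illustrative only.</div>

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Multi-field selection. Under the view semantics discussed in this thread,
# the result keeps the parent's record layout, so it carries padding.
sub = a[['b', 'c']]

# repack_fields returns a copy with the padding removed (16 bytes per record).
packed = rf.repack_fields(sub)

# The packed copy can be reinterpreted as plain floats again.
flat_from_packed = packed.view('f8').reshape(len(a), 2)

# structured_to_unstructured copes with the padding itself, so no repack
# step is needed on this route.
flat_direct = rf.structured_to_unstructured(sub)

print(packed.dtype.itemsize, flat_from_packed.shape, flat_direct.shape)
```

<div>Both routes should give the same (3, 2) float array; the repack route is mainly of interest when the structured layout itself has to be kept.</div>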

<br>

The highest common dtype is usually not "Object", since I use<br>

`np.result_type` to determine the output type. So two fields of 'S5' and<br>

'S3' result in an 'S5' array.<br>
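<div>As a quick check of the promotion rule described above, here is a small sketch that calls `np.result_type` directly on field dtypes. The choice of dtypes is illustrative, and it only shows the promotion step, not the draft function itself.</div>

```python
import numpy as np

# Two byte-string fields: the wider string type wins.
str_common = np.result_type(np.dtype('S5'), np.dtype('S3'))

# Mixed integer and float fields promote to the widest float.
num_common = np.result_type(np.dtype('i4'), np.dtype('f4'), np.dtype('f8'))

# An object field, if present, does force an object result.
obj_common = np.result_type(np.dtype('f8'), np.dtype('O'))

print(str_common, num_common, obj_common)
```

<div>So an object result should only appear when one of the selected fields is itself of object dtype.</div>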

<span class="gmail-"><br></span></blockquote><div><br></div><div>structured_to_unstructured  looks good to me<br></div><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">

<br>

><br>

> for inverse function: I guess it is still possible to view any standard<br>

> homogeneous ndarray with a structured dtype as long as the itemsize matches.<br>

<br>

</span>The inverse is implemented too. And it even supports varied field<br>

dtypes, nested fields, and subarrays, as you can see in the docstring<br>

examples.<br>
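<div>A hedged sketch of the two routes mentioned here for going from a plain array back to a structured one. It assumes the inverse function is available as `unstructured_to_structured` in `numpy.lib.recfunctions` (the name used in later NumPy releases; the draft may differ). The array and dtype are made up for illustration.</div>

```python
import numpy as np
import numpy.lib.recfunctions as rf

plain = np.arange(6, dtype='f8').reshape(3, 2)
dt = np.dtype([('b', 'f8'), ('c', 'f8')])

# recfunctions route: one field per entry along the last axis.
s = rf.unstructured_to_structured(plain, dt)

# View route: works here because the 16-byte record matches two
# contiguous float64 values; the last axis collapses to length 1.
v = plain.view(dt)

# Round trip back to a plain 2-D array.
back = rf.structured_to_unstructured(s)

print(s.shape, v.shape, back.shape)
```

<div>The view route depends on the itemsize and memory layout matching, while the function route also covers fields of different dtypes.</div>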

<span class="gmail-"><br>

<br>

> Browsing through old mailing list threads, I saw that adding multiple<br>

> fields or concatenating two arrays with structured dtypes into an array<br>

> with a single combined dtype was missing and I guess still is. (IIRC<br>

> this is the use case where we now take the pandas detour in statsmodels.)<br>

><br>

>     We might also consider<br>

><br>

>      * apply_along_fields(arr, method) - applies the method along the<br>

>     "field" axis, equivalent to something like<br>

>     method(structured_to_unstructured(<wbr>arr), axis=-1)<br>

><br>

><br>

> If this works on a padded view of an existing array, then this would be<br>

> an improvement over the current version of having to extract and copy<br>

> the relevant fields of an existing structured dtype or loop over<br>

> different numeric dtypes, ints, floats.<br>

><br>

> In general there will need to be a way to apply `method` only to<br>

> selected columns, or columns of a matching dtype. (e.g. We don't want<br>

> the sum or mean of a string.)<br>

> (e.g. we use ptp() on numeric fields to check if there is already a<br>

> constant column in the array or dataframe)<br>

<br>

</span>Means over selected columns are accounted for using multi-field<br>

indexing. For example:<br>

<br>

    >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],<br>

    ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])<br>

<br>

    >>> rf.apply_along_fields(np.mean, b)<br>

    array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])<br>

<br>

    >>> rf.apply_along_fields(np.mean, b[['x', 'z']])<br>

    array([ 3. ,  5.5,  9. , 11. ])<br></blockquote><div><br></div><div>Actually, I would have expected apply_along_columns, i.e. a reduction over all observations for each field.</div><div>This might need an axis argument.</div><div><br></div><div>However, in its current form it is less practical than doing it ourselves with structured_to_unstructured, because each call makes a copy of all elements.</div><div><br></div><div>e.g.</div><div>


<span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><span> </span>rf.apply_along_fields(np.mean, b[['x', 'z']])</span>


<br></div><div>


<span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><span> </span>rf.apply_along_fields(np.std, b[['x', 'z']])</span>


<br></div><div><br></div><div>would do the same <span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">structured_to_unstructured copy of all array elements twice.</span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Josef</span></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
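<div>A sketch of the alternative raised just above this quote: convert the selected fields once and reuse the plain array for several statistics, including per-field reductions along axis 0. It assumes `structured_to_unstructured` is available in `numpy.lib.recfunctions`; the result names are illustrative.</div>

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# Convert the selected fields once, then reuse the plain 2-D array.
xz = rf.structured_to_unstructured(b[['x', 'z']])

row_mean = xz.mean(axis=-1)   # per record, like apply_along_fields
row_std = xz.std(axis=-1)     # second statistic, no second conversion
col_mean = xz.mean(axis=0)    # per field, over all observations
col_ptp = np.ptp(xz, axis=0)  # range per field; 0 would flag a constant column

print(row_mean, row_std, col_mean, col_ptp)
```

<div>Whether an axis argument on apply_along_fields would be preferable to this manual route is left open here.</div>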

<br>

<br>

This is unaffected by the 1.14 to 1.15 changes.<br>

<br>

Allan<br>

<span class="gmail-"><br>

><br>

>  <br>

><br>

><br>

><br>

>     I think these are pretty minimal and shouldn't be too hard to implement.<br>

><br>

><br>

> AFAICS, it would cover the statsmodels usage.<br>

><br>

><br>

> Josef<br>

><br>

>  <br>

><br>

><br>

>     Allan<br>

>     ______________________________<wbr>_________________<br>

>     NumPy-Discussion mailing list<br>

</span>>     <a href="mailto:NumPy-Discussion@python.org">NumPy-Discussion@python.org</a> <mailto:<a href="mailto:NumPy-Discussion@python.org">NumPy-Discussion@<wbr>python.org</a>><br>

>     <a href="https://mail.python.org/mailman/listinfo/numpy-discussion" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/numpy-<wbr>discussion</a><br>

>     <<a href="https://mail.python.org/mailman/listinfo/numpy-discussion" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/numpy-<wbr>discussion</a>><br>

><br>

><br>

><br>

><br>

<div class="gmail-HOEnZb"><div class="gmail-h5">

><br>

<br>


</div></div></blockquote></div><br></div></div>