[Numpy-discussion] What to do about structured string dtype and string regression?

Tue Feb 16 20:13:29 EST 2021

On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <sebastian at sipsolutions.net>
wrote:

> Hi all,
>
> In https://github.com/numpy/numpy/issues/18407 it was reported that
> there is a regression for `np.array()` and friends in NumPy 1.20 for
> code such as:
>
>     np.array(["1234"], dtype=("U1", 4))
>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>     # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')
>
>
> The Basics
> ----------
>
> This happens when you ask for a rare "subarray" dtype. Ways to create
> one are:
>
>     np.dtype(("U1", 4))
>     np.dtype("(4)U1,")  # (does not have a field, only a subarray)
>
> Both give the same subarray dtype: a "U1" dtype with shape (4,).
> One thing to know about these dtypes is that they cannot be attached to
> an array:
>
>     np.zeros(3, dtype="(4)U1,").dtype == "U1"
>     np.zeros(3, dtype="(4)U1,").shape == (3, 4)
>
> I.e. the shape is moved/added into the array itself (instead of
> remaining part of the dtype).
>
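
For anyone who wants to check this locally, here is a minimal sketch of the point above. It only uses the tuple spelling of the dtype, and the variable names are mine, not from the issue:

```python
import numpy as np

# Subarray dtype: base "U1" with subarray shape (4,).
sub = np.dtype(("U1", 4))

# .subdtype is (base dtype, subarray shape) for a subarray dtype.
base, shape = sub.subdtype
print(base, shape)

# Creating an array moves the subarray shape into the array itself,
# so the resulting array carries a plain "U1" dtype.
arr = np.zeros(3, dtype=sub)
print(arr.dtype, arr.shape)
```

The array's own dtype has no `subdtype` left; the (4,) has become a trailing array dimension.
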
> The Change
> ----------
>
> Now what changed, and why?  When filling subarray dtypes, NumPy
> normally fills every element with the same input. For the example above,
> NumPy in most cases gives the 1.20 result because it assigns "1234" to
> every subarray element individually. Maybe confusingly, this truncates
> so that only the "1" is actually assigned. We can prove it with a
> structured dtype (same result in 1.19 and 1.20):
>
>     >>> np.array(["1234"], dtype="(4)U1,i")
>     array([(['1', '1', '1', '1'], 1234)],
>           dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
>
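
The truncation on its own can be seen without any subarray dtype. This is a small sketch with plain "U1" arrays, assuming the usual string cast behaviour; it is not the code path `np.array()` takes internally:

```python
import numpy as np

# Casting a too-long string to "U1" keeps only the first character.
cell = np.array("1234", dtype="U1")
print(cell)

# Assigning the same scalar string to four "U1" slots truncates each one.
slots = np.empty(4, dtype="U1")
slots[...] = "1234"
print(slots)
```
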
> Another, weirder case which changed (more obviously for the better) is:
>
>     >>> np.array("1234", dtype="(4)U1,")
>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>     # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')
>
> And, to point it out, we can have subarrays that are not 1-D:
>
>     >>> np.array(["12"], dtype="(2,2)U1,")
>     array([[['1', '1'],
>             ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 all are '1'
>
>
> The Cause
> ---------
>
> The cause of the 1.19 behaviour is two-fold:
>
> 1. The "subarray" part of the dtype is moved into the array after the
> dimension is found. At this point strings are always considered
> "scalars".  In most of the examples above, the new array shape is
> (1,)+(4,).
>
> 2. When filling the new array with values, it now has an _additional_
> dimension!  Because of this, the string is now suddenly considered a
> sequence, so it behaves the same as `list("1234")`.  Normally, though,
> NumPy would never consider a string a sequence.
>
>
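
To make the two steps concrete, here is a sketch of the "string is a scalar" rule during shape discovery, plus an explicit spelling that sidesteps the ambiguity. The explicit form is my suggestion, not something proposed in the issue:

```python
import numpy as np

# Shape discovery: a string counts as a scalar, not as a sequence.
print(np.array("1234").shape)
print(np.array(["1234"]).shape)

# Spelling the characters out as a real sequence gives one character
# per element, with no subarray dtype involved.
chars = np.array([list("1234")], dtype="U1")
print(chars.shape)
print(chars)
```

Code that relied on the 1.19 result could presumably switch to the `list(...)` form and get the same values.
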
> The Solution?
> -------------
>
> I honestly don't have one.  We can consider strings as sequences in
> this weird special case.  That will probably create other weird special
> cases, but they would be even more hidden (I expect mainly odder things
> throwing an error).
>
> Should we try to document this better in the release notes or can we
> think of some better (or at least louder) solution?
>

There are way too many unsafe assumptions in this example. It's an edge
case of an edge case.

I don't think we should be beholden to continuing to support this
behavior, which was obviously never anticipated. If there was a way to
raise a warning or error in potentially ambiguous situations like this, I
would support it.
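
As a rough user-side illustration of what "louder" could look like, here is a sketch of a wrapper that refuses the ambiguous combination outright. The function name is hypothetical and this is not an existing or proposed NumPy API. It only looks at a top-level subarray dtype, not at subarray fields inside a structured dtype:

```python
import numpy as np


def array_rejecting_string_subarrays(obj, dtype):
    """Hypothetical helper, not a NumPy API: fail loudly on the ambiguous case."""
    dt = np.dtype(dtype)
    # dt.subdtype is None unless dt is a subarray dtype such as ("U1", 4).
    if dt.subdtype is not None and dt.base.kind in "SU":
        raise ValueError(
            f"refusing string subarray dtype {dt!r}; "
            "pass an explicit sequence of characters with a plain string dtype"
        )
    return np.array(obj, dtype=dt)
```

With this, `array_rejecting_string_subarrays(["1234"], ("U1", 4))` raises, while `array_rejecting_string_subarrays([list("1234")], "U1")` goes through unchanged.
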

> Cheers,
>
> Sebastian