[Numpy-discussion] What to do about structured string dtype and string regression?

Wed Feb 17 11:20:12 EST 2021

On Wed, 2021-02-17 at 11:15 +0100, Ralf Gommers wrote:
> On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <shoyer at gmail.com>
> wrote:
> 
> > On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg
> > <sebastian at sipsolutions.net> wrote:
> > 
> > > Hi all,
> > > 
> > > In https://github.com/numpy/numpy/issues/18407 it was reported
> > > that
> > > there is a regression for `np.array()` and friends in NumPy 1.20
> > > for
> > > code such as:
> > > 
> > >     np.array(["1234"], dtype=("U1", 4))
> > >     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
> > >     # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')
> > > 
> > > 
> > > The Basics
> > > ----------
> > > 
> > > This happens when you ask for a rare "subarray" dtype. Ways to
> > > create one are:
> > > 
> > >     np.dtype(("U1", 4))
> > >     np.dtype("(4)U1,")  # (no field, only a subarray)
> > > 
> > > Both give the same subarray dtype: a "U1" dtype with shape (4,).
> > > One thing to know about these dtypes is that they cannot be
> > > attached to an array:
> > > 
> > >     np.zeros(3, dtype="(4)U1,").dtype == "U1"
> > >     np.zeros(3, dtype="(4)U1,").shape == (3, 4)
> > > 
> > > I.e. the shape is moved/added into the array itself (instead of
> > > remaining part of the dtype).
> > > 
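
A minimal sketch of the point above, for anyone who wants to poke at it.
It uses only the tuple spelling of the dtype; the variable names (sub,
arr, rec) are purely illustrative:

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,).
sub = np.dtype(("U1", 4))
base, sub_shape = sub.subdtype

# Creating an array moves the subarray shape into the array shape,
# so the array itself only carries the base dtype.
arr = np.zeros(3, dtype=sub)
print(arr.dtype, arr.shape)

# Inside a structured dtype the subarray stays part of the field.
rec = np.zeros(3, dtype=[("f0", "U1", (4,))])
print(rec.dtype["f0"].shape, rec["f0"].shape)
```

Only as a field of a structured dtype does the subarray shape survive on
the dtype; accessing the field then shows the extra dimension.
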
> > > The Change
> > > ----------
> > > 
> > > Now what changed, and why?  When filling subarray dtypes, NumPy
> > > normally fills every subarray element with the same input. For
> > > input like the above, NumPy will in most cases give the 1.20
> > > result, because it assigns "1234" to every subarray element
> > > individually; maybe confusingly, this truncates so that only the
> > > "1" is actually assigned. We can prove it with a structured
> > > dtype (same result in 1.19 and 1.20):
> > > 
> > >     >>> np.array(["1234"], dtype="(4)U1,i")
> > >     array([(['1', '1', '1', '1'], 1234)],
> > >           dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
> > > 
> > > Another, weirder case which changed (more obviously for the
> > > better) is:
> > > 
> > >     >>> np.array("1234", dtype="(4)U1,")
> > >     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
> > >     # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')
> > > 
> > > And, to point it out, we can have subarrays that are not 1-D:
> > > 
> > >     >>> np.array(["12"], dtype="(2,2)U1,")
> > >     array([[['1', '1'],
> > >         ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 all are '1'
> > > 
> > > 
> > > The Cause
> > > ---------
> > > 
> > > The cause of the 1.19 behaviour is two-fold:
> > > 
> > > 1. The "subarray" part of the dtype is moved into the array after
> > > the dimensions of the input have been found. At this point
> > > strings are always considered "scalars".  In most of the above
> > > examples, the new array shape is (1,)+(4,).
> > > 
> > > 2. When filling the new array with values, it now has an
> > > _additional_ dimension!  Because of this, the string is now
> > > suddenly considered a sequence, so it behaves the same as
> > > `list("1234")`, although normally NumPy would never consider a
> > > string a sequence.
> > > 
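
To make the two steps concrete, here is a toy pure-Python model. It is
not NumPy's actual code: discover_shape, fill_119 and fill_120 are made-up
names, "U1" truncation is modelled as taking the first character, and the
model only tries to match the list-input examples above (it does not
reproduce the odd 0-D `np.array("1234", ...)` result from 1.19):

```python
def discover_shape(obj):
    """Step 1: lists add dimensions, strings count as scalars."""
    if isinstance(obj, list):
        inner = discover_shape(obj[0]) if obj else ()
        return (len(obj),) + inner
    return ()


def fill_119(obj, shape):
    """Step 2, 1.19 style: recurse over all dimensions, so a string that
    lands on a leftover (subarray) dimension is iterated like a list."""
    if not shape:
        return obj[:1]  # "U1" keeps only the first character
    items = list(obj)
    if len(items) == 1:
        items = items * shape[0]  # broadcast a length-1 sequence
    return [fill_119(item, shape[1:]) for item in items]


def fill_120(obj, outer_shape, sub_shape):
    """1.20 style: only the discovered dimensions are iterated; each
    element is truncated once and repeated over the whole subarray."""
    if outer_shape:
        return [fill_120(item, outer_shape[1:], sub_shape) for item in obj]
    value = obj[:1]
    for n in reversed(sub_shape):
        value = [value] * n
    return value


outer = discover_shape(["1234"])            # strings are scalars here
old_1d = fill_119(["1234"], outer + (4,))   # string iterated per character
new_1d = fill_120(["1234"], outer, (4,))    # same character everywhere
old_2d = fill_119(["12"], discover_shape(["12"]) + (2, 2))
new_2d = fill_120(["12"], discover_shape(["12"]), (2, 2))
print(old_1d, new_1d, old_2d, new_2d)
```

The difference is only in which dimensions the fill step is allowed to
iterate over: all of them (1.19) or just the ones found in step 1 (1.20).
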
> > > 
> > > The Solution?
> > > -------------
> > > 
> > > I honestly don't have one.  We can consider strings as sequences
> > > in
> > > this weird special case.  That will probably create other weird
> > > special
> > > cases, but they would be even more hidden (I expect mainly odder
> > > things
> > > throwing an error).
> > > 
> > > Should we try to document this better in the release notes or can
> > > we
> > > think of some better (or at least louder) solution?
> > > 
> > 
> I was honestly surprised there's even such a thing as a "subarray
> data
> type", I've never seen it used in the wild. Looking at the release
> notes
> you already have,
>  
> https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes
> ,
> all I'm thinking is that no one should ever be writing code like
> that.
> 

Sure, if you look at the big picture it's arguably weird or even plain
wrong.  I guess the spelled out question here should have been:

    Does anyone think there is enough usage of this in the wild to
    worry about it?

Based on the responses so far, it seems not, and I hope not...

> 
> > There are way too many unsafe assumptions in this example. It's an
> > edge
> > case of an edge case.
> > 
> > I don't think we should be beholden to continuing to support this
> > behavior, which was obviously never anticipated. If there was a way
> > to
> > raise a warning or error in potentially ambiguous situations like
> > this, I
> > would support it.
> > 
> 

We could warn for all subarray dtypes (including a deprecation), but
that is probably too noisy.
We can probably flag subarray+strings and warn only in that case. Just
a full undo seems tricky.  What I mean is a warning like:

    Oops, string+subarray can lead to weird things and unfortunately
    a fix in behaviour means 1.20 may have a different result compared
    to <=1.19.x. (you are seeing the new behaviour, see release notes)

If that sounds useful, I can do it, but it will lead to an unavoidable
warning.
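
Roughly, the check I have in mind looks like the sketch below, written
at the Python level just for illustration. The real thing would have to
live inside the array-coercion code; warn_if_string_subarray and
contains_str are hypothetical names, and the choice of UserWarning and
the message wording are assumptions, not a decided API:

```python
import warnings

import numpy as np


def contains_str(obj):
    """Return True if obj is a string or a nested list/tuple holding one."""
    if isinstance(obj, str):
        return True
    if isinstance(obj, (list, tuple)):
        return any(contains_str(item) for item in obj)
    return False


def warn_if_string_subarray(obj, dtype):
    """Hypothetical helper: warn when string input meets a subarray dtype."""
    dt = np.dtype(dtype)
    risky = dt.subdtype is not None and contains_str(obj)
    if risky:
        warnings.warn(
            "string input with a subarray dtype: the result may differ "
            "between NumPy 1.20 and 1.19.x or earlier (see release notes)",
            UserWarning,
            stacklevel=2,
        )
    return risky


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_string_subarray(["1234"], ("U1", 4))   # warns
    not_flagged = warn_if_string_subarray(["1234"], "U4")    # silent
```

Note this only looks at a top-level subarray dtype; subarray fields
inside a structured dtype behave the same in 1.19 and 1.20 and are
deliberately not flagged.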

Cheers,

Sebastian

> +1
> 
> Ralf
