[Numpy-discussion] Py3 merge

Tue Dec 8 06:03:41 EST 2009

On Monday 07 December 2009 16:32:50 Pauli Virtanen wrote:
> On Mon, 2009-12-07 at 09:50 -0500, Michael Droettboom wrote:
> > Pauli Virtanen wrote:
> 
> [clip]
> 
> > > The character 'B' is already used by unsigned bytes -- I wonder if it's easy
> > > to support 'B123' and plain 'B' at the same time, or whether we have to
> > > pick a different letter for "byte strings". 'y' would be free...
> >
> > It seems to me the motivation to change the 'S' dtype to something else
> > is to make things clearer with respect to the new conventions of Python
> > 3.  (Where str -> bytes, and unicode -> str). In that sense, I'm not
> > sure there's any advantage going from "S" to "y" (particularly without
> > doing "U" to "S"), whereas there's a strong backward-compatibility
> > advantage to keep it as "S", though admittedly it's confusing to someone
> > who doesn't know the pre Python 3 history.
> 
> I think a better plan is to deprecate "U" instead of "S".
> 
> Also, I'm not completely convinced that staying with "S" == bytes has a
> strong backward-compatibility advantage:
> 
> 	array(['foo']).dtype == 'U'
> 
> and this will break code in several places.

That's true, but at least this can be attributed to a poor programming 
practice.  The same happens with:

array([1]).dtype == 'int32'  # on 32-bit systems
array([1]).dtype == 'int64'  # on 64-bit systems

and my impression is that the int32/int64 duality of the default int would hit
many more NumPy users than the "U"/"S" duality of the default string.
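
For illustration, here is a minimal sketch of a check that does not depend on the platform's default integer width. The helper name is_integer_array is hypothetical, not part of NumPy; np.issubdtype and dtype.kind are existing NumPy APIs.

```python
import numpy as np

def is_integer_array(arr):
    """Hypothetical helper: True for any integer dtype, whatever its width."""
    return np.issubdtype(arr.dtype, np.integer)

a = np.array([1])

# The exact dtype (int32 or int64) depends on the platform, so code that
# compares it against one fixed name is fragile.
print(a.dtype)

# These checks give the same answer on every platform.
print(is_integer_array(a))
print(a.dtype.kind == 'i')
```

Comparing against the kind rather than a concrete type name is one way to avoid the fragile practice described above.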

> Also, for instance,
> 
> 	array(['foo', 'bar'], dtype='S3')
> 
> will result in encoding errors.

I don't think so.  Existing code using the above idiom almost certainly sticks
to the plain 7-bit ASCII character set, so we should not expect encoding errors
here.
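
As a rough sketch of the point, assuming Python 3 where 'foo' is a unicode string: encoding each item explicitly sidesteps any implicit cast, and pure ASCII input never fails. The variable names are illustrative only.

```python
import numpy as np

words = ['foo', 'bar']

# Encode explicitly, so nothing relies on an implicit unicode -> bytes cast.
encoded = [w.encode('ascii') for w in words]
arr = np.array(encoded, dtype='S3')
print(arr.dtype, arr.tolist())

# Only non-ASCII input can fail at the encoding step.
try:
    'caf\u00e9'.encode('ascii')
    non_ascii_failed = False
except UnicodeEncodeError:
    non_ascii_failed = True
print(non_ascii_failed)
```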

> We probably don't want to start
> implicitly casting Unicode to bytes, since Py3 does not do that either.

I agree.

> The only places where the dtype characters are used, AFAIK, is in repr
> and in the dtype kwarg -- they are not used in pickles etc.
> 
> One can actually argue that changing 'U' to 'S' is more
> backward-compatible:
> 
> 	array(['foo', 'bar'], dtype='S3')
> 
> would still be valid code. Of course, the semantics change, but this
> happens anyway on the Python side when moving to Py3.

Mmh, the more I look at this, the more I think that we can safely keep 'S' for
bytes and 'U' for unicode.  The only glitch would be:

array(['foo']).dtype == 'U'

but again, I don't think this is going to break a lot of code.
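
A small sketch of how code could ask whether an array holds text or bytes without spelling out the full dtype string. It assumes the 'S' for bytes, 'U' for unicode convention is kept and that the code runs on Python 3.

```python
import numpy as np

text = np.array(['foo'])    # Python 3 str (unicode) input
raw = np.array([b'foo'])    # Python 3 bytes input

# dtype.kind is a one-character code: 'U' for unicode, 'S' for byte strings.
print(text.dtype.kind)
print(raw.dtype.kind)

# The item width is part of the dtype, but not of the kind.
print(text.dtype.itemsize, raw.dtype.itemsize)
```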

> > I'm not sure your suggestion of making 'B' and 'B123' both work seems
> > like a good one because of the semantic differences between numbers and
> > strings. Would np.array(['a', 'b']) have a repr of [97, 98] or ['a',
> > 'b']?  Sorting them would also not necessarily do the right thing.
> 
> I think the point would be that 'B' and 'B1' would be treated as
> completely separate dtypes with different typenums -- they'd look
> similar only in the dtype character API (which is not so large) but not
> internally. np.array([b'a', b'b']).dtype would be 'B1'. Might be a bit
> confusing, though.

Yeah.  Making 'B' and 'B1' such different types sounds very confusing, IMHO.
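
To make the semantic gap concrete, here is a short sketch that uses the existing 'B' (unsigned byte) and 'S1' (one-byte string) codes, not the proposed 'B1'. The same two bytes in memory read as numbers under one dtype and as strings under the other.

```python
import numpy as np

numbers = np.array([97, 98], dtype='B')       # unsigned 8-bit integers
strings = np.array([b'a', b'b'], dtype='S1')  # one-byte byte strings

# Identical raw memory, different meaning.
print(numbers.tobytes() == strings.tobytes())

# Reinterpreting the byte strings as uint8 yields the character codes.
as_numbers = strings.view(np.uint8)
print(as_numbers.tolist())

# 'B' is simply another spelling of uint8.
print(numbers.dtype == np.uint8)
```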

-- 
Francesc Alted