[Numpy-discussion] Py3 merge
Francesc Alted
faltet at pytables.org
Tue Dec 8 06:03:41 EST 2009
On Monday 07 December 2009 16:32:50 Pauli Virtanen wrote:
> Mon, 2009-12-07 at 09:50 -0500, Michael Droettboom wrote:
> > Pauli Virtanen wrote:
>
> [clip]
>
> > > The character 'B' is already used by unsigned bytes -- I wonder if it's easy
> > > to support 'B123' and plain 'B' at the same time, or whether we have to
> > > pick a different letter for "byte strings". 'y' would be free...
> >
> > It seems to me the motivation to change the 'S' dtype to something else
> > is to make things clearer with respect to the new conventions of Python
> > 3. (Where str -> bytes, and unicode -> str). In that sense, I'm not
> > sure there's any advantage going from "S" to "y" (particularly without
> > doing "U" to "S"), whereas there's a strong backward-compatibility
> > advantage to keep it as "S", though admittedly it's confusing to someone
> > who doesn't know the pre Python 3 history.
>
> I think a better plan is to deprecate "U" instead of "S".
>
> Also, I'm not completely convinced that staying with "S" == bytes has a
> strong backward-compatibility advantage:
>
> array(['foo']).dtype == 'U'
>
> and this will break code in several places.
That's true, but at least this can be attributed to a poor programming
practice. The same happens with:
array([1]).dtype == 'int32' # on 32-bit systems
array([1]).dtype == 'int64' # on 64-bit systems
and my impression is that the int32/int64 duality of the default int would hit
many more NumPy users than the "U"/"S" duality of the default string.
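A minimal sketch of the more robust alternative, assuming a reasonably recent NumPy; the variable names are only illustrative. Rather than comparing against an exact dtype string, code can ask what kind of dtype it got:

```python
import numpy as np

a = np.array([1])

# Fragile: the width of the default integer depends on the platform,
# so this comparison is True on some systems and False on others.
is_int32 = a.dtype == 'int32'

# More portable: check the kind of the dtype instead of its exact width.
is_integer = np.issubdtype(a.dtype, np.integer)
is_signed_int_kind = a.dtype.kind == 'i'
```

Both portable checks hold whether the default integer is int32 or int64.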
> Also, for instance,
>
> array(['foo', 'bar'], dtype='S3')
>
> will result in encoding errors.
I don't think so. Existing code that uses the above idiom is almost certainly
limited to the plain 7-bit ASCII character set, so we should not expect
encoding errors here.
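For illustration only, here is a sketch of how the idiom behaves on a recent NumPy running under Python 3, which is an assumption that postdates this discussion. ASCII-only input converts to bytes, while non-ASCII input is where an encoding error would show up:

```python
import numpy as np

# ASCII-only str input is stored as fixed-width byte strings.
ascii_arr = np.array(['foo', 'bar'], dtype='S3')

# Non-ASCII text cannot be stored in an 'S' array without an explicit
# encoding, so the conversion is expected to fail.
try:
    np.array(['caf\u00e9'], dtype='S4')
    non_ascii_failed = False
except UnicodeEncodeError:
    non_ascii_failed = True
```

Code that needs non-ASCII data in a bytes array would have to encode it explicitly first, for example with `'caf\u00e9'.encode('utf-8')`.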
> We probably don't want to start
> implicitly casting Unicode to bytes, since Py3 does not do that either.
I agree.
> The only places where the dtype characters are used, AFAIK, are in repr
> and in the dtype kwarg -- they are not used in pickles etc.
>
> One can actually argue that changing 'U' to 'S' is more
> backward-compatible:
>
> array(['foo', 'bar'], dtype='S3')
>
> would still be valid code. Of course, the semantics change, but that
> happens on the Python side anyway when moving to Py3.
Mmh, the more I look at this, the more I think that we can safely keep 'S' for
bytes and 'U' for unicode. The only glitch would be:
array(['foo']).dtype == 'U'
but again, I don't think this is going to break a lot of code.
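A small sketch of the glitch and of a check that does not depend on it, assuming Python 3 and a recent NumPy (the variable names are illustrative). A plain string literal is unicode there, so the default dtype character is 'U':

```python
import numpy as np

s = np.array(['foo'])

# Under Python 3, 'foo' is a unicode str, so the dtype character is 'U'.
default_char = s.dtype.char

# Checks that work for either flavour of string array.
is_string_like = s.dtype.kind in ('S', 'U')
is_unicode = np.issubdtype(s.dtype, np.str_)
is_bytes = np.issubdtype(s.dtype, np.bytes_)
```

Code that only needs to know "is this a string array?" can test `is_string_like` and stay independent of which default is in effect.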
> > I'm not sure your suggestion of making 'B' and 'B123' both work seems
> > like a good one because of the semantic differences between numbers and
> > strings. Would np.array(['a', 'b']) have a repr of [97, 98] or ['a',
> > 'b']? Sorting them would also not necessarily do the right thing.
>
> I think the point would be that 'B' and 'B1' would be treated as
> completely separate dtypes with different typenums -- they'd look
> similar only in the dtype character API (which is not so large) but not
> internally. np.array([b'a', b'b']).dtype would be 'B1'. Might be a bit
> confusing, though.
Yeah. Making 'B' and 'B1' such different types sounds very confusing, IMHO.
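To make the semantic gap concrete, here is a short sketch using the existing 'B' (unsigned byte) and 'S1' (one-character byte string) dtypes, assuming a recent NumPy on Python 3. The two arrays share the same raw bytes but are different kinds of data:

```python
import numpy as np

nums = np.array([97, 98], dtype='B')       # unsigned 8-bit integers
strs = np.array([b'a', b'b'], dtype='S1')  # one-byte strings

# Same memory contents ...
same_bytes = nums.tobytes() == strs.tobytes()

# ... but different kinds: 'u' (unsigned int) versus 'S' (byte string).
num_kind = nums.dtype.kind
str_kind = strs.dtype.kind
```

A 'B' versus 'B1' spelling would hide exactly this difference behind nearly identical dtype strings.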
--
Francesc Alted