[Numpy-discussion] Bytes vs. Unicode in Python3

Fri Nov 27 15:08:41 EST 2009

On Fri, 2009-11-27 at 10:36 -0800, Christopher Barker wrote:
[clip]
> > Which one it will
> > be should depend on the use. Users will expect that eg. array([1,2,3],
> > dtype='f4') still works, and they don't have to do e.g. array([1,2,3],
> > dtype=b'f4').
> 
> Personally, I try to use np.float32 instead, anyway, but I digress. In 
> this case, the "type code" is supposed to be a human-readable bit of 
> text -- it should be a unicode object (convertible to ascii for 
> interfacing with C...)

Yes, this would solve the repr() issue easily. Now that I look more
closely, the format strings are not actually used anywhere other than
in the descriptor user interface, so from an implementation POV Unicode
is not any harder.
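
A minimal sketch of the user-facing behaviour being discussed, assuming a
NumPy running on Python 3 where dtype strings are ordinary str objects.
The variable names are only illustrative:

```python
import numpy as np

# A dtype given as a plain (Unicode) string, as users already write it.
arr = np.array([1, 2, 3], dtype='f4')

# The type object Christopher prefers describes the same dtype.
same_dtype = (arr.dtype == np.dtype(np.float32))

# The type code is pure ASCII, so handing it to C code is a plain encode.
code_for_c = 'f4'.encode('ascii')
```

Since type codes only ever contain ASCII characters, the conversion at
the C boundary cannot fail for any valid code.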

[clip]
> Pauli Virtanen wrote:
> > 'U'
> > is same as Python 3 unicode and probably in same internal representation
> > (need to check). Neither is associated with encoding info.
> 
> Isn't it? I thought the encoding was always the same internally? so it 
> is known?

Yes, so it need not be associated with a separate piece of encoding
info.

[clip]
> >    Maybe we want to introduce a separate "bytes" dtype that's an alias
> >    for 'S'?
> 
> What do we need "bytes" for? does it support anything that np.uint8 
> doesn't?

It has a string representation, but that's probably it.

Actually, in Python 3, when you index a bytes object, you get integers
back, so just aliasing bytes_ = uint8 and making sure array() handles
bytes objects appropriately would be more or less consistent.
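
To illustrate the point, a small sketch. It uses np.frombuffer as a
stand-in for the proposed array() handling, since what array() itself
should do with bytes is exactly what is being decided here:

```python
import numpy as np

data = b'abc'

# Indexing a Python 3 bytes object yields an integer, not a length-1 string.
first = data[0]

# Viewing the same buffer as uint8 gives the same integers, element by element.
as_uint8 = np.frombuffer(data, dtype=np.uint8)

# The round trip back to bytes is lossless.
round_trip = as_uint8.tobytes()
```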

> > 2) The field names:
> > 
> > 	a = array([], dtype=[('a', int)])
> > 	a = array([], dtype=[(b'a', int)])
> > 
> > This is somewhat of an internal issue. We need to decide whether we
> > internally coerce input to Unicode or Bytes.
> 
> Unicode is clear to me here -- it really should match what Python does 
> for variable names -- that is unicode in py3k, no?

Yep, let's follow Python. So Unicode and only Unicode it is.
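
A short sketch of what "Unicode only" looks like from the user side,
assuming field names are given as ordinary Python 3 strings. The field
names 'a' and 'b' are just examples:

```python
import numpy as np

# Field names given as plain Python 3 strings (Unicode).
dt = np.dtype([('a', int), ('b', float)])

# The names come back as str objects, like Python identifiers.
names = dt.names
all_str = all(isinstance(name, str) for name in names)

# Fields are then looked up by the same Unicode name.
rec = np.zeros(2, dtype=dt)
column = rec['a']
```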

    ***

Ok, thanks for the feedback. The right answers seem to be:

1) Unicode works as it is now, and Python3 strings are Unicode.

   Bytes objects are coerced to uint8 by array(). We don't do implicit
   conversions between Bytes and Unicode.

   The 'S' dtype character will be deprecated, never appear in repr(),
   and its usage will result in a warning.

2) Field names are unicode always.

   Some backward compatibility needs to be added in pickling, and
   maybe the npy file format needs a fixed encoding.

3) Dtype strings are a user interface detail, and will be Unicode.
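
For reference, the "no implicit conversions" rule in item 1 mirrors what
Python 3 itself does. A plain-Python sketch, with no NumPy involved:

```python
text = 'f4'    # Python 3 str, i.e. Unicode
raw = b'f4'    # bytes

# Mixing the two without saying how to convert is an error in Python 3.
try:
    text + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Conversions have to be spelled out, with the encoding named.
encoded = text.encode('ascii')
decoded = raw.decode('ascii')
```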

-- 
Pauli Virtanen