[Numpy-discussion] Bytes vs. Unicode in Python3

Fri Nov 27 13:36:36 EST 2009

> The point is that I don't think we can just decide to use Unicode or
> Bytes in all places where PyString was used earlier.

Agreed.

I think it's helpful to remember the origins of all this:

IMHO, there are two distinct types of data that Python2 strings support:

1) text: this is the traditional "string".
2) bytes: raw bytes -- they could represent anything.

This, of course, is what the py3k string and bytes types are all about.

However, when Python started, it just so happened that text was
represented by an array of unsigned single-byte integers, so there
really was no point in having a separate "bytes" type, as a string
would work just as well.

Enter unicode:

Now we have multiple ways of representing text internally, but want a 
single interface to that -- one that looks and acts like a sequence of 
characters to user's code. The result is that the unicode type was 
introduced.

In a way, unicode strings are a bit like arrays: they have an encoding
associated with them (like a dtype in numpy). You can represent a given
bit of text in multiple different arrangements of bytes, but they are all
supposed to mean the same thing and, if you know the encoding, you can
convert between them. This is kind of like how one can represent 5 in
any of many dtypes: uint8, int16, int32, float32, float64, etc. Not
every value representable in one dtype can be converted to every other
dtype, but many can. Just like encodings.
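
To make the analogy concrete, here is a minimal sketch using only the
standard library. The struct module stands in for numpy dtypes here
(that substitution is my assumption, for illustration only), and the
sample text "café" is just an example value.

```python
import struct

text = "caf\u00e9"  # four characters, the last one is non-ASCII

# The same text as three different arrangements of bytes.
encoded = {
    "latin-1": text.encode("latin-1"),
    "utf-8": text.encode("utf-8"),
    "utf-16-le": text.encode("utf-16-le"),
}

# Knowing the encoding is enough to get the original text back.
round_trips = all(raw.decode(name) == text for name, raw in encoded.items())

# The same number, 5, in layouts that mirror uint8, int16, int32,
# float32 and float64 (struct formats used as a stand-in for dtypes).
five = {fmt: struct.pack(fmt, 5) for fmt in ("<B", "<h", "<i", "<f", "<d")}

# Not every value fits every representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

try:
    struct.pack("<B", 300)  # 300 does not fit in one unsigned byte
    uint8_ok = True
except struct.error:
    uint8_ok = False
```

The byte strings differ in length and content, but each one decodes
back to the same four characters, just as each packed value still
means 5.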

Anyway, all this brings me to think about the use of strings in numpy in 
this way: if it is meant to be a human-readable piece of text, it should 
be a unicode object. If not, then it is bytes.

So: "fromstring" and the like should, of course, work with bytes (though 
maybe buffers really...)
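
A rough illustration of why raw binary input wants bytes (or a buffer)
rather than text. This sketch uses the standard library's struct module
as a stand-in; it is not how fromstring is implemented, and the
little-endian int32 layout is just an assumed example.

```python
import struct

# Twelve raw bytes holding three little-endian 32-bit integers.
raw = struct.pack("<3i", 1, 2, 3)

# Parsing works on a bytes object...
values = struct.unpack("<3i", raw)

# ...and equally on anything exposing the buffer interface.
values_from_view = struct.unpack("<3i", memoryview(raw))

# A text string is not raw data, so py3k refuses it outright.
try:
    struct.unpack("<3i", "x" * 12)
    text_accepted = True
except TypeError:
    text_accepted = False
```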

> Which one it will
> be should depend on the use. Users will expect that eg. array([1,2,3],
> dtype='f4') still works, and they don't have to do e.g. array([1,2,3],
> dtype=b'f4').

Personally, I try to use np.float32 instead, anyway, but I digress. In 
this case, the "type code" is supposed to be a human-readable bit of 
text -- it should be a unicode object (convertible to ascii for 
interfacing with C...)

If we used b'f4', it would confuse things, as it couldn't be printed 
right. Also: would the actual bytes involved potentially change 
depending on what encoding was used for the literal? i.e. if the code 
was written in utf16, would that byte string be 4 bytes long?
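
A quick check one could run under py3k, as a sketch. As far as I
understand the bytes-literal rules, a bytes literal may contain only
ASCII characters, so its length should not depend on the encoding of
the source file; the 4-byte case only shows up when text is explicitly
encoded as UTF-16.

```python
type_code = "f4"   # text, as a user would normally write it
as_bytes = b"f4"   # the bytes literal under discussion

print(type_code)   # f4
print(as_bytes)    # b'f4'

# The literal is two bytes, one per ASCII character.
literal_length = len(as_bytes)

# Converting the text form to ASCII for the C side gives the same bytes.
ascii_form = type_code.encode("ascii")

# Explicitly encoding as UTF-16 is what doubles the size.
utf16_length = len(type_code.encode("utf-16-le"))
```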

> To summarize the use cases I've ran across so far:
> 
> 1) For 'S' dtype, I believe we use Bytes for the raw data and the
>    interface.

I don't think so here. 'S' is usually used to store human-readable
strings; I'd certainly expect to be able to do:

s_array = np.array(['this', 'that'], dtype='S10')

And I'd expect it to work with non-literals that were unicode strings, 
i.e. human readable text. In fact, it's pretty rare that I'd ever want 
bytes here. So I'd see 'S' mapped to 'U' here.

Francesc Alted wrote:
> the next  should still work:
> 
> In [2]: s = np.array(['asa'], dtype="S10")
> 
> In [3]: s[0]
> Out[3]: 'asa'  # will become b'asa' in Python 3

I don't like that -- I put in a string, and get a bytes object back?

> In [4]: s.dtype.itemsize
> Out[4]: 10     # still 1-byte per element

But what if the strings passed in aren't representable in one byte
per character? Do we define "S" as supporting only ANSI strings?
In what encoding?
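
To show why the encoding question matters for a fixed-width field,
here is a small sketch. The helper fits_in_field is hypothetical (it is
not a numpy function); it only asks whether a piece of text fits in a
field of a given byte size under a given encoding.

```python
def fits_in_field(text, itemsize, encoding):
    """Hypothetical helper: can `text` be stored in `itemsize` bytes?"""
    try:
        raw = text.encode(encoding)
    except UnicodeEncodeError:
        # The encoding cannot represent this text at all.
        return False
    return len(raw) <= itemsize


plain = "asa"
accented = "caf\u00e9"  # four characters, one of them non-ASCII

plain_fits = fits_in_field(plain, 10, "ascii")
latin1_fits = fits_in_field(accented, 4, "latin-1")  # 1 byte per character
utf8_fits = fits_in_field(accented, 4, "utf-8")      # needs 5 bytes
ascii_fits = fits_in_field(accented, 10, "ascii")    # not representable
```

The same four characters fit a 4-byte field in one encoding and not in
another, so "one byte per element" only means something once an
encoding is fixed.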

Pauli Virtanen wrote:
> 'U'
> is same as Python 3 unicode and probably in same internal representation
> (need to check). Neither is associated with encoding info.

Isn't it? I thought the encoding was always the same internally, so
it is known?

Francesc Alted wrote:
> That could be a good idea because that would ensure compatibility with 
> existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it 
> should).

What do you mean by compatible? It would mean a lot of user code would
have to change with the 2->3 transition.

> The only thing that I don't like is that that 'S' seems to be the 
> initial letter for 'string', which is actually 'unicode' in Python 3 :-/
> But, for the sake of compatibility, we can probably live with that.

I suppose we could at least deprecate it.

>> Also, what will a bytes dtype mean within a py2 program context?  Does
>> it matter if the bytes dtype just fails somehow if used in a py2
>> program?

Well, it should work in 2.6 anyway.

>    Maybe we want to introduce a separate "bytes" dtype that's an alias
>    for 'S'?

What do we need "bytes" for? Does it support anything that np.uint8
doesn't?

> 2) The field names:
> 
> 	a = array([], dtype=[('a', int)])
> 	a = array([], dtype=[(b'a', int)])
> 
> This is somewhat of an internal issue. We need to decide whether we
> internally coerce input to Unicode or Bytes.

Unicode is clear to me here -- it really should match what Python does 
for variable names -- that is unicode in py3k, no?

> 3) Format strings
> 
> 	a = array([], dtype=b'i4')
> 
> I don't think it makes sense to handle format strings in Unicode
> internally -- they should always be coerced to bytes.

This should be fine -- we control what is a valid format string, and 
thus they can always be ASCII-safe.
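
A sketch of what that coercion could look like. The function name
normalize_format is hypothetical and this is not numpy's actual code;
it just accepts either text or bytes and hands back ASCII bytes.

```python
def normalize_format(fmt):
    """Hypothetical helper: coerce a format string to ASCII bytes."""
    if isinstance(fmt, str):
        # Raises UnicodeEncodeError for anything outside ASCII.
        return fmt.encode("ascii")
    if isinstance(fmt, bytes):
        # Raises UnicodeDecodeError if the bytes are not plain ASCII.
        fmt.decode("ascii")
        return fmt
    raise TypeError("format must be str or bytes")


from_text = normalize_format("i4")
from_bytes = normalize_format(b"i4")

try:
    normalize_format("\u00e94")  # non-ASCII input is rejected
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Since we decide what a valid format string is, the user can keep
writing 'i4' as text and the bytes form stays an internal detail.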

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov