[Numpy-discussion] Bytes vs. Unicode in Python3
Christopher Barker
Chris.Barker at noaa.gov
Fri Nov 27 13:36:36 EST 2009
> The point is that I don't think we can just decide to use Unicode or
> Bytes in all places where PyString was used earlier.
Agreed.
I think it's helpful to remember the origins of all this:
IMHO, there are two distinct types of data that Python2 strings support:
1) text: this is the traditional "string".
2) bytes: raw bytes -- they could represent anything.
This, of course, is what the py3k string and bytes types are all about.
However, when Python started, it just so happened that text was
represented by an array of unsigned single byte integers, so there
really was no point in having a "bytes" type, as a string would work
just as well.
Enter unicode:
Now we have multiple ways of representing text internally, but want a
single interface to that -- one that looks and acts like a sequence of
characters to user's code. The result is that the unicode type was
introduced.
In a way, unicode strings are a bit like arrays: they have an encoding
associated with them (like a dtype in numpy). You can represent a given
bit of text in multiple different arrangements of bytes, but they are all
supposed to mean the same thing and, if you know the encoding, you can
convert between them. This is kind of like how one can represent 5 in
any of many dtypes: uint8, int16, int32, float32, float64, etc. Not every
value representable in one dtype can be converted to all other dtypes, but
many can. Just like encodings.
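A minimal sketch of that analogy, using only the standard library (the
sample text and the struct format codes are illustrative choices, not
anything numpy does internally):

```python
import struct

text = "na\u00efve"  # "naive" with a diaeresis on the i

# The same text as three different arrangements of bytes.
encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}

# Knowing the encoding is enough to recover the original text.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}

# Not every text fits every encoding, just as not every value fits every dtype.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The number 5 in several fixed-size little-endian layouts:
# "B" ~ uint8, "h" ~ int16, "i" ~ int32, "f" ~ float32, "d" ~ float64.
five_as_ints = {fmt: struct.pack("<" + fmt, 5) for fmt in ("B", "h", "i")}
five_as_floats = {fmt: struct.pack("<" + fmt, 5.0) for fmt in ("f", "d")}
```

The byte strings in `encoded` all differ in content and length, yet each
decodes back to the same text, much as the packed forms of 5 differ in
size but unpack to the same number.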
Anyway, all this brings me to think about the use of strings in numpy in
this way: if it is meant to be a human-readable piece of text, it should
be a unicode object. If not, then it is bytes.
So: "fromstring" and the like should, of course, work with bytes (though
maybe buffers really...)
> Which one it will
> be should depend on the use. Users will expect that eg. array([1,2,3],
> dtype='f4') still works, and they don't have to do e.g. array([1,2,3],
> dtype=b'f4').
Personally, I try to use np.float32 instead, anyway, but I digress. In
this case, the "type code" is supposed to be a human-readable bit of
text -- it should be a unicode object (convertible to ascii for
interfacing with C...)
If we used b'f4', it would confuse things, as it couldn't be printed
right. Also: would the actual bytes involved potentially change
depending on what encoding was used for the literal? i.e. if the code
was written in utf16, would that byte string be 4 bytes long?
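A small sketch bearing on that question, assuming py3k semantics in which
a bytes literal may contain only ASCII characters, so its value should not
depend on the encoding of the source file. The length only changes when
text is encoded explicitly:

```python
code_bytes = b"f4"   # bytes literal: two ASCII characters, two bytes
code_text = "f4"     # text: two characters, no byte layout implied

as_ascii = code_text.encode("ascii")      # same two bytes as the literal
as_utf16 = code_text.encode("utf-16-le")  # two bytes per character here

lengths = (len(code_bytes), len(as_ascii), len(as_utf16))
```

Under that assumption the literal stays two bytes long; it is the explicit
UTF-16 encoding of the text that takes four.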
> To summarize the use cases I've ran across so far:
>
> 1) For 'S' dtype, I believe we use Bytes for the raw data and the
> interface.
I don't think so here. 'S' is usually used to store human-readable
strings; I'd certainly expect to be able to do:
s_array = np.array(['this', 'that'], dtype='S10')
And I'd expect it to work with non-literals that were unicode strings,
i.e. human readable text. In fact, it's pretty rare that I'd ever want
bytes here. So I'd see 'S' mapped to 'U' here.
Francesc Alted wrote:
> the next should still work:
>
> In [2]: s = np.array(['asa'], dtype="S10")
>
> In [3]: s[0]
> Out[3]: 'asa' # will become b'asa' in Python 3
I don't like that -- I put in a string, and get a bytes object back?
> In [4]: s.dtype.itemsize
> Out[4]: 10 # still 1-byte per element
But what if the strings passed in aren't representable in one byte
per character? Do we define "S" as only supporting ANSI-only strings?
What encoding?
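To make the itemsize concern concrete, here is a sketch in plain Python.
The helper `fits` and the constant `ITEMSIZE` are hypothetical, introduced
only for illustration; they are not part of numpy:

```python
ITEMSIZE = 10  # bytes per element, as in dtype 'S10'


def fits(text, encoding, itemsize=ITEMSIZE):
    """Return True if text, encoded with `encoding`, fits in itemsize bytes."""
    try:
        return len(text.encode(encoding)) <= itemsize
    except UnicodeEncodeError:
        # The text has characters this encoding cannot represent at all.
        return False


sample = "se\u00f1or ni\u00f1o"  # 10 characters, two of them non-ASCII

results = {enc: fits(sample, enc) for enc in ("ascii", "latin-1", "utf-8")}
```

The same 10-character string fits a 10-byte slot in latin-1, overflows it
in utf-8 (12 bytes), and cannot be stored as ASCII at all, so the answer
depends entirely on which encoding 'S' is taken to imply.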
Pauli Virtanen wrote:
> 'U'
> is same as Python 3 unicode and probably in same internal representation
> (need to check). Neither is associated with encoding info.
Isn't it? I thought the encoding was always the same internally, so it
is known?
Francesc Alted wrote:
> That could be a good idea because that would ensure compatibility with
> existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it
> should).
What do you mean by compatible? It would mean a lot of user code would
have to change with the 2->3 transition.
> The only thing that I don't like is that 'S' seems to be the
> initial letter for 'string', which is actually 'unicode' in Python 3 :-/
> But, for the sake of compatibility, we can probably live with that.
I suppose we could at least deprecate it.
>> Also, what will a bytes dtype mean within a py2 program context? Does
>> it matter if the bytes dtype just fails somehow if used in a py2
>> program?
Well, it should work in 2.6 anyway.
> Maybe we want to introduce a separate "bytes" dtype that's an alias
> for 'S'?
What do we need "bytes" for? Does it support anything that np.uint8
doesn't?
> 2) The field names:
>
> a = array([], dtype=[('a', int)])
> a = array([], dtype=[(b'a', int)])
>
> This is somewhat of an internal issue. We need to decide whether we
> internally coerce input to Unicode or Bytes.
Unicode is clear to me here -- it really should match what Python does
for variable names -- that is unicode in py3k, no?
> 3) Format strings
>
> a = array([], dtype=b'i4')
>
> I don't think it makes sense to handle format strings in Unicode
> internally -- they should always be coerced to bytes.
This should be fine -- we control what is a valid format string, and
thus they can always be ASCII-safe.
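A rough sketch of what such coercion could look like. The helper
`coerce_format` is hypothetical and is not how numpy actually handles
format strings; it only illustrates that an ASCII-only format code
round-trips safely between text and bytes:

```python
def coerce_format(fmt):
    """Hypothetical helper: return a format string as ASCII-only bytes."""
    if isinstance(fmt, bytes):
        # Validate only; raises UnicodeDecodeError on non-ASCII bytes.
        fmt.decode("ascii")
        return fmt
    # Raises UnicodeEncodeError if the text has non-ASCII characters.
    return fmt.encode("ascii")


from_text = coerce_format("i4")
from_bytes = coerce_format(b"i4")
```

Both spellings end up as the same bytes, and anything outside ASCII is
rejected up front instead of being passed on to C code.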
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov