[Python-Dev] Unicode compromise?
Guido van Rossum
guido@python.org
Tue, 02 May 2000 16:47:30 -0400
> I could live with this compromise as long as we document that a future
> version may use the "character is a character" model. I just don't want
> people to start depending on a catchable exception being thrown because
> that would stop us from ever unifying unmarked literal strings and
> Unicode strings.
Agreed (as I've said before).
> --
>
> Are there any steps we could take to make a future divorce of strings
> and byte arrays easier? What if we added a
>
> binary_read()
>
> function that returns some form of byte array. The byte array type could
> be just like today's string type except that its type object would be
> distinct, it wouldn't have as many string-ish methods and it wouldn't
> have any auto-conversion to Unicode at all.
You can do this now with the array module, although clumsily:
>>> import array
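>>> f = open("/core", "rb")           # binary mode, no newline translation
>>> a = array.array('B', [0]) * 1000  # preallocate 1000 unsigned bytes
>>> f.readinto(a)                     # fills a in place, returns the byte count
1000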
>>>
Or if you wanted to read raw Unicode (UTF-16):
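>>> a = array.array('H', [0]) * 1000  # 1000 unsigned 16-bit items
>>> f.readinto(a)                     # 2 bytes per item, so 2000 bytes
2000
>>> u = unicode(a, "utf-16")          # decode the buffer contents as UTF-16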
>>>
There are some performance issues, e.g. you have to initialize the
buffer somehow and that seems a bit wasteful.
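For comparison, here is a minimal sketch of the same preallocate-then-readinto pattern, assuming a present-day Python 3 where bytearray and io.BytesIO exist (neither is part of the Python discussed in this message). The helper name read_into_buffer is hypothetical, and the in-memory stream only stands in for a file opened with "rb". The buffer still has to be zero-filled before it is overwritten, which is the waste mentioned above.

```python
import io

def read_into_buffer(stream, size):
    """Read up to `size` bytes from a binary stream into a fresh buffer.

    Hypothetical helper for illustration only.
    """
    buf = bytearray(size)      # zero-filled, like array.array('B', [0]) * size
    n = stream.readinto(buf)   # number of bytes actually stored in buf
    return buf, n

# Raw bytes: an in-memory stream stands in for open(path, "rb").
stream = io.BytesIO(b"\x00\x01\x02\x03" * 250)
buf, n = read_into_buffer(stream, 1000)

# Raw UTF-16: read the bytes first, then decode explicitly.
encoded = "hello".encode("utf-16-le")
ubuf, un = read_into_buffer(io.BytesIO(encoded), len(encoded))
text = ubuf[:un].decode("utf-16-le")
```

The decode step is explicit here, so nothing is converted to Unicode unless the caller asks for it.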
> People could start to transition code that reads non-ASCII data to the
> new function. We could put big warning labels on read() to state that it
> might not always be able to read data that is not in some small set of
> recognized encodings (probably UTF-8 and UTF-16).
>
> Or perhaps binary_open(). Or perhaps both.
>
> I do not suggest just using the text/binary flag on the existing open
> function because we cannot immediately change its behavior without
> breaking code.
A new method makes the most sense -- there are definitely situations where
you want to read in text mode for a while and then switch to binary
mode (e.g. HTTP).
I'd like to put this off until after Python 1.6 -- but it deserves
attention.
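To make the HTTP case above concrete, here is a rough sketch of reading header lines as text and then switching to raw bytes on the same stream. It assumes present-day Python 3 and a single binary stream with explicit decoding, which is only one possible design and not the new method proposed here. The name read_http_message is hypothetical, and the parsing is deliberately simplified (no folded headers, no chunked bodies).

```python
import io

def read_http_message(stream):
    """Parse header lines as text, then return the body as raw bytes.

    Hypothetical helper for illustration only.
    """
    headers = {}
    while True:
        line = stream.readline()            # bytes up to and including b"\n"
        if line in (b"\r\n", b"\n", b""):   # blank line (or EOF) ends the headers
            break
        # Text part: decode each header line explicitly.
        name, _, value = line.decode("iso-8859-1").partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", "0"))
    body = stream.read(length)              # binary part: no decoding applied
    return headers, body

raw = (b"Content-Type: application/octet-stream\r\n"
       b"Content-Length: 4\r\n"
       b"\r\n"
       b"\xff\xfe\x00\x01")
headers, body = read_http_message(io.BytesIO(raw))
```

The body bytes are not valid text in any common encoding, which is why they must never pass through an automatic conversion.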
--Guido van Rossum (home page: http://www.python.org/~guido/)