[XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Wed, 03 May 2000 08:16:56 -0400


[Henry S. Thompson]
> OK, I've never contributed to this discussion, but I have a long
> history of shipping widely used Python/Tkinter/XML tools (see my
> homepage).  I care _very_ much that heretofore I have been unable to
> support full XML because of the lack of Unicode support in Python.
> I've already started playing with 1.6a2 for this reason.

Thanks for chiming in!

> I notice one apparent mis-communication between the various
> contributors:
> 
> Treating narrow-strings as consisting of UNICODE code points <= 255 is 
> not necessarily the same thing as making Latin-1 the default encoding.
> I don't think on Paul and Fredrik's account encodings are relevant to
> narrow-strings at all.

I agree that's what they are trying to tell me.

> I'd rather go right away to the coherent position of byte-arrays,
> narrow-strings and wide-strings.  Encodings are only relevant to
> conversion between byte-arrays and strings.  Decoding a byte-array
> with a UTF-8 encoding into a narrow string might cause
> overflow/truncation, just as decoding a byte-array with a UTF-8
> encoding into a wide-string might.  The fact that decoding a
> byte-array with a Latin-1 encoding into a narrow-string is a memcopy
> is just a side-effect of the courtesy of the UNICODE designers wrt the 
> code points between 128 and 255.
> 
> This is effectively the way our C-based XML toolset (which we embed in 
> Python) works today -- we build an 8-bit version which uses char*
> strings, and a 16-bit version which uses unsigned short* strings, and
> convert from/to byte-streams in any supported encoding at the margins.
> 
> I'd like to keep byte-arrays at the margins in Python as well, for all 
> the reasons advanced by Paul and Fredrik.
> 
> I think treating existing strings as a sort of pun between
> narrow-strings and byte-arrays is a recipe for ongoing confusion.

Very good analysis.
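
To make the Latin-1 point concrete, here is a minimal sketch.  It
assumes a modern Python 3 interpreter rather than 1.6a2: bytes stands
in for the byte-array, and since there is no separate narrow-string
type the "fits in 8 bits" check is only simulated.  The names data,
text, euro and fits_in_narrow are made up for illustration.

```python
# "Byte-array" holding every possible byte value, 0..255.
data = bytes(range(256))

# Decoding as Latin-1 maps each byte to the code point with the same
# number, which is why a C implementation could do it as a plain copy.
text = data.decode("latin-1")
assert [ord(ch) for ch in text] == list(data)

# Decoding as UTF-8 can produce code points above 255, which a
# hypothetical 8-bit narrow string could not hold.
euro = b"\xe2\x82\xac".decode("utf-8")   # U+20AC EURO SIGN
fits_in_narrow = all(ord(ch) <= 255 for ch in euro)
```

The encoding only matters at the decode step; once decoded, text is
just a sequence of code points.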

Unfortunately this is where we're stuck, until we have a chance to
redesign this kind of thing from scratch.  Python 1.5.2 programs use
strings for byte arrays probably as much as they use them for
character strings.  This is because way back in 1990, when I was
designing Python, I wanted to have the smallest set of basic types, but I
also wanted to be able to manipulate byte arrays somewhat.  Influenced
by K&R C, I chose to make strings and string I/O 8-bit clean so that
you could read a binary "string" from a file, manipulate it, and write
it back to a file, regardless of whether it was character or binary
data.

This model has never been challenged until now.  I agree that the Java
model (byte arrays and strings) or perhaps your proposed model (byte
arrays, narrow and wide strings) looks better.  But, although Python
has had rudimentary support for byte arrays for a while (the array
module, introduced in 1993), the majority of Python code manipulating
binary data still uses string objects.
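
As a rough sketch of the byte-array style, the read/manipulate/write
round trip described above could look like this with the array module.
It assumes a modern Python 3 interpreter (array.tobytes() and the
bytes type did not exist in 1.5.2), and an in-memory io.BytesIO stands
in for a real file; raw, buf and result are illustrative names.

```python
import array
import io

# Arbitrary binary data, including NUL and bytes above 127.
raw = bytes([0, 255, 10, 13, 128])

# Read the data from a (simulated) binary file.
source = io.BytesIO(raw)
chunk = source.read()

# Manipulate it as an array of unsigned bytes, not as text.
buf = array.array("B", chunk)
buf.reverse()

# Write it back out unchanged in length, with no encoding involved.
sink = io.BytesIO()
sink.write(buf.tobytes())
result = sink.getvalue()
```

No encoding is ever named here, which is the sense in which the data
stays "8-bit clean".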

My ASCII proposal is a compromise that tries to be fair to both uses
for strings.  Introducing byte arrays as a more fundamental type has
been on the wish list for a long time -- I see no way to introduce
this into Python 1.6 without totally botching the release schedule
(June 1st is very close already!).  I'd like to be able to move on;
there are other important things still to be added to 1.6 (Vladimir's
malloc patches, Neil's GC, Fredrik's completed sre...).

For 1.7 (which should happen later this year) I promise I'll reopen
the discussion on byte arrays.

--Guido van Rossum (home page: http://www.python.org/~guido/)