[XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate

Henry S. Thompson ht@cogsci.ed.ac.uk
03 May 2000 10:59:28 +0100


Guido van Rossum <guido@python.org> writes:

> Paul, we're both just saying the same thing over and over without
> convincing each other.  I'll wait till someone who wasn't in this
> debate before chimes in.

OK, I've never contributed to this discussion, but I have a long
history of shipping widely used Python/Tkinter/XML tools (see my
homepage).  I care _very_ much that heretofore I have been unable to
support full XML because of the lack of Unicode support in Python.
I've already started playing with 1.6a2 for this reason.

I notice one apparent mis-communication between the various
contributors:

Treating narrow-strings as consisting of UNICODE code points <= 255 is
not necessarily the same thing as making Latin-1 the default encoding.
As I read Paul and Fredrik's account, encodings are not relevant to
narrow-strings at all.

I'd rather go right away to the coherent position of byte-arrays,
narrow-strings and wide-strings.  Encodings are only relevant to
conversion between byte-arrays and strings.  Decoding a byte-array
with a UTF-8 encoding into a narrow-string might cause
overflow/truncation (any code point above 255 will not fit), just as
decoding a byte-array with a UTF-8 encoding into a wide-string might
(for code points beyond the wide-string's range).  The fact that
decoding a byte-array with a Latin-1 encoding into a narrow-string is
a memcpy is just a side-effect of the courtesy of the UNICODE
designers wrt the code points between 128 and 255.
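
The distinction drawn above can be sketched in present-day Python,
where bytes plays the role of the byte-array and str holds code
points.  This is only an illustration under stated assumptions, not
how Python 1.6 or any proposal in this thread works: decode_checked,
NARROW_MAX and WIDE_MAX are hypothetical names, and the 16-bit limit
for wide-strings is assumed.

```python
NARROW_MAX = 0xFF    # assumed limit: narrow-string code points fit in 8 bits
WIDE_MAX = 0xFFFF    # assumed limit: wide-string code points fit in 16 bits


def decode_checked(data: bytes, encoding: str, limit: int) -> str:
    """Decode a byte-array, refusing code points that exceed `limit`.

    Hypothetical helper: it models decoding into a fixed-width string
    type and reports overflow instead of silently truncating.
    """
    text = data.decode(encoding)
    for ch in text:
        if ord(ch) > limit:
            raise OverflowError(
                f"U+{ord(ch):04X} does not fit below U+{limit:04X}"
            )
    return text


# Latin-1 decoding maps byte value n to code point n, so it can never
# overflow a narrow-string.
raw = bytes([0x63, 0x61, 0x66, 0xE9])
narrow = decode_checked(raw, "latin-1", NARROW_MAX)
same_values = [ord(c) for c in narrow] == list(raw)

# UTF-8 input may or may not fit, depending on the code points encoded.
fits = decode_checked("caf\u00e9".encode("utf-8"), "utf-8", NARROW_MAX)
try:
    decode_checked("\u20ac".encode("utf-8"), "utf-8", NARROW_MAX)
    euro_overflowed = False
except OverflowError:
    euro_overflowed = True
```

Note that the encoding only appears in the call that crosses from
bytes to string; the resulting string carries no encoding of its own.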

This is effectively the way our C-based XML toolset (which we embed in 
Python) works today -- we build an 8-bit version which uses char*
strings, and a 16-bit version which uses unsigned short* strings, and
convert from/to byte-streams in any supported encoding at the margins.
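
A rough Python sketch of the same "convert at the margins" pattern
follows.  It is not the C toolset described above; transcode_upper is
a hypothetical function, and the upper-casing step merely stands in
for whatever processing happens on strings in the interior.

```python
import io
from typing import BinaryIO


def transcode_upper(src: BinaryIO, dst: BinaryIO,
                    in_encoding: str, out_encoding: str) -> None:
    """Read bytes, work on a string, write bytes (hypothetical example)."""
    text = src.read().decode(in_encoding)   # input margin: bytes -> string
    result = text.upper()                   # interior: code points only
    dst.write(result.encode(out_encoding))  # output margin: string -> bytes


source = io.BytesIO(b"caf\xe9")             # Latin-1 encoded input
sink = io.BytesIO()
transcode_upper(source, sink, "latin-1", "utf-8")
```

The interior step never sees an encoding; only the two calls at the
edges do.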

I'd like to keep byte-arrays at the margins in Python as well, for all 
the reasons advanced by Paul and Fredrik.

I think treating existing strings as a sort of pun between
narrow-strings and byte-arrays is a recipe for ongoing confusion.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/