[XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Henry S. Thompson
ht@cogsci.ed.ac.uk
03 May 2000 10:59:28 +0100
Guido van Rossum <guido@python.org> writes:
> Paul, we're both just saying the same thing over and over without
> convincing each other. I'll wait till someone who wasn't in this
> debate before chimes in.
OK, I've never contributed to this discussion, but I have a long
history of shipping widely used Python/Tkinter/XML tools (see my
homepage). I care _very_ much that heretofore I have been unable to
support full XML because of the lack of Unicode support in Python.
I've already started playing with 1.6a2 for this reason.
I notice one apparent mis-communication between the various
contributors:
Treating narrow-strings as consisting of UNICODE code points <= 255 is
not necessarily the same thing as making Latin-1 the default encoding.
I don't think that, on Paul and Fredrik's account, encodings are
relevant to narrow-strings at all.
I'd rather go right away to the coherent position of byte-arrays,
narrow-strings and wide-strings. Encodings are only relevant to
conversion between byte-arrays and strings. Decoding a byte-array
with a UTF-8 encoding into a narrow string might cause
overflow/truncation, just as decoding a byte-array with a UTF-8
encoding into a wide-string might. The fact that decoding a
byte-array with a Latin-1 encoding into a narrow-string is a memcpy
is just a side-effect of the courtesy of the UNICODE designers wrt the
code points between 128 and 255.
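To make the distinction concrete, here is a minimal sketch in present-day Python (not the 1.6a2 API). The helper name decode_limited and its max_code_point parameter are hypothetical, introduced only for illustration: a "narrow" string is modelled as one whose code points are all <= 0xFF, a "wide" string as one whose code points are all <= 0xFFFF, and overflow is reported as an error rather than silently truncated.

```python
def decode_limited(data: bytes, encoding: str, max_code_point: int) -> str:
    """Decode a byte-array into a string whose code points must not
    exceed max_code_point (0xFF for narrow, 0xFFFF for 16-bit wide).

    Hypothetical helper for illustration only.
    """
    text = data.decode(encoding)
    for ch in text:
        if ord(ch) > max_code_point:
            raise OverflowError(
                f"U+{ord(ch):04X} does not fit below U+{max_code_point:04X}"
            )
    return text


NARROW = 0xFF    # assumed limit for a narrow-string
WIDE = 0xFFFF    # assumed limit for a 16-bit wide-string

# UTF-8 input whose code points all fit in a narrow-string.
cafe = decode_limited(b"caf\xc3\xa9", "utf-8", NARROW)

# Latin-1 decoding maps every byte value to the code point of the same
# number, which is why it amounts to a straight copy.
all_bytes = bytes(range(256))
latin1_points = [ord(ch) for ch in decode_limited(all_bytes, "latin-1", NARROW)]
```

Under these assumptions the euro sign (U+20AC) decoded from UTF-8 overflows a narrow-string but fits a wide-string, while a character outside the 16-bit range, such as U+1D11E, overflows both. Nothing here depends on a default encoding: the encoding is named only at the point where bytes become a string.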
This is effectively the way our C-based XML toolset (which we embed in
Python) works today -- we build an 8-bit version which uses char*
strings, and a 16-bit version which uses unsigned short* strings, and
convert from/to byte-streams in any supported encoding at the margins.
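The same "convert at the margins" shape can be sketched in present-day Python. This is not our toolset's code; the function name transcode and the upper-casing step are hypothetical stand-ins for whatever string processing happens in the middle.

```python
import io


def transcode(src, dst, in_encoding: str, out_encoding: str) -> None:
    """Read bytes, work on a decoded string, write bytes back out.

    Encodings appear only at the two edges; the processing step in the
    middle sees code points, never raw bytes.
    """
    raw = src.read()                          # byte-array at the input margin
    text = raw.decode(in_encoding)            # bytes -> string
    processed = text.upper()                  # placeholder for real processing
    dst.write(processed.encode(out_encoding)) # string -> bytes at the output margin


# Usage: Latin-1 bytes in, UTF-8 bytes out.
source = io.BytesIO("caf\u00e9".encode("latin-1"))
sink = io.BytesIO()
transcode(source, sink, "latin-1", "utf-8")
result = sink.getvalue()
```

The point of the layout is that the input and output encodings can be changed independently without touching the processing step.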
I'd like to keep byte-arrays at the margins in Python as well, for all
the reasons advanced by Paul and Fredrik.
I think treating existing strings as a sort of pun between
narrow-strings and byte-arrays is a recipe for ongoing confusion.
ht
--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2001, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/