Re: [I18n-sig] Re: [Python-Dev] Unicode debate

April 28, 2000


      [Paul Prescod]
...
I think that maybe an important point is getting lost here. I could be
wrong, but it seems that all of this emphasis on encodings is misplaced.
In practical applications that manipulate text, encodings creep up all
the time.  I remember a talk or message by Andy Robinson about the
messiness of producing printed reports in Japanese for a large
investment firm.  Most of the issues that took his time had to do
with encodings, if I recall correctly.  (Andy, do you remember what
I'm talking about?  Do you have a URL?)
...
...
The truth of the matter is: the encoding of string objects is in the
mind of the programmer.  When I read a GIF file into a string object,
the encoding is "binary goop".
IMHO, it's a mistake of history that you would even think it makes sense
to read a GIF file into a "string" object and we should be trying to
erase that mistake, as quickly as possible (which is admittedly not very
quickly) not building more and more infrastructure around it. How can we
make the transition to a "binary goops are not strings" world easiest?
I'm afraid that's a bigger issue than we can solve for Python 1.6.
We're committed to remaining, by and large, backwards compatible while
supporting Unicode -- backwards compatibility with tons of extension
modules (many 3rd party) requires that we deal with 8-bit strings in
basically the same way as we did before.
...
...
The moral of all this?  8-bit strings are not going away.
If that is a statement of your long term vision, then I think that it is
very unfortunate. Treating string literals as if they were isomorphic
with byte arrays was probably the right thing in 1991 but it won't be in
2005.
I think you're a tad too optimistic about the evolution speed of
software (Windows 2000 *still* has to support DOS programs), but I see
your point.  As I stated in another message, in Python 3000 we'll have
to consider a more Java-esque solution: *character* strings are
Unicode, and for bytes we have (mutable!) byte arrays.  Certainly 8-bit
bytes as the smallest storage unit aren't going away.
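
A minimal sketch of the character-string / byte-array split described
above, assuming Python 3 semantics (which postdate this message).  The
variable names are illustrative only.

```python
# Character strings hold Unicode code points; bytes hold raw 8-bit values.
text = "na\u00efve"              # 5 characters, one of them non-ASCII
data = text.encode("utf-8")      # immutable bytes: the encoding is now explicit

# A mutable byte array, for in-place manipulation of binary data.
buf = bytearray(data)
buf[0] = ord("N")                # modify one byte in place

# "Binary goop" such as a GIF header is bytes, not a character string.
gif_header = b"GIF89a"

# Mixing the two types is an error rather than a silent guess at an encoding.
try:
    text + data
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

print(len(text), len(data))      # character count vs. byte count
print(buf.decode("utf-8"))
print(mixing_rejected)
```

The character count and the byte count differ here because the one
non-ASCII character takes two bytes in UTF-8; which encoding applies is
stated at the boundary instead of living in the programmer's head.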
...
It doesn't meet the definition of string used in the Unicode spec., nor
in XML, nor in Java, nor at the W3C nor in most other up and coming
specifications.
OK, so that's a good indication of where you're coming from.  Maybe
you should spend a little more time in the trenches and a little less
in standards bodies.  Standards are good, but sometimes disconnected
from reality (remember ISO networking? :-).
...
From the W3C site:
"While ISO-2022-JP is not sufficient for every ISO10646 document, it is
the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP."
And this is exactly why encodings will remain important: entities
encoded in ISO-2022-JP have no compelling reason to be recoded
permanently into ISO10646, and there are lots of forces that make it
convenient to keep it encoded in ISO-2022-JP (like existing tools).
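
To make the asymmetry in the W3C quote concrete, here is a small sketch
using the "iso2022_jp" codec from the standard library of a modern
Python 3 (an assumption; it is not the Python 1.6 under discussion).
The sample strings are arbitrary.

```python
# Text that ISO-2022-JP can represent: three kanji.
japanese = "\u65e5\u672c\u8a9e"

# Stored in the legacy encoding, the data is 7-bit with escape sequences.
legacy = japanese.encode("iso2022_jp")
is_seven_bit = all(b < 0x80 for b in legacy)

# Anything encoded in ISO-2022-JP decodes to Unicode (ISO10646) losslessly.
round_trip = legacy.decode("iso2022_jp")

# The reverse does not hold: a Hangul syllable has no ISO-2022-JP form.
try:
    "\ud55c".encode("iso2022_jp")
    hangul_fits = True
except UnicodeEncodeError:
    hangul_fits = False

print(is_seven_bit, round_trip == japanese, hangul_fits)
```

Existing tools can keep working on the bytes in `legacy` unchanged,
which is the convenience argument made above; the decode step is only
needed where the text is processed as characters.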
...
http://www.w3.org/MarkUp/html-spec/charset-harmful.html
I know that document well.

--Guido van Rossum (home page: http://www.python.org/~guido/)