[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Fri, 28 Apr 2000 10:50:05 -0400


[Paul Prescod]
> I think that maybe an important point is getting lost here. I could be
> wrong, but it seems that all of this emphasis on encodings is misplaced.

In practical applications that manipulate text, encodings crop up all
the time.  I remember a talk or message by Andy Robinson about the
messiness of producing printed reports in Japanese for a large
investment firm.  Most of the issues that took his time had to do
with encodings, if I recall correctly.  (Andy, do you remember what
I'm talking about?  Do you have a URL?)

> > The truth of the matter is: the encoding of string objects is in the
> > mind of the programmer.  When I read a GIF file into a string object,
> > the encoding is "binary goop".  
> 
> IMHO, it's a mistake of history that you would even think it makes sense
> to read a GIF file into a "string" object and we should be trying to
> erase that mistake, as quickly as possible (which is admittedly not very
> quickly) not building more and more infrastructure around it. How can we
> make the transition to a "binary goops are not strings" world easiest?

I'm afraid that's a bigger issue than we can solve for Python 1.6.
We're committed to remaining by and large backwards compatible while
supporting Unicode -- backwards compatibility with tons of extension
modules (many 3rd party) requires that we deal with 8-bit strings in
basically the same way as we did before.

> > The moral of all this?  8-bit strings are not going away.  
> 
> If that is a statement of your long term vision, then I think that it is
> very unfortunate. Treating string literals as if they were isomorphic
> with byte arrays was probably the right thing in 1991 but it won't be in
> 2005.

I think you're a tad too optimistic about the evolution speed of
software (Windows 2000 *still* has to support DOS programs), but I see
your point.  As I stated in another message, in Python 3000 we'll have
to consider a more Java-esque solution: *character* strings are
Unicode, and for bytes we have (mutable!) byte arrays.  Certainly 8-bit
bytes as the smallest storage unit aren't going away.
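
To make that split concrete, here is a minimal sketch.  It assumes
the str and bytearray types as they later shipped in Python 3, which
is an assumption relative to this message; the variable names are
purely illustrative.

```python
# Character string: a sequence of Unicode code points, immutable.
text = "na\u00efve"                   # five characters, one of them non-ASCII

# Byte array: a mutable sequence of 8-bit values.  Getting from one to
# the other always goes through an explicit encoding.
data = bytearray(text.encode("utf-8"))

char_count = len(text)                # counts characters
byte_count = len(data)                # counts bytes; the i-diaeresis takes two in UTF-8

# Byte arrays can be modified in place; character strings cannot.
data[0] = ord("N")
modified = data.decode("utf-8")

# Mixing the two kinds of object is an error rather than a silent guess.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The point of the sketch is only that the encoding becomes an explicit
step at the boundary between the two types, instead of living in the
mind of the programmer.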

> It doesn't meet the definition of string used in the Unicode spec., nor
> in XML, nor in Java, nor at the W3C nor in most other up and coming
> specifications.

OK, so that's a good indication of where you're coming from.  Maybe
you should spend a little more time in the trenches and a little less
in standards bodies.  Standards are good, but sometimes disconnected
from reality (remember ISO networking? :-).

> From the W3C site:
> 
> "While ISO-2022-JP is not sufficient for every ISO10646 document, it is
> the case that ISO10646 is a sufficient document character set for any
> entity encoded with ISO-2022-JP."

And this is exactly why encodings will remain important: entities
encoded in ISO-2022-JP have no compelling reason to be recoded
permanently into ISO10646, and there are lots of forces that make it
convenient to keep it encoded in ISO-2022-JP (like existing tools).
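
A rough sketch of the asymmetry in that W3C quote, assuming Python
3's standard "iso2022_jp" codec (an assumption relative to this
message); the sample strings are made up for illustration.

```python
# Some Japanese text, written with escapes so the source stays ASCII.
japanese = "\u65e5\u672c\u8a9e"

# Encode to ISO-2022-JP: a 7-bit encoding that switches character
# sets with escape sequences, which is why legacy tools can carry it.
legacy_bytes = japanese.encode("iso2022_jp")
is_seven_bit = all(b < 0x80 for b in legacy_bytes)

# Anything encoded in ISO-2022-JP can be recovered as Unicode ...
round_tripped = legacy_bytes.decode("iso2022_jp")

# ... but not every Unicode string fits into ISO-2022-JP.  A Hangul
# syllable is outside the character sets this codec covers.
korean = "\ud55c"
try:
    korean.encode("iso2022_jp")
    korean_fits = True
except UnicodeEncodeError:
    korean_fits = False
```

Nothing here forces the data to be stored as Unicode; the bytes can
stay in ISO-2022-JP and be decoded only when a program needs to look
at characters.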

> http://www.w3.org/MarkUp/html-spec/charset-harmful.html

I know that document well.

--Guido van Rossum (home page: http://www.python.org/~guido/)