[I18n-sig] Re: [Python-Dev] Unicode debate
Guido van Rossum
guido@python.org
Fri, 28 Apr 2000 10:50:05 -0400
[Paul Prescod]
> I think that maybe an important point is getting lost here. I could be
> wrong, but it seems that all of this emphasis on encodings is misplaced.
In practical applications that manipulate text, encodings creep up all
the time. I remember a talk or message by Andy Robinson about the
messiness of producing printed reports in Japanese for a large
investment firm. Most of the issues that took his time had to do
with encodings, if I recall correctly. (Andy, do you remember what
I'm talking about? Do you have a URL?)
> > The truth of the matter is: the encoding of string objects is in the
> > mind of the programmer. When I read a GIF file into a string object,
> > the encoding is "binary goop".
>
> IMHO, it's a mistake of history that you would even think it makes sense
> to read a GIF file into a "string" object and we should be trying to
> erase that mistake, as quickly as possible (which is admittedly not very
> quickly) not building more and more infrastructure around it. How can we
> make the transition to a "binary goops are not strings" world easiest?
I'm afraid that's a bigger issue than we can solve for Python 1.6.
We're committed to by and large backwards compatibility while
supporting Unicode -- the backwards compatibility with tons of
extension modules (many 3rd party) requires that we deal with 8-bit
strings in basically the same way as we did before.
> > The moral of all this? 8-bit strings are not going away.
>
> If that is a statement of your long term vision, then I think that it is
> very unfortunate. Treating string literals as if they were isomorphic
> with byte arrays was probably the right thing in 1991 but it won't be in
> 2005.
I think you're a tad too optimistic about the evolution speed of
software (Windows 2000 *still* has to support DOS programs), but I see
your point. As I stated in another message, in Python 3000 we'll have
to consider a more Java-esque solution: *character* strings are
Unicode, and for bytes we have (mutable!) byte arrays. Certainly 8-bit
bytes as the smallest storage unit aren't going away.
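
A rough sketch of what that split could look like, written in Python 3
syntax (which postdates this message, so treat it as an illustration
of the idea rather than the 1.6 design; the variable names are made up
for the example):

```python
# Character strings hold Unicode code points; raw bytes live in a
# separate, mutable type. The names below are illustrative only.
text = "caf\u00e9"                      # 4 characters, the last is U+00E9
data = bytearray(text.encode("utf-8"))  # 5 bytes: U+00E9 takes 2 in UTF-8

# The byte array can be changed in place; the character string cannot.
data[0] = ord("C")

# Going back to characters requires naming the encoding explicitly.
decoded = data.decode("utf-8")
print(len(text), len(data), decoded)
```

The point of the sketch is that the character count and the byte count
differ, and that the encoding has to be stated whenever you cross from
one type to the other.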
> It doesn't meet the definition of string used in the Unicode spec., nor
> in XML, nor in Java, nor at the W3C nor in most other up and coming
> specifications.
OK, so that's a good indication of where you're coming from. Maybe
you should spend a little more time in the trenches and a little less
in standards bodies. Standards are good, but sometimes disconnected
from reality (remember ISO networking? :-).
> From the W3C site:
>
> "While ISO-2022-JP is not sufficient for every ISO10646 document, it is
> the case that ISO10646 is a sufficient document character set for any
> entity encoded with ISO-2022-JP."
And this is exactly why encodings will remain important: entities
encoded in ISO-2022-JP have no compelling reason to be recoded
permanently into ISO10646, and there are lots of forces that make it
convenient to keep them encoded in ISO-2022-JP (like existing tools).
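
The asymmetry in the W3C quote can be sketched as follows, assuming
Python 3 and its standard "iso2022_jp" codec (neither is from the
original message; the sample characters are arbitrary choices):

```python
# Data stored as ISO-2022-JP can be decoded to Unicode for processing
# and written back out unchanged, so it never has to be recoded for good.
original = "\u65e5\u672c\u8a9e"            # three kanji, as Unicode text
stored = original.encode("iso2022_jp")     # bytes as a legacy tool keeps them
decoded = stored.decode("iso2022_jp")      # Unicode view for processing
roundtrip = decoded.encode("iso2022_jp")   # back to the legacy form

# The reverse does not hold: not every Unicode character fits.
try:
    "\u0905".encode("iso2022_jp")          # DEVANAGARI LETTER A
    fits = True
except UnicodeEncodeError:
    fits = False

print(decoded == original, roundtrip == stored, fits)
```

So Unicode works as the common character set while the bytes on disk
stay in whatever encoding the existing tools expect.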
> http://www.w3.org/MarkUp/html-spec/charset-harmful.html
I know that document well.
--Guido van Rossum (home page: http://www.python.org/~guido/)