[I18n-sig] Re: [Python-Dev] Unicode debate

Paul Prescod paul@prescod.net
Sat, 29 Apr 2000 09:18:05 -0500


Guido van Rossum wrote:
> 
> [Paul Prescod]
> > I think that maybe an important point is getting lost here. I could be
> > wrong, but it seems that all of this emphasis on encodings is misplaced.
> 
> In practical applications that manipulate text, encodings creep up all
> the time.  

I'm not saying that encodings are unimportant. I'm saying that they
are *different* than what Fredrik was talking about. He was talking
about a coherent logical model for characters and character strings
based on the conventions of more modern languages and systems than C and
Python.

> > How can we
> > make the transition to a "binary goops are not strings" world easiest?
> 
> I'm afraid that's a bigger issue than we can solve for Python 1.6.

I understand that we can't fix the problem now. I just think that we
shouldn't go out of our way to make it worse.

If we make byte-array strings "magically" cast themselves into
character-strings, people will expect that behavior forever.
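
To make the worry concrete, here is a minimal sketch of why a byte
array cannot safely cast itself into a character string without being
told an encoding. It is written in present-day Python 3 spelling, which
is an assumption on my part and not the 1.6 API under discussion; the
sample bytes and the three encodings are chosen only for illustration.

```python
# Sketch only: modern Python 3 spelling, not the 1.6 API under discussion.
raw = b"caf\xe9"  # four bytes; nothing in them names an encoding

# The same bytes yield different characters under different assumed encodings.
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")

# Under UTF-8 this byte sequence is not valid at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp437), utf8_ok)
```

Any "magic" cast has to pick one of these readings silently, and code
written against that choice will depend on it from then on.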

> > It doesn't meet the definition of string used in the Unicode spec., nor
> > in XML, nor in Java, nor at the W3C nor in most other up and coming
> > specifications.
> 
> OK, so that's a good indication of where you're coming from.  Maybe
> you should spend a little more time in the trenches and a little less
> in standards bodies.  Standards are good, but sometimes disconnected
> from reality (remember ISO networking? :-).

As far as I know, XML and Java are used a fair bit in the real
world...even somewhat in Asia. In fact, there is a book titled "XML and
Java" written by three Japanese men.

> And this is exactly why encodings will remain important: entities
> encoded in ISO-2022-JP have no compelling reason to be recoded
> permanently into ISO10646, and there are lots of forces that make it
> convenient to keep it encoded in ISO-2022-JP (like existing tools).

You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is
a character *set* and not an encoding. ISO-2022-JP says how you should
represent characters in terms of bits and bytes. ISO10646 defines a
mapping from integers to characters.

They are both important, but separate. I think that this automagical
re-encoding conflates them.
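
A small sketch of the distinction, again in present-day Python 3
spelling (an assumption, not anything proposed for 1.6): the character
set assigns each character one integer, while each encoding turns those
same integers into a different sequence of bytes. The sample text and
the list of encodings are illustrative only.

```python
# Sketch only: one character string, several byte representations.
text = "\u65e5\u672c\u8a9e"  # three characters

# Character set view: one integer (code point) per character.
code_points = [ord(ch) for ch in text]

# Encoding view: the same characters as bytes, one result per encoding.
encodings = ["iso2022_jp", "utf-8", "utf-16-be"]
encoded = {name: text.encode(name) for name in encodings}

# Decoding with the matching encoding recovers the same characters.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name, data in encoded.items():
    print(name, len(data), data.hex())
print([hex(cp) for cp in code_points])
```

The byte strings differ from one encoding to the next, but the code
points never change. That is the sense in which the two layers are
separate.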

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html