[Python-Dev] bytes / unicode

Glyph Lefkowitz glyph at twistedmatrix.com
Tue Jun 22 07:22:22 CEST 2010


On Jun 21, 2010, at 2:17 PM, P.J. Eby wrote:

> One issue I remember from my "enterprise" days is some of the Asian-language developers at NTT/Verio explaining to me that Unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need "bytes plus encoding" in order to properly express something.

What I have heard in passing from a couple of folks with experience in this area is that some older software in Asia would present characters differently depending on whether they were originally encoded in a "Japanese" encoding or a "Chinese" encoding, even though they were really "the same" characters.
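
To make that concrete, here is a minimal Python 3 sketch, not taken from any of the software mentioned above. It assumes U+76F4 as the sample character (my choice; it exists in both legacy character sets) and uses the standard "shift_jis" and "gb2312" codecs as stand-ins for a "Japanese" and a "Chinese" encoding. The names sjis_bytes, gb_bytes, bytes_differ and same_text are purely illustrative.

```python
# U+76F4 is a unified Han character present in both legacy character sets
# (sample character chosen for illustration only).
ch = "\u76f4"

sjis_bytes = ch.encode("shift_jis")  # stand-in for a "Japanese" encoding
gb_bytes = ch.encode("gb2312")       # stand-in for a "Chinese" encoding

# The byte sequences differ, so byte-oriented software could tell
# where the data came from and pick a presentation accordingly.
bytes_differ = sjis_bytes != gb_bytes

# After decoding, both yield the same code point; the hint about the
# original encoding is no longer carried by the string itself.
same_text = sjis_bytes.decode("shift_jis") == gb_bytes.decode("gb2312")

print(bytes_differ, same_text)
```

Once the data has been decoded, the string alone no longer says which encoding it arrived in.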

I do know that Han Unification is a giant political mess (<http://en.wikipedia.org/wiki/Han_unification> makes for some interesting reading), but my understanding is that it has handled enough of the cases by now that one can write software to display Asian languages and it will basically work with a modern version of Unicode.  (And of course, there's always the private use area, as Stephen Turnbull pointed out.)

Regardless, this is another example where keeping around a string isn't really enough.  If you need to display a Japanese character in a distinct way because you are operating in the Japanese *script*, you need a tag surrounding your data that is a hint to its presentation.  The fact that these presentation hints were sometimes determined by the data's encoding is an unfortunate historical accident.
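
One way such a tag might look, as a rough sketch rather than a proposal for any particular API: keep the decoded text next to an explicit language hint, and let the display layer use the hint. TaggedText and to_html are hypothetical names invented for this example, and the choice of an HTML lang attribute with tags such as "ja" or "zh-Hans" is an assumption about the output format, not something this thread specifies.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class TaggedText:
    """Hypothetical container: Unicode text plus a presentation hint."""
    text: str  # already-decoded Unicode text
    lang: str  # language tag used as a display hint, e.g. "ja" or "zh-Hans"


def to_html(t: TaggedText) -> str:
    # Carry the hint in markup so a renderer can choose suitable glyphs;
    # the encoding the text originally arrived in plays no part here.
    return '<span lang="{}">{}</span>'.format(
        escape(t.lang, quote=True), escape(t.text)
    )


print(to_html(TaggedText("\u76f4", "ja")))
print(to_html(TaggedText("\u76f4", "zh-Hans")))
```

The same code point appears in both outputs; only the surrounding tag differs, which is where the presentation hint belongs.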


