Re: [XML-SIG] Python 1.6a2 Unicode experiences?
Andy Robinson wrote:
- you can work with old fashioned strings, which are understood by everyone to be arrays of bytes, and there is no magic conversion going on. The bytes in literal strings in your script file are the bytes that end up in the program.
Who is "everyone"? Are you saying that CP4E hordes are going to understand that the syntax "abcde" is constructing a *byte array*? It seems like you think that Python users are going to be more sophisticated in their understanding of these issues than Java programmers. In most other things, Python is simpler.
...
I'm also convinced that the majority of Python scripts won't need to work in Unicode.
Anything working with XML will need to be Unicode. Anything working with the Win32 API (especially COM) will want to do Unicode. Over time the entire Web infrastructure will move to Unicode. Anything written in JPython pretty much MUST use Unicode (doesn't it?).
Even working with exotic languages, there is always a native 8-bit encoding.
Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use 8-bit encodings of Unicode if you want.
--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html
Paul Prescod [paul@prescod.net] wrote:
I'm also convinced that the majority of Python scripts won't need to work in Unicode.
Anything working with XML will need to be Unicode. Anything working with the Win32 API (especially COM) will want to do Unicode. Over time the entire Web infrastructure will move to Unicode. Anything written in JPython pretty much MUST use Unicode (doesn't it?).
I disagree with this. Unicode has been around a very long time, and it's not been adopted by a lot of people for a LOT of very valid reasons.
Even working with exotic languages, there is always a native 8-bit encoding.
Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use 8-bit encodings of Unicode if you want.
Um, if you go:
JIS -> Unicode -> JIS
you don't get the same thing out that you put in (at least this is what I've been told by a lot of Japanese developers), and therefore it's not terribly popular because of the nature of the Japanese (and Chinese) language.
My experience with Unicode is that a lot of Western people think it's the answer to every problem asked, while most Asian language people disagree vehemently. This says the problem isn't solved yet, even if people wish to deny it.
Chris
--
| Christopher Petrilli | petrilli@amber.org
[Note: This discussion should all move to i18n-sig... CCing there]
Christopher Petrilli wrote:
Paul Prescod [paul@prescod.net] wrote:
Even working with exotic languages, there is always a native 8-bit encoding.
Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use 8-bit encodings of Unicode if you want.
Um, if you go:
JIS -> Unicode -> JIS
you don't get the same thing out that you put in (at least this is what I've been told by a lot of Japanese developers), and therefore it's not terribly popular because of the nature of the Japanese (and Chinese) language.
My experience with Unicode is that a lot of Western people think it's the answer to every problem asked, while most Asian language people disagree vehemently. This says the problem isn't solved yet, even if people wish to deny it.
Isn't this a problem of the translation rather than Unicode itself (Andy mentioned several times that you can use the private BMP areas to implement 1-1 round-trips) ?
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
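The private-use idea mentioned above can be sketched in present-day Python. This is only an illustration under stated assumptions, not anything proposed in the thread: the handler name "pua_escape" and the choice of U+E000 as the base are hypothetical, and the ASCII codec stands in for a legacy encoding with unmapped bytes. Bytes that the codec cannot decode are parked in the Private Use Area and restored on the way back out.

```python
import codecs

# Hypothetical base: byte 0xNN is parked at code point U+E0NN in the BMP
# Private Use Area (U+E000..U+F8FF).
PUA_BASE = 0xE000


def pua_escape(exc):
    """Error handler that maps undecodable bytes to PUA code points and back."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Each offending byte becomes one private-use character.
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch) - PUA_BASE
            if not 0 <= code <= 0xFF:
                raise exc  # a genuinely unencodable character, not one of ours
            out.append(code)
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua_escape", pua_escape)

legacy = b"abc\xff"                           # 0xFF has no meaning in ASCII
text = legacy.decode("ascii", "pua_escape")   # 0xFF is parked at U+E0FF
restored = text.encode("ascii", "pua_escape")
```

Whether a real converter round-trips this way depends on its mapping tables; the sketch only shows that the mechanism is lossless when both directions agree on the private-use assignments.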
M.-A. Lemburg writes:
Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use 8-bit encodings of Unicode if you want.
This is meaningless: legacy encodings of national character sets such as Shift-JIS, Big Five, GB2312, or TIS620 are not "encodings" of Unicode. TIS620 is a single-byte, 8-bit encoding: each character is represented by a single byte. The Japanese and Chinese encodings are multibyte, 8-bit encodings. ISO-2022 is a multi-byte, 7-bit encoding for multiple character sets.
Unicode has several possible encodings: UTF-8, UCS-2, UCS-4, UTF-16... You can view all of these as 8-bit encodings, if you like. Some are multibyte (such as UTF-8, where each character in Unicode is represented in 1 to 3 bytes) while others are fixed length, two or four bytes per character.
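To make the byte widths concrete, here is a minimal sketch using the codecs that ship with present-day Python (not the 1.6a2 API under discussion). The helper name encoded_lengths is hypothetical. Note that under the current standard UTF-8 takes up to 4 bytes for characters outside the BMP, and UTF-16 uses a 4-byte surrogate pair for them, so only UTF-32 (UCS-4) is strictly fixed width.

```python
def encoded_lengths(ch):
    """Return the number of bytes one character occupies in each Unicode encoding."""
    return {codec: len(ch.encode(codec))
            for codec in ("utf-8", "utf-16-be", "utf-32-be")}


ascii_letter = "a"            # U+0061
accented = "\u00e9"           # U+00E9 LATIN SMALL LETTER E WITH ACUTE
hiragana = "\u3042"           # U+3042 HIRAGANA LETTER A
outside_bmp = "\U00010330"    # U+10330 GOTHIC LETTER AHSA

for ch in (ascii_letter, accented, hiragana, outside_bmp):
    print("U+%04X" % ord(ch), encoded_lengths(ch))

# A legacy single-byte encoding, by contrast, is a different character set
# with its own byte assignments: one Thai character is one byte in TIS-620.
thai_ko_kai = "\u0e01"
print(len(thai_ko_kai.encode("tis_620")))
```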
Um, if you go:
JIS -> Unicode -> JIS
you don't get the same thing out that you put in (at least this is what I've been told by a lot of Japanese developers), and therefore it's not terribly popular because of the nature of the Japanese (and Chinese) language.
This is simply not true any more. The ability to round trip between Unicode and legacy encodings is dependent on the software: being able to use code points in the PUA for this is acceptable and commonly done. The big advantage is in using Unicode as a pivot when transcoding between different CJK encodings. It is very difficult to map between, say, Shift JIS and GB2312, directly. However, Unicode provides a good go-between. It isn't a panacea: transcoding between legacy encodings like GB2312 and Big Five is still difficult: Unicode or not.
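The pivot approach described above can be sketched with the codecs in present-day Python's standard library. The helper names transcode, round_trips, and can_encode are hypothetical, and the sample text is chosen to exist in both character sets; real data will not always be so cooperative.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by decoding to Unicode text, then encoding again."""
    return data.decode(source).encode(target)


def round_trips(data: bytes, codec: str) -> bool:
    """Check that legacy -> Unicode -> legacy reproduces the original bytes."""
    return data.decode(codec).encode(codec) == data


def can_encode(text: str, codec: str) -> bool:
    """Report whether every character of text exists in the target encoding."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# U+65E5 U+672C ("Japan"): both characters exist in Shift JIS and in GB2312.
sjis_bytes = "\u65e5\u672c".encode("shift_jis")
gb_bytes = transcode(sjis_bytes, "shift_jis", "gb2312")
```

The "isn't a panacea" caveat shows up as soon as a character has no counterpart in the target set. For example, can_encode("\u8a9e", "gb2312") should report False, since GB2312 is expected to carry only the simplified form of that character; the pivot tells you where the mapping fails but cannot invent one.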
My experience with Unicode is that a lot of Western people think it's the answer to every problem asked, while most Asian language people disagree vehemently. This says the problem isn't solved yet, even if people wish to deny it.
This is a shame: it is an indication that they don't understand the technology. Unicode is a tool: nothing more.
-tree
--
Tom Emerson Basis Technology Corp.
Language Hacker http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
[Note: This discussion should all move to i18n-sig... CCing there]
Christopher Petrilli wrote:
you don't get the same thing out that you put in (at least this is what I've been told by a lot of Japanese developers), and therefore it's not terribly popular because of the nature of the Japanese (and Chinese) language.
My experience with Unicode is that a lot of Western people think it's the answer to every problem asked, while most Asian language people disagree vehemently. This says the problem isn't solved yet, even if people wish to deny it.
[Marc-Andre Lemburg]
Isn't this a problem of the translation rather than Unicode itself (Andy mentioned several times that you can use the private BMP areas to implement 1-1 round-trips) ?
Maybe, but apparently such high-quality translations are rare (note that Andy said "can").
Anyway, a word of caution here. Years ago I attended a number of IETF meetings on internationalization, in a time when Unicode wasn't as accepted as it is now. The one thing I took away from those meetings was that this is a *highly* emotional and controversial issue.
As the Python community, I feel we have no need to discuss "why Unicode." Therein lies madness, controversy, and no progress. We know there's a clear demand for Unicode, and we've committed to support it. The question now at hand is "how Unicode." Let's please focus on that, e.g. in the other thread ("Unicode debate") in i18n-sig and python-dev.
--Guido van Rossum (home page: http://www.python.org/~guido/)