Re: [Python-Dev] PEP 263 considered faulty (for some Japanese)

And TextEdit cannot save as UTF-8?
It can. But doing so suffers from "mojibake".
The primary reason why this is not supported is different, though: it would complicate the implementation significantly, at least the phase 1 implementation. If people contribute a phase 2 implementation that supports the UTF-16 BOM as a side effect, I would personally reconsider.
OK, I will write a sample implementation of "stage 2" as soon as possible, and put it in the public domain.

Anyway, until stage 2 becomes a reality, you can write Japanese Python files only in either EUC-JP or UTF-8 unless you hack up the interpreter, so Python remains unsatisfactory to many Japanese users until the day of UTF-8. We should either hurry up or keep waiting.

As for UTF-16 with a BOM, any text outside Unicode literals should be translated into UTF-8 (not UTF-16). This is the only logical choice, since UTF-8 is strictly ASCII-compatible and can map all the characters in Unicode naturally. (A rough sketch of this translation step follows at the end of this message.) You would write source code in UTF-16 as follows:

    s = '<characters>'
    ...
    u = unicode(s, 'utf-8')  # not utf-16!

This suggests to me that the implementation will be somewhat like what Stephen J. Turnbull sketches...

N.B. one should write a binary (not character, but, say, image or audio) data literal as follows:

    b = '\x89\xAB\xCD\xEF'

The stage 2 implementation will translate it into UTF-8 as exactly the same text :-)

    b = '\x89\xAB\xCD\xEF'

Hence there is no problem in translating a UTF-16 file into UTF-8. (In any case, UTF-16 Python files are simply impossible for now, so allowing them does not hurt anyone.)

--
SUZUKI Hisao <suzuki@acm.org>
>>> def fib(n): return reduce(lambda x, y: (x,x[0][-1]+x[1]), [()]*n, ((0L,),1L))
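A rough sketch of that translation step (an illustration only, not the actual patch; the function name is made up): a UTF-16 source file carrying a BOM is decoded and re-encoded as UTF-8 before the tokenizer ever sees it.

    import codecs

    def recode_utf16_source(path):
        f = open(path, 'rb')
        raw = f.read()
        f.close()
        # A UTF-16 BOM identifies the file and selects the byte order.
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            text = raw.decode('utf-16')    # the BOM is consumed here
            return text.encode('utf-8')    # hand UTF-8 to the rest of the compiler
        return raw                         # files without a BOM are left untouched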

"SUZUKI Hisao" <suzuki@acm.org> writes:
And TextEdit cannot save as UTF-8?
It can. But doing so suffers from "mojibake".
You mean, it won't read it back in properly? Is that because it won't auto-detect the encoding, or does it not even support opening files as UTF-8? Could it be told to write a UTF-8 signature into the file? Would that help autodetection?
Anyway, until stage 2 becomes a reality, you can write Japanese Python files only in either EUC-JP or UTF-8 unless you hack up the interpreter, so Python remains unsatisfactory to many Japanese users until the day of UTF-8. We should either hurry up or keep waiting.
I expect that the localization patches that circulate now will continue to apply (perhaps with minimal modifications) after stage 1 is implemented. If the patches are enhanced to do the "right thing" (i.e. properly take into consideration the declared encoding, to determine the end of a string), people won't notice the difference compared to a full stage 2 implementation.
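To see why the declared encoding matters for finding the end of a string, consider this small illustration (my own example, not from any patch): in Shift_JIS the character U+8868 is the byte pair 0x95 0x5C, and 0x5C is '\', so an encoding-unaware tokenizer takes the trailing byte as an escape of the closing quote and misparses the literal.

    ch = u'\u8868'.encode('shift_jis')    # the character U+8868 ("table")
    print(repr(ch))                       # '\x95\\' -- the second byte is a backslash

A tokenizer that knows the declared encoding decodes the line first and never mistakes that trailing byte for an escape.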
As for UTF-16 with a BOM, any text outside Unicode literals should be translated into UTF-8 (not UTF-16). This is the only logical choice, since UTF-8 is strictly ASCII-compatible and can map all the characters in Unicode naturally.
Well, no. If UTF-16 is supported as an input encoding in stage 2, it will follow the same process as any other input encoding: the byte string literals will be converted back to UTF-16. Any patch that special-cases UTF-16 will be rejected.
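A rough illustration of the difference (my own example, using U+3042 as the character): under this rule a plain literal in a UTF-16 source yields UTF-16 bytes at run time, whereas under the proposal above it would yield UTF-8 bytes.

    ch = u'\u3042'                        # HIRAGANA LETTER A
    print(repr(ch.encode('utf-8')))       # '\xe3\x81\x82' -- what the UTF-8 proposal would store
    print(repr(ch.encode('utf-16-be')))   # '0B', i.e. bytes 0x30 0x42 -- re-encoded to the source encoding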
You would write source code in UTF-16 as follows:
    s = '<characters>'
    ...
    u = unicode(s, 'utf-8')  # not utf-16!
No, that won't work. Instead, you *should* write

    u = u'<characters>'

No need to call a function.
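For instance, with the encoding declared in the PEP 263 way (using 文字 merely as a stand-in for the Japanese text), the compiler itself decodes the literal and no runtime conversion is needed:

    # -*- coding: utf-8 -*-
    u = u'文字'    # already a Unicode string at run time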
N.B. one should write a binary (not character, but, say, image or audio) data literal as follows:
b = '\x89\xAB\xCD\xEF'
I completely agree. Binary data should use hex escapes. That will make an interesting challenge for any stage 2 implementation, BTW: \x89 shall denote byte 0x89 no matter what the input encoding was. So you cannot convert \x89 to a Unicode character and expect conversion to the input encoding to do the right thing. Instead, you must apply the conversion to the source encoding only for the unescaped characters (a toy sketch of this follows below).

People have been proposing to introduce b'' strings for binary data, to allow switching 'plain' strings to denote Unicode strings at some point, but this is a different PEP.

Regards,
Martin
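A toy sketch of that rule (my own illustration, with a made-up helper name, not the tokenizer's actual code): only the unescaped characters of a plain literal pass through the source encoding, while \xNN escapes are emitted as raw bytes.

    import re

    def encode_literal_body(body, source_encoding):
        # 'body' is the already-decoded text of a plain string literal,
        # with its backslash escapes still unprocessed, e.g. u'a\\x89'.
        out = b''
        for part in re.split(r'(\\x[0-9A-Fa-f]{2})', body):
            if part.startswith('\\x'):
                out += bytes(bytearray([int(part[2:], 16)]))   # raw byte, bypasses the encoding
            elif part:
                out += part.encode(source_encoding)            # ordinary text is re-encoded
        return out

For example, with a utf-16-be source encoding, encode_literal_body(u'a\\x89', 'utf-16-be') yields the two bytes 0x00 0x61 for 'a' followed by the single raw byte 0x89.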

And TextEdit cannot save as UTF-8?
It can. But doing so suffers from "mojibake".
You mean, it won't read it back in properly?
Yes, it won't.
I expect that the localization patches that circulate now will continue to apply (perhaps with minimal modifications) after stage 1 is implemented. If the patches are enhanced to do the "right thing" (i.e. properly take into consideration the declared encoding, to determine the end of a string), people won't notice the difference compared to a full stage 2 implementation.
You do not need localization patches anymore. I have almost finished a sample implementation of "stage 2" already. It was so hard that it took two days ;-). I will post it to NetNews and python.sf.net in a few days. Yes, I will put it in the public domain.

--
SUZUKI Hisao <suzuki@acm.org> <suzuki611@oki.com>

I posted the PEP 263 phase 2 implementation to both NetNews (fj.sources) and the sourceforge.net patch manager (Request ID 534304). Please take a look. I am thankful to <stephen@xemacs.org> for giving me a hint on the implementation via his postings to [Python-Dev]. Python programs are represented in UTF-8 internally. This achieves very high compatibility with the present Python.

--
SUZUKI Hisao <suzuki@acm.org> <suzuki611@oki.com>

N.B. one should write a binary (not character, but, say, image or audio) data literal as follows:
b = '\x89\xAB\xCD\xEF'
I completely agree. Binary data should use hex escapes. That will make an interesting challenge for any stage 2 implementation, BTW: \x89 shall denote byte 0x89 no matter what the input encoding was. So you cannot convert \x89 to a Unicode character and expect conversion to the input encoding to do the right thing. Instead, you must apply the conversion to the source encoding only for the unescaped characters.
Note that it is _not_ a challenge for my implementation at all. You can use your binary strings as they are at present. Please try it.
People have been proposing to introduce b'' strings for binary data, to allow switching 'plain' strings to denote Unicode strings at some point, but this is a different PEP.
I think you need not introduce b'' strings at all; you can keep things simple as they are.

--
SUZUKI Hisao <suzuki@acm.org> <suzuki611@oki.com>

SUZUKI Hisao wrote:
People have been proposing to introduce b'' strings for binary data, to allow switching 'plain' strings to denote Unicode strings at some point, but this is a different PEP.
I think you need not introduce b'' strings at all; you can keep things simple as they are.
the reason for adding b-strings isn't to keep the implementation simple, it's because we want to get rid of the difference between u-strings and 8-bit text strings in the future. in today's Python, mixing u-strings with 8-bit text is anything but simple (a small example follows below). </F>
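A small example of the kind of trouble this causes (illustration only): in today's Python, combining a Unicode string with non-ASCII 8-bit text triggers an implicit ASCII decode and fails.

    s = '\x89\xab'   # 8-bit string holding non-ASCII bytes (EUC-JP text, image data, ...)
    u = u'abc'
    u + s            # Python 2 implicitly tries s.decode('ascii') here -> UnicodeDecodeError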

SUZUKI Hisao <suzuki611@oki.com> writes:
Note that it is _not_ a challenge for my implementation at all. You can use your binary strings as they are at present. Please try it.
Actually, I did (see my comments on sf): in a Unicode string, escape processing of, say, u"\ö" works incorrectly in your implementation, and in a plain string, processing is incorrect if you have an encoding in which a multi-byte character can have '\' as its second byte.
People have been proposing to introduce b'' strings for binary data, to allow switching 'plain' strings to denote Unicode strings at some point, but this is a different PEP.
I think you need not introduce b'' strings at all; you can keep things simple as they are.
The rationale is different: people were proposing that all string literals should be Unicode strings; the question then is how to denote byte strings.

Regards,
Martin