[I18n-sig] Re: a unicode string on IDLE shell
Guido van Rossum
guido@python.org
Mon, 10 Apr 2000 10:20:34 -0400
> Dear Guido,
>
> I tried your latest CPython (Python 1.6a1) on the Japanese version of Win98
> and found some strange IDLE shell behavior.
>
> I'm not sure whether this is a bug or a feature, so I am reporting it anyway.
>
> When I type a Japanese string in the IDLE shell with an IME,
> Tk 8.3 seems to convert it to a UTF-8 representation.
> Unfortunately, Python does not know this,
> so the input is treated as an ordinary string.
>
> >>> s = raw_input(">>>")
> Type Japanese characters with the IME,
> for example あ
> (This is the first character of the Japanese hiragana syllabary)
> >>> s
> '\343\201\202' # UTF-8 encoded
> >>> print s
> あ # The proper glyph appears on the screen
>
> The print statement in the IDLE shell works fine with a UTF-8 encoded
> string; however, slice operations and len() do not work on it.
> # I know this is the correct result
>
> So I have to convert this string with unicode().
>
> >>> u = unicode(s)
> >>> u
> u'\u3042'
> >>> print u
> あ # The proper glyph appears on the screen
>
> Do you think this conversion is inconvenient?
>
> I think this behavior is inconsistent with command line Python
> and PythonWin.
>
> If I want the same result in the command line Python shell or the PythonWin shell,
> I have to write the following:
> >>> s = raw_input(">>>")
> Type Japanese characters with the IME,
> for example あ
> >>> s
> '\202\240' # Shift-JIS encoded
> >>> print s
> あ # The proper glyph appears on the screen
> >>> u = unicode(s,"mbcs") # if I use unicode(s) then UnicodeError is raised!
> >>> print u.encode("mbcs") # if I use print u then the wrong glyph appears
> あ # The proper glyph appears on the screen
>
> This difference is confusing!
> I do not have a good solution for this annoyance, but I hope that at least
> the IDLE shell and the PythonWin shell will behave the same way.
>
> Thank you for reading.
>
> Best Regards,
>
> takeuchi

Dear Takeuchi,

This is a feature. Tcl/Tk uses UTF-8 to encode Unicode characters
throughout. This perfectly matches the Python 1.6 default use of
UTF-8 when 8-bit strings are converted to Unicode. If you want to
manipulate Unicode strings, you have to use unicode() to convert them
to Unicode string objects.
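
To make the difference concrete, here is a minimal sketch of why len() and
slicing mislead on the undecoded string. It assumes a modern Python 3
interpreter, where a bytes object plays the role of the 8-bit string and
bytes.decode() plays the role of unicode(); the names raw and text are
purely illustrative.

```python
# Python 3 sketch (an assumption; the message itself is about Python 1.6).
# "raw" stands in for the 8-bit string that IDLE returns from Tk.
raw = b"\343\201\202"        # octal escapes: the UTF-8 bytes of U+3042

# Decoding is the modern equivalent of unicode(s) with a UTF-8 default.
text = raw.decode("utf-8")

print(len(raw))              # counts bytes
print(len(text))             # counts characters
print(raw[:1])               # slicing bytes cuts the character apart
print(text == "\u3042")      # the decoded value is HIRAGANA LETTER A
```

Only the decoded object counts and slices by character; the raw bytes merely
happen to display correctly because Tk interprets them as UTF-8.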
I may change IDLE so that if you enter Unicode, it will automatically
return a Unicode string. This may break other code, though.

Regarding incompatibilities with Pythonwin and command line Python:
note that there you get a different input encoding, but len() and
slicing are also broken until you convert to Unicode using the correct
encoding! The input encoding is simply different. I believe this
will always be an issue (but there should be a way to determine what
the input encoding should be!).
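
As a rough illustration of the point about input encodings, the sketch
below decodes the two byte strings from the quoted sessions with their
respective codecs. It assumes Python 3; the "shift_jis" codec stands in for
"mbcs", which exists only on Windows, and guess_input_encoding() is a
hypothetical helper, not an existing API.

```python
import locale
import sys


def guess_input_encoding():
    """Hypothetical helper: best-effort guess at the console input encoding."""
    # sys.stdin may be missing or lack an encoding, so fall back to the locale.
    return getattr(sys.stdin, "encoding", None) or locale.getpreferredencoding(False)


utf8_bytes = b"\343\201\202"   # bytes seen in the IDLE session
sjis_bytes = b"\202\240"       # bytes seen in the command line session

# Each input must be decoded with the encoding it actually arrived in.
from_idle = utf8_bytes.decode("utf-8")
from_console = sjis_bytes.decode("shift_jis")
print(from_idle == from_console)   # both are the same character once decoded

# Using the wrong codec fails, as unicode(s) did in the quoted session.
try:
    sjis_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("wrong codec:", exc.reason)

print("guessed input encoding:", guess_input_encoding())
```

Once both inputs are decoded with the right codec they compare equal, so the
remaining problem is only knowing which codec applies to a given shell.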
If you have more questions about this, please subscribe to the
i18n-sig mailing list (http://www.python.org/sigs/i18n-sig/) -- this
is where issues like this are discussed. I'm cc'ing this there.

--Guido van Rossum (home page: http://www.python.org/~guido/)