[I18n-sig] Re: a unicode string on IDLE shell

Mon, 10 Apr 2000 10:20:34 -0400

> Dear Guido,
> 
> I played with your latest CPython (Python 1.6a1) on Win98 Japanese version,
> and found a strange IDLE shell behavior.
> 
> I'm not sure whether this is a bug or a feature, so I am reporting it anyway.
> 
> When I type a Japanese string into the IDLE shell with an IME,
> Tk 8.3 seems to convert it to a UTF-8 representation.
> Unfortunately, Python does not know this,
> so it is treated as an ordinary string.
> 
> >>> s = raw_input(">>>")
> Type Japanese characters with IME
> for example あ
> (This is the first character of the Japanese alphabet, Hiragana)
> >>> s
>  '\343\201\202'   # UTF-8 encoded
> >>> print s
> あ                     # The proper glyph appears on the screen
> 
> The print statement in the IDLE shell works fine with a UTF-8 encoded
> string; however, slicing and len() do not work.
>  # I know this is the correct result
> 
> So I have to convert this string with unicode().
> 
> >>> u = unicode(s)
> >>> u
> u'\u3042'
> >>> print u
> あ                     # The proper glyph appears on the screen
> 
> Do you think this conversion is inconvenient?
> 
> I think this behavior is inconsistent with command-line Python
> and PythonWin.
> 
> If I want the same result in the command-line Python shell or the PythonWin shell,
> I have to write the following:
> >>> s = raw_input(">>>")
> Type Japanese characters with IME
> for example あ
> >>> s
> '\202\240'  # Shift-JIS encoded
> >>> print s
> あ                     # The proper glyph appears on the screen
> >>> u = unicode(s,"mbcs")  # if I use unicode(s), a UnicodeError is raised!
> >>> print u.encode("mbcs")  # if I use print u, the wrong glyph appears
> あ                     # The proper glyph appears on the screen
> 
> This difference is confusing!
> I do not have a good solution for this annoyance, but I hope that at least
> the IDLE shell and the PythonWin shell will behave the same way.
> 
> Thank you for reading.
> 
> Best Regards,
> 
>        takeuchi

Dear Takeuchi,

This is a feature.  Tcl/Tk uses UTF-8 to encode Unicode characters
throughout.  This perfectly matches the Python 1.6 default use of
UTF-8 when 8-bit strings are converted to Unicode.  If you want to
manipulate Unicode strings, you have to use unicode() to convert them
to Unicode string objects.
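
The difference between the 8-bit string and the Unicode string can be sketched as follows. This is only an illustration in present-day Python 3 syntax (an assumption; in 1.6 the conversion is spelled unicode(s)), and the variable names are made up for the example.

```python
# The three UTF-8 bytes that Tk returned for HIRAGANA LETTER A (U+3042)
# in the report above, written with the same octal escapes.
raw = b"\343\201\202"

# Decoding as UTF-8 yields a one-character Unicode string.
text = raw.decode("utf-8")

byte_len = len(raw)    # counts bytes, not characters
char_len = len(text)   # counts characters
first_byte = raw[:1]   # slicing the bytes cuts inside the character

print(byte_len, char_len, ascii(text), first_byte)
```

len() and slicing only do what a user expects once the bytes have been decoded; on the raw string they operate byte by byte.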

I may change IDLE so that if you enter Unicode, it will automatically
return a Unicode string.  This may break other code though.

Regarding incompatibilities with Pythonwin and command line Python:
note that there you get a different input encoding, but len() and
slicing are also broken until you convert to Unicode using the correct
encoding!  The input encoding is simply different.  I believe this
will always be an issue (but there should be a way to determine what
the input encoding should be!).
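
As a rough sketch of the input-encoding point, again in present-day Python 3 syntax rather than 1.6: the same character arrives as different bytes depending on the shell, and decoding with the wrong encoding fails. The helper guess_input_encoding is hypothetical. It relies on sys.stdin.encoding and locale.getpreferredencoding, which exist in today's standard library but were not available when this was written, and "shift_jis" stands in for the Windows-only "mbcs" codec so the example runs on any platform.

```python
import locale
import sys


def guess_input_encoding():
    """Hypothetical helper: best guess at the encoding of typed input."""
    # sys.stdin may be None or lack an encoding in some GUI shells,
    # so fall back to the locale's preferred encoding.
    encoding = getattr(sys.stdin, "encoding", None)
    return encoding or locale.getpreferredencoding(False)


utf8_bytes = b"\343\201\202"   # bytes reported from the IDLE shell
sjis_bytes = b"\202\240"       # bytes reported from the command-line shell

from_idle = utf8_bytes.decode("utf-8")
from_console = sjis_bytes.decode("shift_jis")

# Decoding the Shift-JIS bytes as UTF-8 is an error, not a silent mismatch.
try:
    sjis_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(from_idle == from_console, wrong_codec_failed, guess_input_encoding())
```

Both byte sequences decode to the same character once the right encoding is named, which is why knowing the input encoding matters more than which shell produced the bytes.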

If you have more questions about this, please subscribe to the
i18n-sig mailing list (http://www.python.org/sigs/i18n-sig/) -- this
is where issues like this are discussed.  I'm cc'ing this there.

--Guido van Rossum (home page: http://www.python.org/~guido/)