[Python-Dev] Unicode input issues

Mon, 10 Apr 2000 10:40:19 -0400

Thinking more about entering Japanese into raw_input() in IDLE, I
thought I had figured out a way to give Takeuchi a Unicode string when
he enters Japanese characters.

I added an experimental patch to the readline method of the PyShell
class: if the line just read, when converted to Unicode, has fewer
characters but still compares equal (and no exceptions happen during
this test) then return the Unicode version.
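
A rough sketch of that test, written for a present-day Python where
8-bit data is bytes and Unicode text is str. The helper name
maybe_unicode and the UTF-8 default are assumptions made for
illustration, and the round-trip check stands in for "still compares
equal"; this is not the actual PyShell patch.

```python
def maybe_unicode(line, encoding="utf-8"):
    """Return decoded text if `line` shrinks when decoded, else `line` itself.

    `line` is the raw 8-bit data just read. The encoding default is an
    assumption for this sketch.
    """
    try:
        text = line.decode(encoding)
    except UnicodeDecodeError:
        # An exception during the test means: keep the 8-bit version.
        return line
    # Fewer characters than bytes implies multi-byte sequences were present;
    # the round-trip comparison stands in for "still compares equal".
    if len(text) < len(line) and text.encode(encoding) == line:
        return text
    return line
```

Pure ASCII input decodes to the same number of characters, so it is
returned unchanged as an 8-bit string.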

This doesn't currently work because the built-in raw_input() function
requires that the readline() call it makes internally returns an 8-bit
string.  Should I relax that requirement in general?  (I could also
just replace __builtin__.[raw_]input with more liberal versions
supplied by IDLE.)

I also discovered that the built-in unicode() function is not
idempotent: unicode(unicode('a')) returns u'\000a'.  I think it should
special-case this and return u'a' !
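
The special case I have in mind could look roughly like this. It is
only a sketch in present-day Python terms (str for Unicode, bytes for
8-bit strings); to_unicode is a hypothetical helper, not the built-in,
and the UTF-8 default is an assumption.

```python
def to_unicode(value, encoding="utf-8"):
    """Convert `value` to text; already-decoded text passes through untouched."""
    if isinstance(value, str):
        # Special case: input is already Unicode, so return it as-is.
        return value
    # Otherwise treat it as 8-bit data and decode it.
    return value.decode(encoding)
```

With this check in place, applying the conversion twice gives the same
result as applying it once.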

Finally, I believe we need a way to discover the encoding used by
stdin or stdout.  I have to admit I know very little about the file
wrappers that Marc wrote -- is it easy to get the encoding out of
them?  IDLE should probably emulate this, as its encoding is clearly
UTF-8 (at least when using Tcl 8.1 or newer).
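
One possible shape for this, sketched under the assumption that a
stream advertises its encoding through an attribute named encoding (as
text streams in later Python versions do). The names stream_encoding
and PseudoStdin and the "ascii" fallback are hypothetical, chosen for
illustration; this is not how Marc's wrappers are known to work.

```python
import io


def stream_encoding(stream, default="ascii"):
    """Return the encoding a stream advertises, or `default` if it has none."""
    return getattr(stream, "encoding", None) or default


class PseudoStdin:
    """Hypothetical stand-in for an IDLE shell stream that reports UTF-8."""

    encoding = "utf-8"

    def readline(self):
        # A real implementation would return the line typed in the shell.
        return ""


# A wrapped byte stream reports the encoding it was created with.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
```

A caller such as raw_input() could then ask the stream for its
encoding rather than guessing, and fall back to a default when the
stream says nothing.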

--Guido van Rossum (home page: http://www.python.org/~guido/)