[Python-Dev] Unicode literals in Python 2.7

Sat May 2 18:02:54 CEST 2015

Adam Bartoš writes:

 > I'll describe my picture of the situation, which might be terribly wrong.
 > On Linux, in a typical situation, we have a UTF-8 terminal,
 > PYTHONIOENCODING=utf-8, GNU readline is used. When the REPL wants input
 > from a user the tokenizer calls PyOS_Readline, which calls GNU readline.
 > The user is prompted >>> , during the input he can use autocompletion and
 > everything and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as
 > char* or something),

It's char*, according to Parser/myreadline.c.  It is not str in Python
2.

 > which is UTF-8 encoded input from the user.

By default, it's just ASCII-compatible bytes.  I don't know offhand
where, but somehow PYTHONIOENCODING tells Python it's UTF-8 -- that's
how Python knows about it in this situation.

 > The tokenizer, parser, and evaluator process the input and the result is
 > u'\u03b1', which is printed as an answer.
 >
 > In my case I install custom sys.std* objects and a custom readline
 > hook.  Again, the tokenizer calls PyOS_Readline, which calls my
 > readline hook, which calls sys.stdin.readline(),

This is your custom version?

 > which returns a Unicode string a user entered (it was decoded from
 > UTF-16-LE bytes actually). My readline hook encodes this string to
 > UTF-8 and returns it. So the situation is the same.  The tokenizer
 > gets b"u'\xce\xb1'" as before, but now it results in u'\xce\xb1'.
 > 
 > Why is the result different?

The result is different because Python doesn't "learn" that the actual
encoding is UTF-8.  If you have tried setting PYTHONIOENCODING=utf-8
with your setup and that doesn't work, I'm not sure where the
communication is failing.
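
To make the difference concrete, here is a minimal sketch of the round
trip you describe.  The variable names are mine, and the Latin-1 decode
is only my assumption about what "encoding unknown" amounts to here (one
code point per byte); I have not traced that in the tokenizer.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the bytes a UTF-16-LE console read delivers.
raw_utf16 = u"\u03b1".encode("utf-16-le")    # GREEK SMALL LETTER ALPHA

# What the custom sys.stdin.readline() would hand to the readline hook.
text = raw_utf16.decode("utf-16-le")

# What the hook returns to the tokenizer: UTF-8 encoded bytes.
hook_bytes = text.encode("utf-8")

# If the consumer knows the bytes are UTF-8, the character survives.
as_utf8 = hook_bytes.decode("utf-8")

# If it does not, and treats each byte as one code point (assumed here to
# behave like Latin-1), the two bytes become two separate characters.
as_latin1 = hook_bytes.decode("latin-1")

print(repr(hook_bytes))
print(repr(as_utf8))
print(repr(as_latin1))
```

The first decode gives the single character u'\u03b1'; the second gives
the two-character string u'\xce\xb1', which matches what you are seeing.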

The only other thing I can think of is to set sys.stdin.encoding
directly.  That may be read-only, though (which would explain why the
only way to set the encoding is via the PYTHONIOENCODING environment
variable).  At least you could find out what it is, with and without
PYTHONIOENCODING set to 'utf-8' (or maybe it's 'utf8' or 'UTF-8' --
all work as expected with unicode.encode/str.decode on Mac OS X).

Or it could be unimplemented in your replacement module.
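
A quick way to check both points is sketched below.  The helper
describe_stream_encoding is a name I made up for illustration; it only
uses getattr and codecs.lookup, and it reports None when a stream
object (such as a replacement sys.stdin) has no usable encoding
attribute.

```python
import codecs
import sys


def describe_stream_encoding(stream):
    """Return the canonical codec name for stream.encoding, or None.

    Hypothetical helper: None means the attribute is missing, empty,
    or names a codec Python does not know about.
    """
    enc = getattr(stream, "encoding", None)
    if not enc:
        return None
    try:
        return codecs.lookup(enc).name
    except LookupError:
        return None


# Run this with and without PYTHONIOENCODING set, and with the custom
# sys.stdin installed, then compare the output.
print(describe_stream_encoding(sys.stdin))

# The three spellings mentioned above resolve to the same codec.
spellings = ["utf-8", "utf8", "UTF-8"]
canonical = set(codecs.lookup(s).name for s in spellings)
print(sorted(canonical))
```

If the first print shows None for your replacement object, that would
point at the attribute simply not being implemented there.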

 > I thought that in the first case PyCF_SOURCE_IS_UTF8 might have been
 > set. And after your suggestion, I thought that
 > PYTHONIOENCODING=utf-8 is the thing that also sets
 > PyCF_SOURCE_IS_UTF8.

No.  PyCF_SOURCE_IS_UTF8 is set unconditionally in the functions
builtin_{eval,exec,compile}_impl in Python/bltinmodule.c in the cases that
matter AFAICS.  It's not obvious to me under what conditions it might
*not* be set.  It is then consulted in ast.c in PyAST_FromNodeObject,
and nowhere else that I can see.