Adam Bartoš writes:
I'll describe my picture of the situation, which might be terribly wrong. On Linux, in a typical situation, we have a UTF-8 terminal, PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants input from the user, the tokenizer calls PyOS_Readline, which calls GNU readline. The user is prompted with >>> ; during input he can use autocompletion and everything, and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as a char* or something),
It's a char*, according to Parser/myreadline.c. It is not a str in Python 2.
which is UTF-8 encoded input from the user.
By default, it's just ASCII-compatible bytes. I don't know offhand where, but somehow PYTHONIOENCODING tells Python it's UTF-8 -- that's how Python knows about it in this situation.
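One way to observe that connection (a sketch, assuming Python 3's subprocess module is available and spawning a child interpreter is acceptable) is to set PYTHONIOENCODING in a child interpreter's environment and ask it what sys.stdin reports:

```python
import os
import subprocess
import sys

# Spawn a child interpreter with PYTHONIOENCODING=utf-8 and print the
# encoding its sys.stdin object reports.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
    env=env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # typically prints: utf-8
```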
The tokenizer, parser, and evaluator process the input and the result is u'\u03b1', which is printed as an answer.
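The byte values in that round trip can be checked directly (Python 3 syntax used for illustration; the byte values are the same in Python 2):

```python
# 'α' is U+03B1 GREEK SMALL LETTER ALPHA; its UTF-8 encoding is the two
# bytes \xce\xb1, which is exactly what readline hands the tokenizer
# inside b"u'\xce\xb1'".
src = "u'α'"
raw = src.encode("utf-8")
assert raw == b"u'\xce\xb1'"

# Decoding those two bytes as UTF-8 recovers the single character.
assert b"\xce\xb1".decode("utf-8") == "\u03b1"
```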
In my case I install custom sys.std* objects and a custom readline hook. Again, the tokenizer calls PyOS_Readline, which calls my readline hook, which calls sys.stdin.readline(),
This is your custom version?
which returns a Unicode string the user entered (it was actually decoded from UTF-16-LE bytes). My readline hook encodes this string to UTF-8 and returns it. So the situation is the same: the tokenizer gets b"u'\xce\xb1'" as before, but now the result is u'\xce\xb1'.
Why is the result different?
The result is different because Python doesn't "learn" that the actual encoding is UTF-8. If you have tried setting PYTHONIOENCODING=utf-8 with your setup and that doesn't work, I'm not sure where the communication is failing. The only other thing I can think of is to set sys.stdin.encoding directly. That may be read-only, though (which would explain why the only way to set the I/O encoding is via the environment variable). At least you could find out what it is, with and without PYTHONIOENCODING set to 'utf-8' (or maybe it's 'utf8' or 'UTF-8' -- all work as expected with unicode.encode/str.decode on Mac OS X). Or it could be unimplemented in your replacement module.
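The misreading can be reproduced at the codec level (Python 3 syntax; treating each byte as its own code point, as with Latin-1, is a model of what happens when the interpreter never learns the real encoding, not necessarily the exact codec Python 2 uses internally):

```python
raw = b"\xce\xb1"  # UTF-8 encoding of 'α'

# Decoded correctly, the two bytes form one character: u'α'.
assert raw.decode("utf-8") == "\u03b1"

# Decoded byte-for-byte (Latin-1), they become two characters --
# exactly the u'\xce\xb1' mojibake described above.
assert raw.decode("latin-1") == "\xce\xb1"
```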
I thought that in the first case PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I thought that PYTHONIOENCODING=utf-8 is the thing that also sets PyCF_SOURCE_IS_UTF8.
No. PyCF_SOURCE_IS_UTF8 is set unconditionally in the functions builtin_{eval,exec,compile}_impl in Python/bltinmodule.c in the cases that matter, AFAICS. It's not obvious to me under what conditions it might *not* be set. It is then consulted in Python/ast.c in PyAST_FromNodeObject, and nowhere else that I can see.
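In Python 3 the effect of that flag in the builtin compile path is observable from the top: compile() decodes a bytes source as UTF-8 unless a coding declaration overrides it (a sketch; the flag itself isn't exposed at the Python level):

```python
# Bytes source with no coding declaration is decoded as UTF-8,
# so the two bytes \xce\xb1 become the single character 'α'.
ns = {}
exec(compile(b"x = '\xce\xb1'", "<test>", "exec"), ns)
assert ns["x"] == "\u03b1"

# With an explicit coding declaration, the same two bytes decode
# as two separate characters instead.
ns = {}
exec(compile(b"# coding: latin-1\nx = '\xce\xb1'\n", "<test>", "exec"), ns)
assert ns["x"] == "\xce\xb1"
```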