Adam Bartoš writes:
Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std* streams are created with utf-8 encoding (which doesn't help on Windows since they still don't use ReadConsoleW and WriteConsoleW to communicate with the terminal) and after changing the sys.std* streams to the fixed ones and setting readline hook, it still doesn't work,
I don't see why you would expect it to work: either your code is bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't matter, or you're feeding already decoded text *as UTF-8* to your module which evidently expects something else (UTF-16LE?).
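To make the second possibility concrete, here is a minimal sketch of what happens when bytes produced under one encoding are read by a consumer that assumes another. The sample string and variable names are hypothetical, and the UTF-16LE consumer is only my guess at what your module expects.

```python
# Hypothetical sample text containing one non-ASCII character (GREEK SMALL LETTER PI).
text = "\u03c0 = 3.14"

# What a UTF-8 configured stream would hand over.
utf8_bytes = text.encode("utf-8")

# A consumer that agrees on the encoding recovers the text.
round_trip = utf8_bytes.decode("utf-8")

# A consumer that assumes UTF-16LE (an assumption, see above) sees something else.
# errors="replace" is used so the mismatch shows up as garbage instead of an exception.
mismatched = utf8_bytes.decode("utf-16-le", errors="replace")

print(repr(round_trip))
print(repr(mismatched))
```

Neither side is "wrong" here; the two ends simply disagree about what the bytes mean.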
so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
I don't think that flag does what you think it does. AFAICT from looking at the source, that flag gets unconditionally set in the execution context for compile, eval, and exec, and it is checked in the parser when creating an AST node. So it looks to me like it asserts that the *internal* representation of the program is UTF-8 *after* transforming the input to an internal representation (doing charset decoding, removing comments and line continuations, etc).
Regarding your environment, the repeated use of "custom" is a red flag. Unless you bundle your whole environment with the code you distribute, Python can know nothing about that. In general, Python doesn't know what encoding it is receiving text in.
Well, the received text comes from sys.stdin and its encoding is known.
How? You keep asserting this. *You* know, but how are you passing that information to *the Python interpreter*? Guido may have a time machine, but nobody claims the Python interpreter is telepathic.
Ideally, Python would receive the text as a Unicode string object so there would be no problem with encoding
Forget "ideal". Python 3 was created (among other reasons) to get closer to that ideal. But programs in Python 2 are received as str, which is bytes in an ASCII-compatible encoding, not unicode, unless otherwise specified by PYTHONIOENCODING or a coding cookie in a source file. As far as I know, those are the only ways to specify source encoding. This specification of "Python program" isn't going to change in Python 2; that's one of the major unfixable reasons that Python 2 and Python 3 will be incompatible forever.
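As an illustration of the coding cookie mechanism, the sketch below asks the standard library which encoding a chunk of source bytes declares. It uses Python 3's `tokenize.detect_encoding`, so treat it as an analogy only: Python 3 falls back to UTF-8 when there is no cookie, whereas Python 2 falls back to ASCII. The helper `declared_encoding` and the sample sources are hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding the tokenizer infers from a BOM or coding cookie."""
    encoding, _consumed_lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


# The byte 0xE9 only means "e with acute accent" because the cookie says Latin-1.
with_cookie = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
# Pure ASCII source with no cookie: the tokenizer falls back to its default.
without_cookie = b"name = 'e'\n"

cookie_encoding = declared_encoding(with_cookie)
default_encoding = declared_encoding(without_cookie)

# Only after the encoding is known can the bytes be turned into text.
decoded_source = with_cookie.decode(cookie_encoding)
```

The point is that the encoding travels *with the source* as an explicit declaration; nothing is inferred from where the bytes happened to come from.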
The custom stdio streams and readline hooks are set at runtime by code in sitecustomize. It does not affect IDLE and it is compatible with IPython. I would also like to set PyCF_SOURCE_IS_UTF8 at runtime from Python, e.g. via ctypes. But this may be impossible.
Yes. In the latter case, eval has no idea how the bytes given are encoded.
Eval *never* knows how bytes are encoded, not even implicitly. That's one of the important reasons why Python 3 was necessary. I think you know that, but you don't write like you understand the implications for your current work, which makes it hard to communicate.
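A small sketch of why this matters, written for Python 3 so that the decoding step is explicit (in Python 2 the same ambiguity is hidden inside str). The byte string below is a hypothetical input; the same bytes evaluate to different text depending on which encoding the caller decides to assume before handing them to eval.

```python
# A quoted string literal as raw bytes; 0xC3 0xA9 is UTF-8 for "e with acute accent".
raw = b"'\xc3\xa9'"

# The caller, not eval, chooses how to interpret the bytes.
as_utf8 = eval(raw.decode("utf-8"))
as_latin1 = eval(raw.decode("latin-1"))

# One character under the first assumption, two under the second.
print(len(as_utf8), len(as_latin1))
```

Both calls succeed without complaint, which is exactly the problem: nothing in the bytes themselves tells eval which reading was intended.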