[Python-Dev] Unicode literals in Python 2.7

Stephen J. Turnbull stephen at xemacs.org
Fri May 1 06:14:02 CEST 2015

Adam Bartoš writes:

 > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
 > sys.std* streams are created with utf-8 encoding (which doesn't
 > help on Windows since they still don't use ReadConsoleW and
 > WriteConsoleW to communicate with the terminal) and after changing
 > the sys.std* streams to the fixed ones and setting readline hook,
 > it still doesn't work,

I don't see why you would expect it to work: either your code is
bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
matter, or you're feeding already decoded text *as UTF-8* to your
module which evidently expects something else (UTF-16LE?).
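
A minimal sketch of that second failure mode, assuming the consumer
really does expect UTF-16LE (the thread only guesses at this). The
names are hypothetical and the snippet is written for Python 3:

```python
# Hypothetical illustration; none of these names come from the thread.
text = u"\u03b1"                       # GREEK SMALL LETTER ALPHA

# The producer encodes the text as UTF-8: two bytes, 0xCE 0xB1.
utf8_bytes = text.encode("utf-8")

# A consumer that assumes UTF-16LE reads those two bytes as a single
# little-endian code unit, 0xB1CE, which is an unrelated character.
misread = utf8_bytes.decode("utf-16-le")


def decodes_as_utf16le(data):
    """Return True if data is well-formed UTF-16LE, False otherwise."""
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


# An odd number of bytes cannot be UTF-16 at all, so this case fails
# loudly instead of silently producing the wrong character.
odd_length = u"\u03b1b".encode("utf-8")    # three bytes

print(ascii(misread))
print(decodes_as_utf16le(odd_length))
```

The even-length case is the dangerous one: nothing raises, and the
text is simply wrong.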

 > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.

I don't think that flag does what you think it does.  AFAICT from
looking at the source, that flag gets unconditionally set in the
execution context for compile, eval, and exec, and it is checked in
the parser when creating an AST node.  So it looks to me like it
asserts that the *internal* representation of the program is UTF-8
*after* transforming the input to an internal representation (doing
charset decoding, removing comments and line continuations, etc).

 > > Regarding your environment, the repeated use of "custom" is a red
 > > flag.  Unless you bundle your whole environment with the code you
 > > distribute, Python can know nothing about that.  In general, Python
 > > doesn't know what encoding it is receiving text in.
 > Well, the received text comes from sys.stdin and its encoding is
 > known.

How?  You keep asserting this.  *You* know, but how are you passing
that information to *the Python interpreter*?  Guido may have a time
machine, but nobody claims the Python interpreter is telepathic.

 > Ideally, Python would receive the text as Unicode String object so
 > there would be no problem with encoding

Forget "ideal".  Python 3 was created (among other reasons) to get
closer to that ideal.  But programs in Python 2 are received as str,
which is bytes in an ASCII-compatible encoding, not unicode (unless
otherwise specified by PYTHONIOENCODING or a coding cookie in a source
file, and as far as I know those are the only ways to specify source
encoding).  This specification of "Python program" isn't going to
change in Python 2; that's one of the major unfixable reasons that
Python 2 and Python 3 will be incompatible forever.
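
To make the coding-cookie mechanism concrete, here is a small sketch
with a hypothetical helper, literal_under_cookie. It is written for
Python 3, where compile() accepts bytes and honours the cookie. Under
Python 2.7 the details differ (as far as I know, a plain str literal
keeps its raw bytes and only u'' literals are decoded), so treat this
as an illustration of the idea rather than of 2.7 behaviour:

```python
def literal_under_cookie(cookie, raw):
    """Compile a tiny module whose string literal holds the bytes `raw`,
    declared with the given coding cookie, and return the resulting str.

    Hypothetical helper for illustration only.
    """
    header = ("# -*- coding: %s -*-\n" % cookie).encode("ascii")
    source = header + b"s = '" + raw + b"'\n"
    namespace = {}
    exec(compile(source, "<cookie-demo>", "exec"), namespace)
    return namespace["s"]


# The same single byte, 0xA4, means different characters depending on
# what the cookie declares.
as_latin1 = literal_under_cookie("latin-1", b"\xa4")      # CURRENCY SIGN
as_latin9 = literal_under_cookie("iso-8859-15", b"\xa4")  # EURO SIGN

print(ascii(as_latin1), ascii(as_latin9))
```

The bytes never change; only the declaration in the source does, and
that declaration is the only thing the compiler has to go on.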

 > The custom stdio streams and readline hooks are set at runtime by
 > code in sitecustomize. It does not affect IDLE and it is compatible
 > with IPython. I would like to also set PyCF_SOURCE_IS_UTF8 at
 > runtime from Python e.g. via ctypes. But this may be impossible.

 > Yes. In the latter case, eval has no idea how the bytes given are
 > encoded.

Eval *never* knows how bytes are encoded, not even implicitly.  That's
one of the important reasons why Python 3 was necessary.  I think you
know that, but you don't write like you understand the implications
for your current work, which makes it hard to communicate.
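
A short sketch of what that means in practice, run under Python 3
(where eval() accepts bytes and, absent a cookie, assumes UTF-8; this
default is a Python 3 rule, not something eval deduces from the data).
The variable names are hypothetical:

```python
# A quoted literal containing two non-ASCII bytes.  Nothing in the
# bytes themselves says which encoding produced them.
raw = b"'\xc3\xa9'"

# Handing the bytes straight to eval: Python 3 falls back to UTF-8.
assumed_utf8 = eval(raw)

# If the caller knows the encoding, the caller has to apply it by
# decoding first; eval then receives text and no guessing is needed.
decoded_utf8 = eval(raw.decode("utf-8"))
decoded_latin1 = eval(raw.decode("latin-1"))

print(ascii(assumed_utf8), ascii(decoded_utf8), ascii(decoded_latin1))
```

Both decodings succeed, and they disagree. The knowledge of which one
is right lives with whoever produced the bytes, so it has to be applied
before the text reaches eval.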
