[IPython-dev] String encoding
takowl at gmail.com
Thu Jun 16 19:51:55 EDT 2011
Jörgen's found that there's still a unicode problem in trunk, which I don't
think can be completely resolved, but we'd like to get some more opinions on it.
The issue is that, while we can parse Python code as either unicode or
bytes, there's no way to indicate what encoding is used in either case. So:
1. If we parse as unicode, non-ascii characters which occur inside byte
literals are always interpreted as bytes by encoding with utf-8. This is
what we now do in trunk.
2. If we parse as bytes, bytes occurring inside unicode literals are always
interpreted as characters by decoding with cp1252.
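To see why option 1's assumption matters, here's a minimal Python 3 sketch of the round-trip it performs (illustrative only; the actual parsing happens in Python 2's compile machinery): if the terminal sent cp1252 bytes but byte literals are always re-encoded as utf-8, the stored bytes differ from the bytes the user typed.

```python
# -*- coding: utf-8 -*-
typed = "å"  # what the user sees on screen

# Bytes a cp1252 terminal actually sends for 'å':
sent_cp1252 = typed.encode("cp1252")   # b'\xe5'

# Bytes that option 1 puts into the byte literal (always utf-8):
stored_utf8 = typed.encode("utf-8")    # b'\xc3\xa5'

# They differ, so the user's byte literal doesn't contain
# the bytes they typed:
print(sent_cp1252, stored_utf8)
```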
The necessary parameter is not going to be included in Python, because it's
not a problem in Python 3: http://bugs.python.org/issue5911
The practical upshot is platform dependent:
1 will always work correctly when the user enters u"åäö" (unicode literals),
and will coincidentally work with byte literals where the terminal uses
utf-8 (as I believe most Linux and Mac terminals do).
2 will always work as you'd expect when the user enters "åäö" (byte
literals), and will coincidentally work with unicode literals where the
terminal uses cp1252 (Windows computers in English speaking countries and
parts of Europe).
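"Coincidentally work" can be made concrete with a Python 3 sketch (assumed for illustration): decoding with the codec that matches the terminal round-trips correctly, while a mismatch produces either mojibake or an outright decode error.

```python
utf8_bytes = "å".encode("utf-8")     # b'\xc3\xa5', from a utf-8 terminal
cp1252_bytes = "å".encode("cp1252")  # b'\xe5', from a cp1252 terminal

# Matching codec: round-trips correctly.
assert utf8_bytes.decode("utf-8") == "å"
assert cp1252_bytes.decode("cp1252") == "å"

# Mismatched codec: classic mojibake...
print(utf8_bytes.decode("cp1252"))   # 'Ã¥'

# ...or an outright failure, since a lone 0xe5 is not valid utf-8.
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid utf-8:", e.reason)
```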
I'm fairly confident that 1 is the better approach - it's critical that you
can enter those characters as part of a unicode string, but less so that you
can enter them in a byte string. Also, most of our users are on Linux or
Mac, so it should 'just work' for more people.
The question is whether we want to try to include a workaround for other
cases. Where stdin_encoding is cp1252, we could encode each cell as cp1252
before parsing it. That would then behave as expected in the two commonest
situations (UTF-8 unix terminals and cp1252 windows terminals). On the
downside, it's a horrendously ugly hack, and I don't know what it would do
on Python implementations other than CPython. Jörgen also suggested making it an option,
although I wonder how many people will ever find it.
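The workaround could look something like the sketch below. This is purely hypothetical — the function name and signature are illustrative, not IPython's actual API, and it assumes Python 2 parsing semantics downstream.

```python
def prepare_cell(cell, stdin_encoding):
    """Hypothetical sketch of the proposed workaround: hand the parser
    cp1252 bytes when that's what the terminal speaks, so byte literals
    keep the bytes the user actually typed."""
    if stdin_encoding.lower() in ("cp1252", "windows-1252"):
        # Encode the whole cell back into the terminal's encoding
        # before parsing, instead of parsing it as unicode.
        return cell.encode("cp1252")
    # Everywhere else, keep the current behaviour: parse as unicode.
    return cell
```

For example, prepare_cell('s = "å"', "cp1252") would return the bytes b's = "\xe5"', while any other encoding leaves the cell untouched.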
My own inclination is simply to say that non-ascii characters will be
interpreted correctly in unicode literals, but their behaviour in byte
literals is undefined, and you should use the '\xe9' notation to write bytes
above 127. Note that Python 3 actually enforces this rule: b"ö" is a
SyntaxError. But I'd like to collect some more thoughts, or see if we can
come up with a way to avoid the problem (short of writing our own parser).
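A quick Python 3 demonstration of the rule being enforced (run under Python 3, where the exact SyntaxError message varies by version):

```python
# Escape notation is fine: the literal contains only ascii characters.
code = compile(r'b"\xe9"', "<cell>", "eval")
print(eval(code))                    # b'\xe9'

# A raw non-ascii character inside a bytes literal is rejected.
try:
    compile('b"ö"', "<cell>", "eval")
except SyntaxError as e:
    print("SyntaxError:", e.msg)
```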
Thanks for reading this - it took me a while to properly understand the
problem, so I hope I've explained it clearly.