[IPython-dev] String encoding

Fri Jun 17 00:08:44 EDT 2011

Thomas,

Thanks for summarizing this.  Is this the type of thing where we would
want to add a config=True attribute to InteractiveShell to control the
behavior?  We could default to (1), but allow (2) if needed.

Cheers,

Brian

On Thu, Jun 16, 2011 at 4:51 PM, Thomas Kluyver <takowl at gmail.com> wrote:
> Jörgen's found that there's still a unicode problem in trunk, which I don't
> think can be completely resolved, but we'd like to get some more opinions on
> it.
>
> The issue is that, while we can parse Python code as either unicode or
> bytes, there's no way to indicate what encoding is used in either case. So:
> 1. If we parse as unicode, non-ascii characters which occur inside byte
> literals are always interpreted as bytes by encoding with utf-8. This is
> what we now do in trunk.
> 2. If we parse as bytes, bytes occurring inside unicode literals are always
> interpreted as characters by decoding with cp1252.
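A minimal sketch of the two cases, written in Python 3 terms where str stands in for Python 2's unicode type. It does not call IPython or the Python 2 parser; it only reproduces the encode and decode steps described above, with the encodings named in the message. The variable names are illustrative.

```python
# Python 3 sketch: str here plays the role of Python 2's unicode type.
typed = "ö"

# Case 1: the cell is parsed as unicode, so a character inside a byte
# literal ends up as its UTF-8 bytes, whatever the terminal used.
case1_bytes = typed.encode("utf-8")
print(case1_bytes)  # b'\xc3\xb6'

# Case 2: the cell is parsed as bytes, so the bytes inside a unicode
# literal are decoded as cp1252. Coming from a UTF-8 terminal, the two
# UTF-8 bytes turn into two unrelated characters.
case2_text = typed.encode("utf-8").decode("cp1252")
print(case2_text)  # Ã¶
```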
>
> The necessary parameter is not going to be included in Python, because it's
> not a problem in Python 3: http://bugs.python.org/issue5911
>
> The practical upshot is platform dependent:
> 1 will always work correctly when the user enters u"åäö" (unicode literals),
> and will coincidentally work with byte literals where the terminal uses
> utf-8 (as I believe most Linux and Mac terminals do).
> 2 will always work as you'd expect when the user enters "åäö" (bytes
> literals), and will coincidentally work with unicode literals where the
> terminal uses cp1252 (Windows computers in English speaking countries and
> parts of Europe).
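The platform dependence can be checked with a small simulation. This is only a sketch under the assumptions stated in the message (option 1 always encodes with UTF-8, option 2 always decodes with cp1252); the helper names are hypothetical and nothing here touches the real parser.

```python
def option1_byte_literal(text):
    # Parse as unicode: characters in a byte literal become UTF-8 bytes.
    return text.encode("utf-8")

def option2_unicode_literal(raw):
    # Parse as bytes: bytes in a unicode literal are decoded as cp1252.
    return raw.decode("cp1252")

text = "åäö"
results = {}
for terminal in ("utf-8", "cp1252"):
    raw = text.encode(terminal)  # the bytes this terminal would send
    # Option 1 always handles unicode literals; the open question is
    # whether its byte literal matches what the terminal sent.
    bytes_ok = option1_byte_literal(text) == raw
    # Option 2 always handles byte literals; the open question is
    # whether its unicode literal matches what the user typed.
    unicode_ok = option2_unicode_literal(raw) == text
    results[terminal] = (bytes_ok, unicode_ok)
    print(terminal, "| option 1 byte literal ok:", bytes_ok,
          "| option 2 unicode literal ok:", unicode_ok)
```

On a UTF-8 terminal only option 1 gets the remaining case right, and on a cp1252 terminal only option 2 does, which matches the summary above.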
>
> I'm fairly confident that 1 is the better approach - it's critical that you
> can enter those characters as part of a unicode string, but less so that you
> can enter them in a byte string. Also, most of our users are on Linux or
> Mac, so it should 'just work' for more people.
>
> The question is whether we want to try to include a workaround for other
> cases. Where stdin_encoding is cp1252, we could encode each cell as cp1252
> before parsing it. That would then behave as expected in the two commonest
> situations (UTF-8 unix terminals and cp1252 windows terminals). On the
> downside, it's a horrendously ugly hack, and I don't know what it would do
> on platforms other than CPython. Jörgen also suggested making it an option,
> although I wonder how many people will ever find it.
>
> My own inclination is simply to say that non-ascii characters will be
> interpreted correctly in unicode literals, but their behaviour in byte
> literals is undefined, and you should use the '\xe9' notation to write bytes
> above 127. Note that Python 3 actually enforces this rule: b"ö" is a
> SyntaxError. But I'd like to collect some more thoughts, or see if we can
> come up with a way to avoid the problem (short of writing our own parser).
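Both points can be seen in a short Python 3 session. This is a sketch of standard interpreter behaviour, not of anything IPython does; the "<cell>" filename passed to compile() is just a placeholder.

```python
# The escape spells the byte out explicitly, so no encoding is guessed.
explicit = b"\xe9"
print(len(explicit), hex(explicit[0]))  # 1 0xe9

# A non-ASCII character inside a bytes literal is rejected by Python 3
# at compile time.
try:
    compile('b"ö"', "<cell>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)  # True
```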
>
> Thanks for reading this - it took me a while to properly understand the
> problem, so I hope I've explained it clearly.
>
> Thomas

-- 
Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu and ellisonbg at gmail.com