Jörgen's found that there's still a Unicode problem in trunk, which I don't think can be completely resolved, but we'd like to get some more opinions on it.<br><br>The issue is that, while we can parse Python code as either unicode or bytes, there's no way to tell the parser which encoding to assume in either case. So:<br>


1. If we parse as unicode, non-ASCII characters that occur inside byte literals are always turned into bytes by encoding them as UTF-8. This is what we now do in trunk.<br>2. If we parse as bytes, non-ASCII bytes that occur inside unicode literals are always turned into characters by decoding them as cp1252.<br>


<br>The encoding parameter we would need is not going to be added to Python, because this isn't a problem in Python 3: <a href="http://bugs.python.org/issue5911">http://bugs.python.org/issue5911</a><br><br>The practical upshot is platform dependent:<br>


Approach 1 will always work correctly when the user enters u"åäö" (unicode literals), and will coincidentally work with byte literals where the terminal uses UTF-8 (as I believe most Linux and Mac terminals do).<br>Approach 2 will always work as you'd expect when the user enters "åäö" (bytes literals), and will coincidentally work with unicode literals where the terminal uses cp1252 (Windows computers in English-speaking countries and parts of Europe).<br>
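
To make the platform dependence concrete, here is a small sketch of what the two encodings do to the same three characters. It is written for Python 3 purely as an illustration (the problem itself is in Python 2), and the variable names are mine rather than anything in IPython.

```python
# Illustration only: the same text as a UTF-8 terminal and a cp1252
# terminal would send it.
text = "åäö"

utf8_bytes = text.encode("utf-8")      # two bytes per character here
cp1252_bytes = text.encode("cp1252")   # one byte per character

print(utf8_bytes)    # b'\xc3\xa5\xc3\xa4\xc3\xb6'
print(cp1252_bytes)  # b'\xe5\xe4\xf6'

# Approach 1 fills byte literals with the UTF-8 bytes, so the result only
# matches what the user typed if their terminal also uses UTF-8.
# Approach 2 decodes as cp1252, so input from a UTF-8 terminal comes out
# as mojibake instead of the intended characters:
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)      # Ã¥Ã¤Ã¶

# Input from a cp1252 terminal round-trips correctly under approach 2.
assert cp1252_bytes.decode("cp1252") == text
```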


<br>I'm fairly confident that 1 is the better approach - it's critical that you can enter those characters as part of a unicode string, but less so that you can enter them in a byte string. Also, most of our users are on Linux or Mac, so it should 'just work' for more people.<br>


<br>The question is whether we want to try to include a workaround for other cases. Where stdin_encoding is cp1252, we could encode each cell as cp1252 before parsing it. That would then behave as expected in the two most common situations (UTF-8 Unix terminals and cp1252 Windows terminals). On the downside, it's a horrendously ugly hack, and I don't know what it would do on platforms other than CPython. Jörgen also suggested making it an option, although I wonder how many people will ever find it.<br>
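
A rough sketch of that workaround might look like the following. The helper name prepare_cell and the fallback for characters outside cp1252 are assumptions of mine, not existing IPython code. It only decides what would be handed to the parser, and it doesn't compile anything, since compile() in Python 3 treats bytes differently from Python 2.

```python
def prepare_cell(cell, stdin_encoding):
    """Hypothetical helper: pick the form of a cell to hand to the parser.

    On cp1252 terminals, return bytes so that the parser's cp1252
    assumption matches the terminal. Everywhere else, return the
    unicode text unchanged (approach 1).
    """
    if stdin_encoding and stdin_encoding.lower() == "cp1252":
        try:
            return cell.encode("cp1252")
        except UnicodeEncodeError:
            # The cell contains characters cp1252 cannot represent,
            # so fall back to parsing it as unicode.
            return cell
    return cell


print(prepare_cell('s = "åäö"', "cp1252"))  # bytes, one byte per character
print(prepare_cell('s = "åäö"', "UTF-8"))   # unchanged text
```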


<br>My own inclination is simply to say that non-ASCII characters will be interpreted correctly in unicode literals, but their behaviour in byte literals is undefined, and you should use the '\xe9' notation to write bytes above 127. Note that Python 3 actually enforces this rule: b"ö" is a SyntaxError. But I'd like to collect some more thoughts, or see if we can come up with a way to avoid the problem (short of writing our own parser).<br>
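
For anyone who wants to check the Python 3 behaviour, this short snippet (Python 3 only; the compiles helper is just for illustration) shows the literal being rejected and the escaped form being accepted:

```python
def compiles(source):
    """Return True if `source` parses as a Python 3 expression."""
    try:
        compile(source, "<cell>", "eval")
        return True
    except SyntaxError:
        return False


# A non-ASCII character directly inside a bytes literal is rejected...
print(compiles('b"ö"'))      # False

# ...while the escaped form is accepted and is unambiguous.
print(compiles(r'b"\xf6"'))  # True
escaped_value = eval(r'b"\xf6"')
assert escaped_value == b"\xf6"
```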


<br>Thanks for reading this - it took me a while to properly understand the problem, so I hope I've explained it clearly.<br><br>Thomas<br>