On Tue, Jul 27, 2010 at 7:42 AM, Sebastian Haase wrote:
The origin of this problem is the fact that Python supports (at least) 2 types of Unicode: 2 bytes and/or 4 bytes per character.
It only supports those two, and that's purely an internal implementation detail. Python can encode unicode in many encodings, but *internally* it has to have some representation of its own, and it can use ucs2 or ucs4. Which one to use is a compile-time flag: --enable-unicode[=ucs[24]]
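A quick way to see which internal representation a given interpreter was built with is to look at sys.maxunicode. This is only a minimal sketch: the helper name unicode_build is made up for illustration, and it assumes the usual values of 0xFFFF for a ucs2 (narrow) build and 0x10FFFF for a ucs4 (wide) build. On Python 3.3 and later the narrow/wide distinction is gone, so it will always report ucs4 there.

```python
import sys


def unicode_build():
    """Return 'ucs2' for a narrow build and 'ucs4' for a wide one.

    Hypothetical helper, not part of Python itself.
    """
    # Narrow (ucs2) builds report 0xFFFF as the largest code point;
    # wide (ucs4) builds report 0x10FFFF.
    return "ucs2" if sys.maxunicode == 0xFFFF else "ucs4"


print(unicode_build())
```

Extension modules compiled against one kind of build will generally not load in the other, which is why it can be worth checking this before mixing a python.org build with distribution packages.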
Additionally, for some incomprehensible reason the Python source code (as downloaded from python.org) defaults to 2ByteUnicode whereas all (major) Linux distributions default to 4ByteUnicode.....
The reason is that many systems (Java, Windows, Qt natively - http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_envir...) use utf-16 as their native encoding, and ucs2 is a subset of utf-16, so in many environments that makes interoperability easier. But ucs2 can not encode all of unicode, while ucs4 can, so Linux distributions choose to use ucs4 as their internal encoding to ensure that all unicode code points can be encoded in python.

This email from Guido explains his position on leaving the ucs2/4 choice up to packagers:

http://mail.python.org/pipermail/python-dev/2008-July/080892.html

The official Python 2.x unicode story is well explained here:

http://docs.python.org/howto/unicode.html

and here is the corresponding document for 3.x:

http://docs.python.org/release/3.1.2/howto/unicode.html

Joel Spolsky has a very nice introduction to the main ideas behind unicode:

http://www.joelonsoftware.com/articles/Unicode.html

and Matthew Brett has a nice and more concise set of notes on the matter:

https://cirl.berkeley.edu/mb312/pydagogue/introducing_unicode.html
https://cirl.berkeley.edu/mb312/pydagogue/python_unicode.html

I should note that anyone who is thinking of porting any non-trivial amount of code from python 2.x to 3.x will save a lot of time and frustration by spending just a couple of hours reading and understanding the above. It's not that much work, and if you don't understand how Python thinks of strings, you're very likely to make a painful mess in such a code transition effort.

I know that the few hours I put into reading the above have already paid off tremendously for us with the zeromq/ipython codebase.

Cheers,

f