What encoding does u'...' syntax use?
rhamph at gmail.com
Sat Feb 21 21:39:22 CET 2009
On Feb 21, 10:48 am, a... at pythoncraft.com (Aahz) wrote:
> In article <499F397C.7030... at v.loewis.de>,
> "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> >> Yes, I know that. But every concrete representation of a unicode string
> >> has to have an encoding associated with it, including unicode strings
> >> produced by the Python parser when it parses the ASCII string "u'\xb5'".
> >> My question is: what is that encoding?
> >The internal representation is either UTF-16, or UTF-32; which one is
> >a compile-time choice (i.e. when the Python interpreter is built).
> Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
> countless threads about the distinction between UTF and UCS?
Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer
to Unicode 1.1 and earlier, with no surrogates. We target Unicode
If you naively encode UCS-2 as UTF-8 you really end up with CESU-8.
You miss the step where you combine surrogate pairs (which only exist
in UTF-16) into a single supplementary character. Lo and behold,
that's actually what current Python does in some places. It's not
See bugs #3297 and #3672.
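
To make the CESU-8 point above concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter (not the 2009-era builds discussed in this thread), and naive_cesu8 is a hypothetical helper written only for illustration, not anything in the standard library. It encodes each UTF-16 code unit to UTF-8 on its own, skipping the step that combines surrogate pairs.

```python
def naive_cesu8(text):
    """Encode each UTF-16 code unit separately as UTF-8 bytes.

    Surrogate pairs are NOT combined first, which is the mistake a
    UCS-2-minded encoder makes. For supplementary characters the result
    is CESU-8 style output rather than UTF-8.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" allows a lone surrogate to be written in the
        # 3-byte form; the default strict handler would raise instead.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U00010000"   # first code point outside the BMP
bmp_only = "\xb5"              # MICRO SIGN, fits in one UTF-16 code unit

print(supplementary.encode("utf-8"))   # 4 bytes: f0 90 80 80
print(naive_cesu8(supplementary))      # 6 bytes: ed a0 80 ed b0 80
print(naive_cesu8(bmp_only) == bmp_only.encode("utf-8"))  # True
```

For BMP-only text the two encodings agree, which is presumably why this kind of mistake can go unnoticed until a supplementary character shows up.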