What encoding does u'...' syntax use?

Sat Feb 21 15:39:22 EST 2009

On Feb 21, 10:48 am, a... at pythoncraft.com (Aahz) wrote:
> In article <499F397C.7030... at v.loewis.de>,
>
> "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> >> Yes, I know that.  But every concrete representation of a unicode string
> >> has to have an encoding associated with it, including unicode strings
> >> produced by the Python parser when it parses the ascii string "u'\xb5'"
>
> >> My question is: what is that encoding?
>
> >The internal representation is either UTF-16, or UTF-32; which one is
> >a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4?  Or am I misremembering the
> countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug.  UCS-2/UCS-4 refer
to Unicode 1.1 and earlier, with no surrogates.  We target Unicode
5.1.
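
To make the surrogate distinction concrete, here is a small sketch (Python 3 syntax, which is my assumption; the thread is about 2.x builds) that encodes one character outside the BMP both ways. The 16-bit form needs two code units, a surrogate pair, while the 32-bit form needs one:

```python
import sys

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

utf16 = ch.encode("utf-16-be")  # two 16-bit code units: a surrogate pair
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(utf16.hex())  # d834dd1e
print(utf32.hex())  # 0001d11e

# 0xffff on a narrow build, 0x10ffff on a wide build (and always on 3.3+)
print(hex(sys.maxunicode))
```

A strict UCS-2 system has no way to represent that character at all; treating D834 DD1E as a pair is exactly what makes the storage UTF-16.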

If you naively encode UCS-2 as UTF-8, you end up with CESU-8: you
skip the step that combines surrogate pairs (which exist only in
UTF-16) into a single supplementary character.  Lo and behold,
that's actually what current Python does in some places.  It's not
pretty.
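
Here is a rough illustration of the difference, again in Python 3 syntax. The helper name cesu8_encode is made up for this sketch and is not a stdlib codec; it just mimics the naive approach of encoding each 16-bit code unit separately:

```python
def cesu8_encode(text):
    """Hypothetical helper: encode each UTF-16 code unit as its own
    UTF-8 sequence, without recombining surrogate pairs (CESU-8)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each half.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U0001D11E"
utf8 = ch.encode("utf-8")   # proper UTF-8: one 4-byte sequence
cesu8 = cesu8_encode(ch)    # CESU-8: two 3-byte sequences

print(utf8.hex())   # f09d849e
print(cesu8.hex())  # eda0b4edb49e
```

A strict UTF-8 decoder rejects the six-byte form, which is why the two can't be mixed freely.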

See bugs #3297 and #3672.