What encoding does u'...' syntax use?

Fri Feb 20 18:19:43 EST 2009

>> >>> u'\xb5'
>> u'\xb5'
>> >>> print u'\xb5'
>> µ
> 
> Unicode literals are *in the source file*, which can only have one
> encoding (for a given source file).
> 
>> (That last character shows up as a micron sign despite the fact that
>> my default encoding is ascii, so it seems to me that that unicode
>> string must somehow have picked up a latin-1 encoding.)
> 
> I think latin-1 was the default without a coding cookie line.  (May be
> utf-8 in 3.0).

It is, but that's irrelevant for the example. In the source

  u'\xb5'

all characters are ASCII (i.e. all of "letter u", "single
quote", "backslash", "letter x", "letter b", "digit 5").
As a consequence, this source text has the same meaning in all
supported source encodings (as source encodings must be ASCII
supersets).
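
To illustrate the point, here is a minimal sketch (the names source,
encodings, encoded and so on are mine, chosen only for this example).
It takes the source text of the literal as typed, checks that every
character is ASCII, and checks that encoding it under a few
ASCII-superset encodings gives identical bytes. The list of encodings
is just a sample, not the full set Python supports.

```python
# The source text of the literal, exactly as typed in the file:
# u, quote, backslash, x, b, 5, quote.
source = "u'\\xb5'"

# Every character has a code below 128, so the text is pure ASCII.
all_ascii = all(ord(ch) < 128 for ch in source)

# A sample of ASCII-superset encodings (an arbitrary selection).
encodings = ["ascii", "latin-1", "utf-8", "cp1252"]

# Encode the same source text under each encoding.
encoded = dict((name, source.encode(name)) for name in encodings)

# If the bytes are identical everywhere, the declared source encoding
# cannot change what this literal means.
same_bytes = len(set(encoded.values())) == 1

print("all ASCII: %s, same bytes: %s" % (all_ascii, same_bytes))
```

Had the file contained the micro sign itself between the quotes
rather than the \xb5 escape, the bytes would differ between encodings,
and the coding declaration would matter.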

The Unicode literal shown here does not get its interpretation
from Latin-1. Instead, it directly gets its interpretation from
the Unicode coded character set. The string is a short-hand
for

  u'\u00b5'

and this denotes character U+00B5 (just as u'\u20ac' denotes
U+20AC; the same holds for any other u'\uXXXX').
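
A quick way to check this, as a sketch (the variable names are mine;
the u'' prefix is accepted by Python 2 and by Python 3.3 and later):

```python
import unicodedata

short_form = u'\xb5'    # the \xXX escape in a Unicode literal
long_form = u'\u00b5'   # the equivalent \uXXXX escape

# Both spellings produce the same one-character string.
same = (short_form == long_form)

# The escape names a code point directly; no byte encoding is involved.
code_point = ord(short_form)

# The Unicode character database gives the character's name.
name = unicodedata.name(short_form)

# The same holds for any other \uXXXX escape, e.g. U+20AC.
euro_code_point = ord(u'\u20ac')

print("same: %s, code point: U+%04X, name: %s" % (same, code_point, name))
```

That the first 256 code points of Unicode coincide with Latin-1 is
why it can look as if Latin-1 were involved.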

HTH,
Martin