What encoding does u'...' syntax use?
Ron Garret
rNOSPAMon at flownet.com
Fri Feb 20 19:10:51 EST 2009
In article <499F397C.7030404 at v.loewis.de>,
"Martin v. Löwis" <martin at v.loewis.de> wrote:
> > Yes, I know that. But every concrete representation of a unicode string
> > has to have an encoding associated with it, including unicode strings
> > produced by the Python parser when it parses the ascii string "u'\xb5'".
> >
> > My question is: what is that encoding?
>
> The internal representation is either UTF-16, or UTF-32; which one is
> a compile-time choice (i.e. when the Python interpreter is built).
>
> > Put this another way: I would have thought that when the Python parser
> > parses "u'\xb5'" it would produce the same result as calling
> > unicode('\xb5'), but it doesn't.
>
> Right. In the former case, \xb5 denotes a Unicode character, namely
> U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
> as u"\N{MICRO SIGN}". By "the same", I mean "the very same".
>
> OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
> byte string with length 1, with a single byte with the numeric
> value 0xb5, or 181. It does not, per se, denote any specific character.
> It only gets a character meaning when you try to decode it to unicode,
> which you do with unicode('\xb5'). This is short for
>
> unicode('\xb5', sys.getdefaultencoding())
>
> and sys.getdefaultencoding() is (or should be) "ascii". Now, in
> ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
> a character at all), hence you get a UnicodeError.
>
> > Instead it seems to produce the same
> > result as calling unicode('\xb5', 'latin-1').
>
> Sure. However, this is only by coincidence, because latin-1 has the same
> code points as Unicode (for 0..255).
>
> > But my default encoding
> > is not latin-1, it's ascii. So where is the Python parser getting its
> > encoding from? Why does parsing "u'\xb5'" not produce the same error as
> > calling unicode('\xb5')?
>
> Because \xb5 *directly* refers to character U+00b5, with no
> byte-oriented encoding in-between.
>
> Regards,
> Martin
OK, I think I get it now. Thanks!
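
For the archive, here is a minimal sketch of the distinction as I now understand it. It assumes Python 3, where the unicode(s, enc) call discussed above is spelled s.decode(enc) on a bytes object, and where the implicit default-encoding decode of unicode('\xb5') has to be written out explicitly as 'ascii'. The variable names are only illustrative.

```python
import unicodedata

# Three spellings of the same one-character string (U+00B5, MICRO SIGN).
# The escape names a code point directly; no byte encoding is involved.
micro = u'\xb5'
same_spellings = (micro == u'\u00b5' == u'\N{MICRO SIGN}')

# A byte string of length 1 holding the value 0xb5 (181).
# On its own it does not denote any character.
raw = b'\xb5'

# Decoding as ASCII fails, because 0xb5 is outside the range 0..127.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding as latin-1 happens to yield U+00B5, because latin-1 maps each
# byte value 0..255 onto the Unicode code point with the same number.
latin1_matches = (raw.decode('latin-1') == micro)
latin1_is_identity = all(
    bytes([i]).decode('latin-1') == chr(i) for i in range(256)
)

print(same_spellings, ascii_failed, latin1_matches, latin1_is_identity)
print(unicodedata.name(micro), hex(ord(micro)))
```

So the match with latin-1 says nothing about which encoding the parser uses. The escape in the literal picks the code point by number, and latin-1 just happens to number its 256 characters the same way.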
rg