What encoding does u'...' syntax use?

Ron Garret rNOSPAMon at flownet.com
Sat Feb 21 01:10:51 CET 2009


In article <499F397C.7030404 at v.loewis.de>,
 "Martin v. Löwis" <martin at v.loewis.de> wrote:

> > Yes, I know that.  But every concrete representation of a unicode string 
> > has to have an encoding associated with it, including unicode strings 
> > produced by the Python parser when it parses the ASCII string "u'\xb5'".
> > 
> > My question is: what is that encoding?
> 
> The internal representation is either UTF-16, or UTF-32; which one is
> a compile-time choice (i.e. when the Python interpreter is built).
> 
> > Put this another way: I would have thought that when the Python parser 
> > parses "u'\xb5'" it would produce the same result as calling 
> > unicode('\xb5'), but it doesn't.
> 
> Right. In the former case, \xb5 denotes a Unicode character, namely
> U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
> as u"\N{MICRO SIGN}". By "the same", I mean "the very same".
> 
> OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
> byte string with length 1, with a single byte with the numeric
> value 0xb5, or 181. It does not, per se, denote any specific character.
> It only gets a character meaning when you try to decode it to unicode,
> which you do with unicode('\xb5'). This is short for
> 
>   unicode('\xb5', sys.getdefaultencoding())
> 
> and sys.getdefaultencoding() is (or should be) "ascii". Now, in
> ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
> a character at all), hence you get a UnicodeError.
> 
> > Instead it seems to produce the same 
> > result as calling unicode('\xb5', 'latin-1').
> 
> Sure. However, this is only by coincidence, because latin-1 has the same
> code points as Unicode (for 0..255).
> 
> > But my default encoding 
> > is not latin-1, it's ascii.  So where is the Python parser getting its 
> > encoding from?  Why does parsing "u'\xb5'" not produce the same error as 
> > calling unicode('\xb5')?
> 
> Because \xb5 *directly* refers to character U+00B5, with no
> byte-oriented encoding in-between.
> 
> Regards,
> Martin

OK, I think I get it now.  Thanks!
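
For the archive, here is a minimal sketch of the distinction as I now
understand it. It is written for Python 3, which is an assumption on my
part: there the u'...' prefix is still accepted, and unicode('\xb5', enc)
corresponds to b'\xb5'.decode(enc). The helper decode_or_none is a
hypothetical name used only for illustration.

```python
import unicodedata

# The three spellings name the same code point, U+00B5 MICRO SIGN.
# No byte-oriented encoding is involved in any of them.
micro = u'\xb5'
assert micro == u'\u00b5' == u'\N{MICRO SIGN}'
assert len(micro) == 1 and ord(micro) == 0xB5
assert unicodedata.name(micro) == 'MICRO SIGN'

# A byte string holds only the byte 0xb5; it denotes no character
# until it is decoded with some encoding.
raw = b'\xb5'


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xb5 is outside the ASCII range, so the ASCII decode fails.
ascii_result = decode_or_none(raw, 'ascii')

# latin-1 maps byte N to code point N, so the result happens to match.
latin1_result = decode_or_none(raw, 'latin-1')

# The coincidence holds for every byte value 0..255.
all_bytes = bytes(range(256))
latin1_matches_unicode = all_bytes.decode('latin-1') == ''.join(map(chr, range(256)))
```

So the escape in the literal is read as a code point number, while the
decode call has to be told how to interpret a byte, and only latin-1
happens to give the same answer.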

rg


