unicode string alteration

MRAB python at mrabarnett.plus.com
Thu Aug 12 19:31:29 CEST 2010


BAvant Garde wrote:
> HELP!!!
> I need help with a unicode issue that has me stumped. I must be doing 
> something  wrong because I don't believe this condition would have 
> slipped thru testing.
> 
> Wherever the string u'\udbff\udc00' occurs u'\U0010fc00' or 
> unichr(1113088) is substituted and the file loses 1 character resulting 
> in all trailing characters being shifted out of position. No other 
> corrupt strings have been detected.
>    
> The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04 
> where the maximum ord # is 1114111 (wide Python build).
>    
> Using Python 2.5.4 on Windows-ME where the maximum ord # is 65535 
> (narrow Python build) the string u'\U0010fc00' also occurs and it 
> "seems" that the substitution takes place but no characters are lost and 
> file sizes are ok. Note that ord(u'\U0010fc00') causes the following error:
>              "TypeError: ord() expected a character, but string of 
> length 2 found"
> The condition is otherwise invisible in 2.5.4 and is handled internally 
> without any apparent effect on processing with characters u'\udbff' and 
> u'\udc00' each being separately accessible.
> 
> The first part of the attachment repeats this email but also has 
> examples and illustrates other related oddities.
>    
> Any help would be greatly appreciated.
> 
It's not an error, it's a "surrogate pair". Surrogate pairs are part of
the Unicode specification.

Unicode codepoints go up to U+0010FFFF.

If you're using 16 bits per codepoint, like in a narrow build of Python,
then the codepoints above U+FFFF _can't_ be represented directly, so
each one is represented by a pair of 16-bit values called a "surrogate
pair": a high surrogate (U+D800..U+DBFF) followed by a low surrogate
(U+DC00..U+DFFF).
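
To make the arithmetic concrete, here is a minimal sketch of how a
surrogate pair maps to a codepoint and back, following the UTF-16
rules. The function names are my own for illustration, not anything
built into Python. It works on plain integers, so it behaves the same
on narrow and wide builds.

```python
def decode_surrogate_pair(high, low):
    """Combine a high and a low surrogate (integers) into one codepoint."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


cp = decode_surrogate_pair(0xDBFF, 0xDC00)
print(hex(cp), cp)  # 0x10fc00 1113088
```

That is exactly the u'\U0010fc00' / unichr(1113088) you reported for
u'\udbff\udc00'.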

If, on the other hand, you're using 32 bits per codepoint, like in a
wide build of Python, then the codepoints above U+FFFF _can_ be
represented directly, so surrogate pairs aren't needed and, indeed,
shouldn't be there.
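
If you want your code to know which kind of build it is running on,
sys.maxunicode tells you (65535 for narrow, 1114111 for wide, the same
"maximum ord" figures you quoted). A small sketch, with build_kind()
being a name I made up for the example:

```python
import sys


def build_kind():
    """Classify the running interpreter by its largest codepoint."""
    # 0xFFFF (65535) means 16-bit storage; otherwise it is 0x10FFFF.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print(sys.maxunicode, build_kind())
```

Note that Python 3.3 and later always report 1114111, so the
distinction only matters on older interpreters.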

What you're seeing in the wide build is Python replacing a surrogate
pair with the codepoint that it represents, which is actually the right
thing to do because, as I said, the surrogate pairs really shouldn't be
there.
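
This also accounts for the "lost" character: two 16-bit values become
one codepoint, so everything after it moves up by one position. The
sketch below imitates that merging on a list of integers. It is only
an illustration of the effect under the UTF-16 rules, not Python's
actual internal code, and merge_surrogates() is a hypothetical helper.

```python
def merge_surrogates(units):
    """Return codepoints for a list of 16-bit values, joining valid pairs."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one codepoint.
            out.append(0x10000 + ((u - 0xD800) << 10)
                       + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # Ordinary value, or an unpaired surrogate left as it is.
            out.append(u)
            i += 1
    return out


narrow = [0x41, 0xDBFF, 0xDC00, 0x42]   # 'A', the pair, 'B'
wide = merge_surrogates(narrow)
print(len(narrow), len(wide))  # 4 3
```

If your file format depends on fixed character positions, you would
need to count in one representation consistently on both platforms.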


