unicode string alteration
MRAB
python at mrabarnett.plus.com
Thu Aug 12 13:31:29 EDT 2010
BAvant Garde wrote:
> HELP!!!
> I need help with a unicode issue that has me stumped. I must be doing
> something wrong because I don't believe this condition would have
> slipped thru testing.
>
> Wherever the string u'\udbff\udc00' occurs u'\U0010fc00' or
> unichr(1113088) is substituted and the file loses 1 character resulting
> in all trailing characters being shifted out of position. No other
> corrupt strings have been detected.
>
> The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04
> where the maximum ord # is 1114111 (wide Python build).
>
> Using Python 2.5.4 on Windows-ME where the maximum ord # is 65535
> (narrow Python build) the string u'\U0010fc00' also occurs and it
> "seems" that the substitution takes place but no characters are lost and
> file sizes are ok. Note that ord(u'\U0010fc00') causes the following error:
> "TypeError: ord() expected a character, but string of
> length 2 found"
> The condition is otherwise invisible in 2.5.4 and is handled internally
> without any apparent effect on processing with characters u'\udbff' and
> u'\udc00' each being separately accessible.
>
> The first part of the attachment repeats this email but also has
> examples and illustrates other related oddities.
>
> Any help would be greatly appreciated.
>
It's not an error; it's a "surrogate pair". Surrogate pairs are part of
the Unicode specification.
Unicode codepoints go up to U+0010FFFF.
If you're using 16 bits per codepoint, like in a narrow build of Python,
then the codepoints above U+FFFF _can't_ be represented directly, so
each one is represented by two 16-bit values called a "surrogate pair":
a high surrogate (U+D800..U+DBFF) followed by a low surrogate
(U+DC00..U+DFFF).
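
To make the arithmetic concrete, here is a minimal sketch of how a
codepoint above U+FFFF maps to a surrogate pair and back. The function
names are made up for illustration; they are not part of Python, and
this is not claimed to be how the interpreter does it internally.

```python
def to_surrogate_pair(codepoint):
    """Split a codepoint in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair into the codepoint it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the original question: u'\udbff\udc00'
print(hex(from_surrogate_pair(0xDBFF, 0xDC00)))   # 0x10fc00
print(from_surrogate_pair(0xDBFF, 0xDC00))        # 1113088
```

So u'\udbff\udc00' and u'\U0010fc00' (unichr(1113088) on a wide build)
are two spellings of the same codepoint.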
If, on the other hand, you're using 32 bits per codepoint, like in a
wide build of Python, then the codepoints above U+FFFF _can_ be
represented directly, so surrogate pairs aren't needed and, indeed,
shouldn't be there.
What you're seeing in the wide build is Python replacing a surrogate
pair with the single codepoint that it represents, which is why the
string ends up one character shorter. That's actually the right thing
to do because, as I said, the surrogate pairs really shouldn't be there.
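
The one-character shift can be reproduced with plain integers. This is
only a rough sketch of the idea, not the interpreter's actual code, and
combine_surrogates is a hypothetical helper name: it walks a list of
16-bit values and merges each high/low surrogate pair into one codepoint.

```python
def combine_surrogates(units):
    """Merge each high/low surrogate pair in a list of 16-bit values."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            # Two 16-bit values collapse into a single codepoint.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # Ordinary values (and unpaired surrogates) pass through as-is.
            result.append(unit)
            i += 1
    return result


units = [0x41, 0xDBFF, 0xDC00, 0x42]    # 'A', the surrogate pair, 'B'
merged = combine_surrogates(units)
print("%d -> %d" % (len(units), len(merged)))   # 4 -> 3
print([hex(u) for u in merged])                 # ['0x41', '0x10fc00', '0x42']
```

Everything after the pair moves down by one position, which matches the
shift described in the question.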