xsl and unicode surrogate characters

Sakcee sakcee at gmail.com
Thu Jan 5 01:44:10 EST 2006


thanks very much for the info, it really helped

we are using the text from file to display on webpage and we have a
method for conversion the parsed data to utf-8 and then displaying, all
the data looks fine after parsing except the
surrogate pair,
since i can not guess what it was supposed to be , is it ok to strip it
 using regex re.complie(' [\xed|\xa0] ')?




Martin v. Löwis wrote:
> Sakcee wrote:
> > Hi
> >
> > In one of the data files that I have , I am seeing these characters
> > \xed\xa0\xa0 .  They seem to break the xsl.
> [...]
> > is this a unicode utf-16 surrogate pair ?
>
> Yes and no. This is the UTF-8 encoding of U+D820, which is a high
> surrogate code point. So yes. It's not yet a pair; there would have to
> be a second such code point. So no.
>
> Furthermore, in UTF-8, you should never ever have encoded surrogate
> codes; instead, whoever generated the UTF-8 should have combined the
> two surrogate code point into a single coded character, and should
> have encoded *that* character. So no - this byte sequence isn't
> even valid UTF-8.
>
> > for displaying it on xml/xsl, should I extract only \xa0?
>
> You should tell your parser to reject the file as ill-formed.
>
> > since this is hingher than 00-7f range can i just strip it?
>
> Depending an what you want to achieve: sure! It will modify
> the meaning of the bytes, of course.
>
> > under what condition the encoding software put this string in?
> 
> If it has a bug.
> 
> Regards,
> Martin




More information about the Python-list mailing list