How to make a Unicode string valid for XML?

Hello,

I'm looking for a function like xml_unicode(some_unicode_string, 'ignore') that works like unicode(some_string, 'utf8', 'ignore'). Does lxml export such a function? I looked around the source but I didn't see any.

Best regards,
Burak

Burak Arslan wrote on 20.03.2015 at 17:50:
> I'm looking for a function like xml_unicode(some_unicode_string, 'ignore') that works like unicode(some_string, 'utf8', 'ignore'). Does lxml export such a function? I looked around the source but I didn't see any.
Are you asking for a function that would strip illegal characters from a string? There's no such tool in lxml (and I'm not sure there should be one), but you can easily implement it with a regex that replaces everything except the allowed characters with an empty string. The allowed ranges are listed here:

http://www.w3.org/TR/REC-xml/#charsets

Something like this (the character class mirrors the Char production at that link, including the U+E000 to U+FFFD range):

    import re
    from functools import partial

    regex = u"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
    strip_non_xml_chars = partial(re.compile(regex).sub, '')

Stefan
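
For readers who want the 'ignore'-style interface from the original question, here is a minimal Python 3 sketch of the regex approach from the reply. The name xml_unicode and its errors and replacement parameters are hypothetical, taken from the question's wording; this is not an lxml API. The character class follows the XML 1.0 Char production from the linked spec section.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_unicode(text, errors="ignore", replacement="\uFFFD"):
    """Hypothetical helper (not part of lxml): make `text` XML 1.0 safe.

    errors="ignore"  drops illegal characters,
    errors="replace" substitutes `replacement` for each of them,
    errors="strict"  raises ValueError on the first illegal character.
    """
    if errors == "ignore":
        return _ILLEGAL_XML_CHARS.sub("", text)
    if errors == "replace":
        return _ILLEGAL_XML_CHARS.sub(replacement, text)
    if errors == "strict":
        match = _ILLEGAL_XML_CHARS.search(text)
        if match is not None:
            raise ValueError(
                "illegal XML character %r at position %d"
                % (match.group(), match.start())
            )
        return text
    raise ValueError("unknown error mode: %r" % (errors,))
```

Calling xml_unicode(some_unicode_string, 'ignore') then behaves like strip_non_xml_chars above. The 'replace' and 'strict' modes are an assumption about what a codec-like interface might offer, modelled on Python's own error handlers.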
Participants (2): Burak Arslan, Stefan Behnel