how to make a unicode string valid for xml?
Hello,

I'm looking for a function like xml_unicode(some_unicode_string, 'ignore') that works like unicode(some_string, 'utf8', 'ignore'). Does lxml export such a function? I looked around the source but I didn't see any.

Best regards,
Burak
Burak Arslan wrote on 20.03.2015 at 17:50:
Are you asking for a function that would strip illegal characters from a string? There's no such tool in lxml (and I'm not sure there should be one), but you can easily implement it with a regex by replacing everything except the allowed characters with an empty string.

http://www.w3.org/TR/REC-xml/#charsets

Something like this:

    import re
    from functools import partial

    regex = u"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
    strip_non_xml_chars = partial(re.compile(regex).sub, u'')

Stefan
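For anyone who wants the 'ignore' / 'replace' behaviour asked about in the question, here is a minimal sketch along the same lines. It assumes Python 3. The name xml_unicode and its errors argument are hypothetical, borrowed from the question; they are not part of lxml. The character class follows the Char production in the XML 1.0 spec linked above.

```python
import re

# Everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_unicode(text, errors="strict"):
    """Return text restricted to characters that XML 1.0 allows.

    Hypothetical helper (not an lxml API). The errors argument mimics
    the codec error handlers:
      'strict'  - raise ValueError on the first illegal character
      'ignore'  - drop illegal characters
      'replace' - substitute U+FFFD for each illegal character
    """
    if errors == "ignore":
        return _ILLEGAL_XML_CHARS.sub("", text)
    if errors == "replace":
        return _ILLEGAL_XML_CHARS.sub("\ufffd", text)
    if errors == "strict":
        match = _ILLEGAL_XML_CHARS.search(text)
        if match is not None:
            raise ValueError(
                "illegal XML character %r at position %d"
                % (match.group(), match.start())
            )
        return text
    raise ValueError("unknown error handler: %r" % (errors,))
```

With 'ignore', control characters such as NUL and vertical tab are dropped, while tab, newline, carriage return and characters outside the BMP pass through unchanged. Whether silently dropping data is acceptable depends on the application, which is probably one reason not to ship this in lxml itself.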
participants (2): Burak Arslan, Stefan Behnel