March 5, 2018
2:58 a.m.
On 14/02/18 17:27, Peng Yu wrote:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 25-26: surrogates not allowed
I haven't read the rest of the thread, but this error specifically is not lxml's fault. The error message is clear: surrogates are not allowed. So you need to strip them before feeding the text to lxml.

https://stackoverflow.com/a/3158428

Here's what I came up with:

XML10_RE = re.compile(u'[^\u0009\n\u0020-\ud7ff\U00010000-\U0010FFFF]', flags=re.UNICODE)

Some additional XML-related regexps if you need them:

https://github.com/arskom/spyne/blob/9ce69afe4fa7139fb1d0c968e66150e1ee19b99...

HTH,
Burak
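
For anyone landing here from a search, a minimal sketch of the "strip before feeding to lxml" step follows. It is an illustration, not code from the thread: the names INVALID_XML10_RE and strip_invalid_xml_chars are hypothetical, and the character class is taken from the Char production of the XML 1.0 specification (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). That makes it slightly wider than the regex quoted in the post, since it also keeps carriage returns and U+E000 to U+FFFD. It assumes Python 3 and uses only the standard library.

```python
import re

# Matches every character that is NOT allowed by the XML 1.0 "Char"
# production. Lone surrogates (U+D800-U+DFFF) and most control
# characters fall into this set.
INVALID_XML10_RE = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Return `text` with XML-1.0-invalid characters replaced.

    Hypothetical helper: by default the offending characters are dropped.
    """
    return INVALID_XML10_RE.sub(replacement, text)


# In Python 3, "\ud83d\ude00" is two lone surrogates, not one emoji;
# encoding this string to UTF-8 raises UnicodeEncodeError.
dirty = "price: \ud83d\ude00 ok\x00"
clean = strip_invalid_xml_chars(dirty)

# After stripping, the text encodes without error and should be safe
# to hand to an XML library.
encoded = clean.encode("utf-8")
```

Dropping the characters silently loses data. Passing replacement="\ufffd" instead keeps a visible marker where something was removed, which may be preferable when the input comes from a source you want to audit later.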