April 8, 2009
6:50 p.m.
The following seems wrong to me. I have a UTF-8 encoded byte string of HTML containing the word 'Français':
html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'
I feed it to lxml.html:
root = lxml.html.fromstring(html)
When I get the text back from lxml, it is a unicode string, but the UTF-8 bytes have not been decoded:
>>> root.text_content()
u'Fran\xc3\xa7ais'
The expected output would be decoded unicode, i.e. the result of:
>>> 'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'
Or I could just get back the encoded UTF-8 string 'Fran\xc3\xa7ais' unchanged.
Either of these results would make sense and work for me. But the actual result is an odd confusion of the two: a unicode string whose code points are the raw UTF-8 byte values. Is this an lxml problem, or have I misunderstood something?
Thanks,
Adam
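The mismatch described above can be reproduced without lxml. The sketch below is only an illustration, not a diagnosis. It shows that the reported value u'Fran\xc3\xa7ais' is exactly what the UTF-8 bytes become when each byte is decoded as a separate Latin-1 character. Whether lxml actually falls back to a single-byte encoding here is an assumption, not something the post confirms. The variable names (raw, as_utf8, as_latin1, recovered) are made up for the example.

```python
# The raw UTF-8 bytes of the title text from the post.
raw = b'Fran\xc3\xa7ais'

# Decoding as UTF-8 gives the expected text: the two bytes C3 A7
# become the single character U+00E7 (c with cedilla).
as_utf8 = raw.decode('utf-8')

# Decoding the same bytes as Latin-1 maps every byte to one code point,
# which reproduces the value reported from text_content().
as_latin1 = raw.decode('latin-1')

# Assumption: if the text really was mis-decoded as Latin-1, the step
# can be reversed by re-encoding as Latin-1 and decoding as UTF-8.
recovered = as_latin1.encode('latin-1').decode('utf-8')

print(repr(as_utf8))
print(repr(as_latin1))
print(recovered == as_utf8)
```

If the Latin-1 reading is right, the round trip in the last step would recover the intended string from the odd result, although making sure the parser is told the input encoding would be the cleaner fix.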