[lxml-dev] Unicode oddness

April 8, 2009

The following seems wrong to me.

I have a UTF-8 encoded string of HTML containing the word 'Français':
>>> html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'
I feed it to lxml.html:
>>> import lxml.html
>>> root = lxml.html.fromstring(html)
When I get the text back from lxml, it is a unicode string, but the UTF-8 
bytes have not been decoded:
>>> root.text_content()
u'Fran\xc3\xa7ais'
The expected output would be decoded unicode, i.e. the result of:
>>> 'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'
Alternatively, I could just get back the UTF-8 encoded byte string 
'Fran\xc3\xa7ais'.
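
To make the difference between the two outcomes concrete, here is a minimal sketch that uses only the standard library (no lxml). It shows that the odd value is exactly what comes out when the UTF-8 bytes are decoded one byte per character, as Latin-1 does. Whether lxml actually does this internally is an assumption; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The title text from the example, as UTF-8 encoded bytes.
raw = b'Fran\xc3\xa7ais'

# Expected result: the two bytes C3 A7 become the single character U+00E7.
as_utf8 = raw.decode('utf-8')
assert as_utf8 == u'Fran\xe7ais'

# Observed result: each byte is mapped to its own code point, which is
# what a Latin-1 (ISO-8859-1) decode does. (Assumption: this mirrors
# whatever produced the value returned by text_content().)
as_latin1 = raw.decode('latin-1')
assert as_latin1 == u'Fran\xc3\xa7ais'

# The mis-decoding is reversible: re-encode as Latin-1 to get the
# original bytes back, then decode them as UTF-8.
repaired = as_latin1.encode('latin-1').decode('utf-8')
assert repaired == as_utf8
```

The round trip in the last step only works because Latin-1 maps every byte value to a code point, so no information is lost by the wrong decode.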

Either of these results would make sense and work for me. But the actual 
result is an odd mix of the two: a unicode string whose characters are the 
raw UTF-8 bytes. Is this an lxml problem, or have I misunderstood something?

Thanks, Adam