[lxml-dev] Unicode oddness
The following seems wrong to me: I have a utf-8 encoded string with html containing the word 'Français':
html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'
I feed it to lxml.html:
root = lxml.html.fromstring(html)
When I get the text from lxml, it is a unicode string, but it has not been decoded!:
root.text_content()
u'Fran\xc3\xa7ais'
The expected output would be decoded unicode, i.e. the result of:
'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'
Or just get back the encoded utf-8 string 'Fran\xc3\xa7ais'.
Either of these results would make sense and work for me. But the result is an odd confusion of the two. Is this an lxml problem, or have I misunderstood something?
Thanks,
Adam
Adam wrote:
I have a utf-8 encoded string with html containing the word 'Français':
html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'
I feed it to lxml.html:
root = lxml.html.fromstring(html)
When I get the text from lxml, it is a unicode string, but it has not been decoded!:
root.text_content()
u'Fran\xc3\xa7ais'
Your HTML snippet lacks a <meta> tag, so the HTMLParser has no way of knowing what encoding your HTML snippet uses. It therefore falls back to assuming Latin-1. If your snippet was encoded in Latin-1, you'd be quite happy about this default.
If you know the encoding in advance, you can create your own parser instance and pass it the "encoding" keyword option.
There are tools that can try to detect an encoding from a string that you pass in, e.g. chardet. It is, however, impossible for any tool in the world to always recover the missing encoding information for all possible data.
Stefan
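The effect described above can be reproduced without lxml at all. The following is only a minimal sketch, written in Python 3 syntax (the thread itself uses Python 2 string literals); the variable names are illustrative and not part of any API. It shows what happens when UTF-8 bytes are decoded as Latin-1:

```python
# UTF-8 encoded bytes for the word 'Français', as in the original snippet.
raw = b'Fran\xc3\xa7ais'

# Latin-1 maps every byte to exactly one code point, so the two bytes
# that encode 'ç' in UTF-8 come out as two separate characters.
as_latin1 = raw.decode('latin-1')

# Decoding with the codec the data was actually written in joins them.
as_utf8 = raw.decode('utf-8')

print(ascii(as_latin1))  # 'Fran\xc3\xa7ais'
print(ascii(as_utf8))    # 'Fran\xe7ais'

# If Latin-1 really was the codec applied by mistake, the damage is
# reversible: re-encode to the original bytes and decode correctly.
repaired = as_latin1.encode('latin-1').decode('utf-8')
print(repaired == as_utf8)  # True
```

The first printed value matches the "odd confusion" reported in the question: a unicode string whose code points are the raw UTF-8 byte values.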
Stefan Behnel <stefan_ml <at> behnel.de> writes:
Your HTML snippet lacks a <meta> tag, so the HTMLParser has no way of knowing what encoding your HTML snippet uses. It therefore falls back to assuming Latin-1. If your snippet was encoded in Latin-1, you'd be quite happy about this default.
If you know the encoding in advance, you can create your own parser instance and pass it the "encoding" keyword option.
Of course! Thank you, I had a feeling I was overlooking something simple.
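For reference, the parser option suggested in the reply might be used roughly as follows. This is a sketch in Python 3 syntax, assuming lxml is installed and that the input is known in advance to be UTF-8; `title_text` is an illustrative name only.

```python
import lxml.html

# The same snippet as in the question, as UTF-8 encoded bytes.
html = b'<html><head><title>Fran\xc3\xa7ais</title></head></html>'

# Tell the parser the encoding explicitly instead of relying on a
# <meta> tag or on the parser's default.
parser = lxml.html.HTMLParser(encoding='utf-8')
root = lxml.html.fromstring(html, parser=parser)

title_text = root.text_content()
print(ascii(title_text))  # 'Fran\xe7ais'
```

When the encoding is known, another option is to decode the bytes first, e.g. `html.decode('utf-8')`, and hand the resulting unicode string to `lxml.html.fromstring`.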
participants (2)
- Adam
- Stefan Behnel