encoding in lxml

Stefan Behnel stefan_ml at behnel.de
Mon Nov 3 21:01:05 CET 2008


jasiu85 wrote:
> I have a problem with character encoding in LXML. Here's how it goes:
> 
> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.

You can instantiate your own HTML parser, e.g. lxml.etree.HTMLParser, and pass
encoding="utf-8". That way, when it's not UTF-8, you will get an exception at
parse time, which allows you to reparse the document with another encoding
(say, ISO-8859-1) to get the correct content.
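
The fallback idea can be sketched without depending on how the parser
reports bad input: decode the bytes strictly as UTF-8 in plain Python and,
if that fails, decode them as ISO-8859-1 instead. This is a variation on the
suggestion above, not the parser-level approach itself; the function name
decode_with_fallback and the sample bytes are made up for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes raw strictly."""
    for enc in encodings:
        try:
            # bytes.decode() is strict by default and raises on invalid input.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Hypothetical sample input: the same markup in two different encodings.
good = "<p>caf\u00e9</p>".encode("utf-8")
bad = "<p>caf\u00e9</p>".encode("iso-8859-1")

text_good, enc_good = decode_with_fallback(good)
text_bad, enc_bad = decode_with_fallback(bad)
```

The decoded text can then be handed to the HTML parser. Note that ISO-8859-1
accepts any byte sequence, so it works as a last resort but cannot confirm
that the guess is right.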

Stefan


