Changing encoding while parsing an HTML string - lxml - The Python XML Toolkit

May 24, 2013

If I understand correctly how the XML parser works, when it parses an XML
string that begins with a declaration like

<?xml version="1.0" encoding="utf-8"?>

it will parse the rest of the XML string using the encoding named in that
declaration. So, first of all, does the XML parser actually do this?
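
A minimal check of this behaviour, assuming lxml is installed and the input is passed as bytes rather than as a decoded str. The sample document and its ISO-8859-1 declaration are made up for illustration; the byte 0xE9 is "é" in that encoding, so it only comes out right if the parser follows the declaration.

```python
from lxml import etree

# Hypothetical sample: the declaration names ISO-8859-1, and the single
# byte 0xE9 encodes "é" in that charset (it would be invalid as UTF-8).
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><root>caf\xe9</root>'

# No encoding is passed to the parser; it has to come from the declaration.
root = etree.fromstring(xml_bytes)
print(root.text)
```

If the declared encoding were ignored and UTF-8 assumed, the lone 0xE9 byte would not decode, so getting readable text back suggests the declaration is being used.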

If so, I'd like to know if something like this can be done, either
automatically or manually, with the HTML parser.

I have been parsing HTML strings that have been obtained from a random
sampling of web sites, and I am wondering if there is a way to do something
similar with these HTML strings. Typically, these HTML strings do not start
with a header like that, but most of them do have a meta tag of the form

<meta charset="utf-8">

or

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >

I wouldn't expect that the HTML parser reads these tags and automatically
sets the encoding to be used while parsing the rest of the HTML string (but
I might as well ask: does it?). I could include code in my target's start
tag handler that checks meta tags to see if they contain encoding
information and calls the parser to set its encoding, but as far as I can
see there isn't a way to set the encoding for an HTML parser after it has
been created. That is, it looks like you can do this

parser = etree.HTMLParser(target=my_target, encoding=my_encoding)

but not this

parser = etree.HTMLParser(target=my_target)
# .... figure out what value my_encoding should have ....
parser.set_encoding(my_encoding)  # no such method, as far as I can see

So I guess my underlying question is: how can I parse HTML strings and give
the parser the best chance of using the right encoding for the content in
the <body> .... </body> section?
Thanks,
Mike
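
One possible workaround, sketched here under assumptions rather than as lxml's own mechanism: scan the raw bytes for a meta charset before creating the parser, so the encoding is known at construction time. The helper name `sniff_meta_charset`, the 4096-byte scan limit, and the UTF-8 fallback are all hypothetical choices. The regular expression is only a rough match for the two meta tag forms shown above, not a full HTML charset sniffer.

```python
import codecs
import re

# Rough pattern covering both <meta charset="..."> and the "charset=..."
# part of an http-equiv Content-Type meta tag.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(html_bytes, default="utf-8", scan_limit=4096):
    """Return the charset named in an early <meta> tag, or `default`."""
    match = _META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        # Only checks that Python knows the name; the set of names the
        # HTML parser accepts may differ.
        codecs.lookup(name)
    except LookupError:
        return default
    return name


if __name__ == "__main__":
    sample = b'<html><head><meta charset="utf-8"></head><body>hi</body></html>'
    print(sniff_meta_charset(sample))
```

The returned name could then be passed as `encoding=` when constructing `etree.HTMLParser(target=my_target, encoding=...)`, with the same bytes fed to that parser afterwards. This costs a second look at the first few kilobytes but avoids having to change the encoding of a parser that already exists.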

Participants (2): Michael O'Leary, Stefan Behnel