Changing encoding while parsing an HTML string

If I understand how the XML parser works, if it parses an XML string that begins with a header line like
<?xml version="1.0" encoding="utf-8"?>
it will parse the rest of the XML string using the encoding it has found. So first of all, does the XML parser do this?

If so, I'd like to know if something like this can be done, either automatically or manually, with the HTML parser. I have been parsing HTML strings that have been obtained from a random sampling of web sites, and I am wondering if there is a way to do something similar with these HTML strings. Typically, these HTML strings do not start with a header like that, but most of them do have a meta tag of the form
<meta charset="utf-8">
or
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >

I wouldn't expect that the HTML parser reads these tags and automatically sets the encoding to be used while parsing the rest of the HTML string (but I might as well ask: does it?).

I could include code in my target's start tag handler that checks meta tags to see if they contain encoding information and calls the parser to set its encoding, but as far as I can see there isn't a way to set the encoding for an HTML parser after it has been created. That is, it looks like you can do this
parser = etree.HTMLParser(target=my_target, encoding=my_encoding)
but not this
parser = etree.HTMLParser(target=my_target)
.... figure out what value my_encoding should have ....
parser.set_encoding(my_encoding)

So I guess my underlying question is, how can I parse HTML strings and give the parser the best chance to use the right encoding for the part of the HTML string that is in the <body> .... </body> section of it?

Thanks,
Mike

Michael O'Leary, 24.05.2013 21:22:
If I understand how the XML parser works, if it parses an XML string that begins with a header line like
<?xml version="1.0" encoding="utf-8"?>
it will parse the rest of the XML string using the encoding it has found. So first of all, does the XML parser do this?
It's an XML parser, so, yes, it does.
If so, I'd like to know if something like this can be done, either automatically or manually, with the HTML parser.
I have been parsing HTML strings that have been obtained from a random sampling of web sites, and I am wondering if there is a way to do something similar with these HTML strings. Typically, these HTML strings do not start with a header like that, but most of them do have a meta tag of the form
<meta charset="utf-8">
or
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >
The latter is the normal way of doing it in HTML.
I wouldn't expect that the HTML parser reads these tags and automatically sets the encoding to be used while parsing the rest of the HTML string (but I might as well ask: does it?).
Yes, it does.
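A minimal sketch of the automatic detection described here, assuming lxml is installed and the page is handed to the parser as raw bytes rather than as an already-decoded string. The sample document and the variable names are made up for illustration.

```python
from lxml import etree

# A UTF-8 encoded page that declares its charset in a content-type meta tag.
html_bytes = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    '</head><body><p>Gr\u00fc\u00dfe</p></body></html>'
).encode("utf-8")

# No encoding is passed to the parser; it is left to pick up the meta tag.
root = etree.fromstring(html_bytes, etree.HTMLParser())
text = root.findtext(".//p")
print(text)
```

If the declared charset is picked up, the paragraph text comes back as the original characters rather than as mis-decoded bytes.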
I could include code in my target's start tag handler that checks meta tags to see if they contain encoding information and calls the parser to set its encoding, but as far as I can see there isn't a way to set the encoding for an HTML parser after it has been created. That is, it looks like you can do this
parser = etree.HTMLParser(target=my_target, encoding=my_encoding)
but not this
parser = etree.HTMLParser(target=my_target)
.... figure out what value my_encoding should have ....
parser.set_encoding(my_encoding)
So I guess my underlying question is, how can I parse HTML strings and give the parser the best chance to use the right encoding for the part of the HTML string that is in the <body> .... </body> section of it?
It should do it automatically if it can, defaulting to Latin-1 if there is no content-type meta tag that specifies the encoding.
If you know the encoding in advance for some reason, you can configure an HTMLParser with it and pass it into the parse() function (for example). If you have different encodings in your files, use differently configured parsers.
You cannot change the encoding once the parser has started parsing. If you feel like changing it along the way, I suggest you just start over.
You might also want to consider using iterparse() instead of passing a target object into the parser. It has less overhead.
Stefan
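The advice about differently configured parsers could be sketched roughly as follows, assuming lxml is installed. The helpers `parser_for` and `parse_html` and the `_parsers` cache are hypothetical names introduced only for this example; they are not part of lxml.

```python
from io import BytesIO
from lxml import etree

# Hypothetical cache: one HTMLParser per encoding, created on first use.
_parsers = {}

def parser_for(encoding=None):
    """Return an HTMLParser fixed to `encoding` (None leaves detection to the parser)."""
    if encoding not in _parsers:
        _parsers[encoding] = etree.HTMLParser(encoding=encoding)
    return _parsers[encoding]

def parse_html(data, encoding=None):
    """Parse raw HTML bytes with the parser configured for `encoding`."""
    return etree.parse(BytesIO(data), parser_for(encoding))

# A page without any charset declaration that is known from elsewhere
# (for example an HTTP header) to be UTF-8.
page = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")
tree = parse_html(page, encoding="utf-8")
body_text = tree.findtext(".//p")
print(body_text)
```

"Starting over" then simply means calling `parse_html` again on the same bytes with a different encoding once a better guess is available, rather than trying to reconfigure a parser in mid-parse.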
Participants (2): Michael O'Leary, Stefan Behnel