If I understand how the XML parser works, when it parses an XML string that begins with a header line (the XML declaration) like

<?xml version="1.0" encoding="utf-8"?>

it will parse the rest of the XML string using the encoding named in that header. So first of all, does the XML parser do this?
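As a point of comparison (this is not lxml itself), the standard library's xml.etree.ElementTree does honour the declaration when it is given bytes, which is the behaviour I am assuming lxml's XML parser shares. A minimal sketch, where the sample document and the variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 on its own,
# so the text can only come out right if the declared encoding is used.
raw = b'<?xml version="1.0" encoding="iso-8859-1"?><word>caf\xe9</word>'

root = ET.fromstring(raw)
text = root.text
print(text)
```

Note that this only applies when the input is bytes; a string that has already been decoded has no encoding left to choose.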

If so, I'd like to know if something like this can be done, either automatically or manually, with the HTML parser.

I have been parsing HTML strings that have been obtained from a random sampling of web sites, and I am wondering if there is a way to do something similar with these HTML strings. Typically, these HTML strings do not start with a header like that, but most of them do have a meta tag of the form

<meta charset="utf-8">

or

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >

I wouldn't expect that the HTML parser reads these tags and automatically sets the encoding to be used while parsing the rest of the HTML string (but I might as well ask: does it?). I could include code in my target's start tag handler that checks meta tags to see if they contain encoding information and calls the parser to set its encoding, but as far as I can see there isn't a way to set the encoding for an HTML parser after it has been created. That is, it looks like you can do this

parser = etree.HTMLParser(target=my_target, encoding=my_encoding)

but not this

parser = etree.HTMLParser(target=my_target)
.... figure out what value my_encoding should have ....
parser.set_encoding(my_encoding)
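
To make the "figure out what value my_encoding should have" step concrete, here is roughly what I have in mind: a sketch that assumes I still have the page as undecoded bytes and that the meta tag appears near the top. The function name sniff_meta_charset, the 4096-byte scan limit and the regular expression are my own inventions, not part of lxml, and the pattern is deliberately loose rather than a full implementation of any sniffing standard.

```python
import codecs
import re

# Matches both <meta charset="..."> and the http-equiv form, where
# "charset=" appears inside the content attribute.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, default=None, scan_limit=4096):
    """Return the charset label from an early <meta> tag, or `default`.

    `raw` must be bytes; only the first `scan_limit` bytes are searched.
    """
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return default
    label = match.group(1).decode("ascii").lower()
    try:
        # Only used to reject labels Python does not recognise;
        # the label itself is returned as the page wrote it (lowercased).
        codecs.lookup(label)
    except LookupError:
        return default
    return label
```

The result could then be passed to the constructor form that does work, etree.HTMLParser(target=my_target, encoding=my_encoding), by sniffing before the parser is created instead of from inside the target's start tag handler.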

So I guess my underlying question is: how can I parse HTML strings so that the parser has the best chance of using the right encoding for the <body> .... </body> section?
Thanks,
Mike