I'm using lxml 1.1.2. Note what happens to the "p" tag:
    from StringIO import StringIO
    from lxml import etree

    html = "<head><body><p><div/></p><br></body></html>"
    parser = etree.HTMLParser()
    et = etree.parse(StringIO(html), parser)
    print etree.tostring(et.getroot())

which prints:

    <html><head/><body><p/><div/><br/></body></html>
Now, p tags aren't supposed to contain block-level elements:

http://www.w3.org/TR/html401/struct/text.html#h-9.3.1

But the page I'm seeing in the wild is structured that way, and I'd really like to get a tree that represents the original file as closely as possible, even if it's semantically incorrect HTML. (I like that it closes <br> tags and such, but I'd really like to be able to, say, round-trip the data.)

Any idea if this is possible? Should I be taking this up with the libxml2 folks?

Thanks,
Eli
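P.S. For anyone who wants to confirm that the nesting really is in the source and not introduced by the parser, here is a minimal sketch. It uses only the standard library's html.parser (Python 3 syntax, unlike the snippet above), which reports tags in the order they appear without restructuring them. The names BLOCK_LEVEL, BlockInParagraphFinder and find_block_in_p are made up for illustration, the block-level list is deliberately partial, and the check is only a heuristic: it cannot tell a misplaced block element from a legitimately omitted </p>.

```python
from html.parser import HTMLParser

# A partial list of HTML 4.01 block-level elements (assumption: enough for a demo).
BLOCK_LEVEL = {"div", "ul", "ol", "table", "pre", "blockquote", "form", "hr"}


class BlockInParagraphFinder(HTMLParser):
    """Record block-level tags that start while a <p> is still open in the source."""

    def __init__(self):
        super().__init__()
        self.in_p = False      # True between a <p> and the next </p>
        self.violations = []   # (tag, (line, column)) pairs

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <div/> also arrive here, because the
        # default handle_startendtag delegates to handle_starttag.
        if self.in_p and tag in BLOCK_LEVEL:
            self.violations.append((tag, self.getpos()))
        if tag == "p":
            # A new <p> implicitly ends any previous one.
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False


def find_block_in_p(html):
    """Return the block-level tags found inside an open <p>, in source order."""
    finder = BlockInParagraphFinder()
    finder.feed(html)
    finder.close()
    return finder.violations
```

Run on the string from the example, find_block_in_p reports a single entry, for the div, which is the element that ends up moved out of the p in the parsed tree.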