New subject: [lxml-dev] Namespace handling problems in LXML 1.1.1

12 Jan 2007

This is using lxml 1.1.2; note the "p" tag:
...
...
...
# Python 2 syntax (print statement)
from StringIO import StringIO
from lxml import etree

html = "<head><body><p><div/></p><br></body></html>"
parser = etree.HTMLParser()
et = etree.parse(StringIO(html), parser)
print etree.tostring(et.getroot())

Output:

<html><head/><body><p/><div/><br/></body></html>

Now, <p> tags aren't supposed to contain block-level elements:

http://www.w3.org/TR/html401/struct/text.html#h-9.3.1

But the page that I'm seeing in the wild is structured that way, and I'd 
really like to get a tree that represents the original file as closely 
as possible, even if it's semantically incorrect HTML. I like that it 
closes <br> tags and such, but I'd really like to be able to, say, 
round-trip the data.

Any idea if this is possible?  Should I be taking this up with the 
libxml2 folks?
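
For comparison, here is a minimal sketch of what "a tree that mirrors the
source nesting" could look like. It does not use lxml or libxml2 at all: it
swaps in Python 3's standard-library html.parser and xml.etree.ElementTree.
The names AsWrittenBuilder and parse_as_written and the synthetic "document"
wrapper element are hypothetical, chosen only for illustration. It applies no
HTML content-model fix-ups, so a <div> written inside a <p> stays inside the
<p>.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take an end tag in HTML 4.01 (assumed list for this sketch).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class AsWrittenBuilder(HTMLParser):
    """Hypothetical helper: mirrors the nesting found in the source text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Synthetic wrapper so that fragments with several top-level tags fit.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)  # stays open until its end tag is seen

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <div/> are added but never left open.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_as_written(html):
    """Return the synthetic root of a tree that keeps the source nesting."""
    builder = AsWrittenBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    root = parse_as_written("<head><body><p><div/></p><br></body></html>")
    print(ET.tostring(root, encoding="unicode"))
```

Because nothing is corrected, the unclosed <head> in the sample ends up
containing <body>, and the stray </html> is dropped. This is faithful to the
text as written but is not what a browser or libxml2 would build, so treat it
as a way to inspect the source structure rather than a replacement parser.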

Thanks,
Eli

[lxml-dev] lxml HTMLParser changes the resulting tree

Eli Stevens (WG.c)
