[lxml-dev] lxml HTMLParser changes the resulting tree
This is using lxml 1.1.2, note the "p" tag:
    from StringIO import StringIO
    from lxml import etree

    html = "<head><body><p><div/></p><br></body></html>"
    parser = etree.HTMLParser()
    et = etree.parse(StringIO(html), parser)
    print etree.tostring(et.getroot())

Output:

    <html><head/><body><p/><div/><br/></body></html>
Now, p tags aren't supposed to contain block-level elements:

http://www.w3.org/TR/html401/struct/text.html#h-9.3.1

But the page that I'm seeing in the wild is structured that way, and I'd really like it if I could get a tree that represented the original file as closely as possible, even if it's semantically incorrect html (I like it closing <br> tags and such, but I'd really like to be able to, say, round-trip the data).

Any idea if this is possible? Should I be taking this up with the libxml2 folks?

Thanks,
Eli
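To see the nesting exactly as it is written in the source, without any HTML content-model fixes, one option is a plain tokenizer instead of a tree-building parser. The following is only a rough sketch using the Python 3 standard library's html.parser, not lxml or libxml2; NestingRecorder, VOID_TAGS and outline are hypothetical names made up for this illustration, and the void-tag list is deliberately partial.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML 4.01 (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class NestingRecorder(HTMLParser):
    """Hypothetical helper: records tags nested exactly as written."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])  # each node is (tag, children)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # "<div/>" style tags: add the node but do not descend into it.
        self.stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def outline(node):
    """Render a node as tag(child,child,...) for quick inspection."""
    tag, children = node
    if not children:
        return tag
    return tag + "(" + ",".join(outline(c) for c in children) + ")"


recorder = NestingRecorder()
recorder.feed("<head><body><p><div/></p><br></body></html>")
recorder.close()
print(outline(recorder.root))
```

In this sketch the div stays nested inside the p, and the never-closed head keeps everything after it as children, which mirrors the source literally rather than repairing it.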
Greetings!

I've run into a few snags related to namespace handling in lxml 1.1.1 (I'm using the bundled Windows distribution).

First, given an lxml tree object created from an xml file with a root element that goes something like this:

    <rootelement xmlns="ns1" xmlns:fred="ns2" xmlns:bob="ns3">

One might expect that the namespace definitions would be preserved upon reserialization. Unfortunately, they are not; what you get using the tostring() function is just <rootelement>.

Is there some function or keyword in lxml that will cause the namespace definitions to be re-mapped into the output? I've tried to find such a thing, if it exists, using the docstrings and introspection, but I've come up empty-handed.

Second, one might think that the nsmap attribute would be 'just the ticket' for performing xpath searches:

    elements = mytree.xpath('//fred:someelement', mytree.getroot().nsmap)

But this fails, because (following from the first example above) the nsmap attribute for <rootelement> yields

    {None:'ns1', 'fred':'ns2', 'bob':'ns3'}

Note: the above xpath search succeeds if the None:'ns1' namespace is deleted from nsmap. Using None as a key in nsmap causes it to be unusable as far as the xpath function is concerned (and probably some other places as well).

Is there any workaround, other than creating (and updating) my own namespace dictionary within my code?

Finally, if someone wants to suggest any 'best practices' for working with namespaces in lxml, I'd be very interested in reading them.
Sorry, I made a serious mistake in my last message. For clarity, toss the whole thing and use this instead:

Greetings!

I've run into a few snags related to namespace handling in lxml 1.1.1 (I'm using the bundled Windows distribution).

First, I had hoped that the namespaces defined in the tree's root element nsmap attribute could be automatically included upon re-serialization. For example, given this source xml:

    <root xmlns="ns1">
      <element />
    </root>

Split off <element> into a new tree:

    newtree = etree.ElementTree(element)

And then reserialize newtree:

    result = etree.tostring(newtree.getroot())

Even though the root element of newtree contains the same nsmap as the source tree, the resulting output is

    <element />

and not

    <element xmlns='ns1' />

Is there some function or keyword in lxml that will cause the namespace definitions in the nsmap attribute to be re-mapped into the output? I've tried to find such a thing, if it exists, using the docstrings and introspection, but I've come up empty-handed.

Second, one might think that the nsmap attribute would be 'just the ticket' for performing xpath searches:

    elements = mytree.xpath('//fred:someelement', mytree.getroot().nsmap)

But this fails, because (following from the first example above) the nsmap attribute for <rootelement> yields

    {None:'ns1', 'fred':'ns2', 'bob':'ns3'}

Note: the above xpath search succeeds if the None:'ns1' namespace is deleted from nsmap. Using None as a key in nsmap causes it to be unusable as far as the xpath function is concerned (and probably some other places as well).

Is there any workaround, other than creating (and updating) my own namespace dictionary within my code?
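XPath 1.0 has no default namespace, so the None entry in nsmap has no prefix that an expression could refer to. A minimal sketch of the workaround described above (dropping the None entry, or re-registering it under an explicit prefix) could look like the following; xpath_namespaces and the prefix "d" are hypothetical names chosen for this example, not part of lxml.

```python
def xpath_namespaces(nsmap, default_prefix=None):
    """Return a copy of nsmap without the None (default namespace) key.

    If default_prefix is given, the default namespace URI is kept and
    re-registered under that prefix so XPath expressions can refer to it.
    """
    cleaned = {prefix: uri for prefix, uri in nsmap.items() if prefix is not None}
    if default_prefix is not None and None in nsmap:
        cleaned[default_prefix] = nsmap[None]
    return cleaned


# The mapping reported for <rootelement> in the message above.
nsmap = {None: "ns1", "fred": "ns2", "bob": "ns3"}
print(xpath_namespaces(nsmap))
print(xpath_namespaces(nsmap, default_prefix="d"))
```

The result could then be passed along the lines of mytree.xpath('//fred:someelement', namespaces=xpath_namespaces(mytree.getroot().nsmap)), so the dictionary is derived from the tree on each call rather than maintained by hand.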
Greetings!

Here is the problem, as clearly as I can state it: I want to take an xml document, split an element out of it, and use that element as the root of a new document, AND preserve the original document's namespace definitions in the new document.

    from lxml import etree

    # --- Test 1 --- #
    tree1 = etree.ElementTree(etree.fromstring('<root xmlns="ns1"><element>boo</element></root>'))
    print tree1.getroot().nsmap
    print etree.tostring(tree1.getroot())
    tree2 = etree.ElementTree(tree1.getroot().find('{ns1}element'))
    print tree2.getroot().nsmap
    print etree.tostring(tree2.getroot())

    # --- Test 2 --- #
    nsdict = {'pre':'ns1'}
    tree1 = etree.ElementTree(etree.fromstring('<root xmlns="ns1"><element>boo</element></root>'))
    print tree1.getroot().nsmap
    print etree.tostring(tree1.getroot())
    tree2 = etree.ElementTree(tree1.getroot().xpath('//pre:element', nsdict)[0])
    print tree2.getroot().nsmap
    print etree.tostring(tree2.getroot())

    # --- Test 3 --- #
    nsdict = {'pre':'ns1'}
    tree1 = etree.ElementTree(etree.fromstring('<pre:root xmlns:pre="ns1"><pre:element>boo</pre:element></pre:root>'))
    print tree1.getroot().nsmap
    print etree.tostring(tree1.getroot())
    tree2 = etree.ElementTree(tree1.getroot().xpath('//pre:element', nsdict)[0])
    print tree2.getroot().nsmap
    print etree.tostring(tree2.getroot())

In each of the three tests above, the root element of tree2 does indeed contain an nsmap with the correct values, but they are not being written out during reserialization.

Is there any way to fix this, other than manually adding the definitions as simple attributes just before saving the new document?
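A quick way to check whether a given installation shows this behaviour is to serialize the split-off element and look for the namespace URI in the output. This is a minimal sketch of Test 1, assuming Python 3 and a current lxml release (the 1.1.x behaviour reported above may differ); the fallback import of the standard library's ElementTree is only there so the snippet still runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback so the sketch runs without lxml; same API for the calls used here.
    import xml.etree.ElementTree as etree

root = etree.fromstring('<root xmlns="ns1"><element>boo</element></root>')
element = root.find('{ns1}element')

# Serialize the sub-element on its own, as tree2 does in Test 1.
serialized = etree.tostring(element)
print(serialized)

# True when the namespace declaration was written out with the element.
print(b'"ns1"' in serialized)
```

If the second line printed is False, the declaration was dropped and the manual workaround mentioned above would still be needed for that version.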
participants (2)
- Eli Stevens (WG.c)
- Lee Brown