XML/XHTML/HTML differences, bugs... and howto
Stefan Behnel
stefan_ml at behnel.de
Thu Jan 24 01:42:22 EST 2013
Andrew Robinson, 23.01.2013 16:22:
> Good day :),
>
> I've been exploring XML parsers in python; particularly:
> xml.etree.cElementTree; and I'm trying to figure out how to do it
> incrementally, for very large XML files -- although I don't think the
> problems are restricted to incremental parsing.
>
> First problem:
> I've come across an issue where etree silently drops text without telling
> me; and separate.
>
> I am under the impression that XHTML is a subset of XML (eg:defined tags),
> and that once an HTML file is converted to XHTML, the body of the document
> can be handled entirely as XML.
>
> If I convert a (partial/contrived) html file like:
>
> <html>
> <div>
> <p> This is example <b>bold</b> text.
> </div>
> </html>
>
> to XHTML, I might do --right or wrong-- (1):
>
> <html>
> <div>
> <p /> This is example <b>bold</b> text.
> </div>
> </html>
>
> or, alternate difference: (2): "<p> This is example <b>bold</b> text. </p>"
>
> But, when I parse with etree, in example (1) both "This is example" and
> "text." are dropped;
> The missing text is part of the start, or end event tags, in the
> incrementally parsed method.
>
> Likewise: In example (2), only "text" gets dropped.
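Nope, nothing gets dropped. ElementTree stores the text that follows an
element's closing tag in that element's .tail attribute, not in .text.
You should read the manual on this. Here's a tutorial: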
http://lxml.de/tutorial.html#elements-contain-text
This is using lxml.etree, which is the Python XML library most people use
these days. It's ElementTree compatible, so the tutorial also works for ET
(unless stated otherwise).
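
To show where the "missing" text ends up, here is a minimal sketch using the
stdlib xml.etree.ElementTree on example (1) from the question. The variable
names are only for illustration; the <html> wrapper is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Example (1): <p /> is an empty element, so the text comes after it.
div = ET.fromstring("<div><p /> This is example <b>bold</b> text.</div>")
p = div.find("p")
b = div.find("b")

p_text = p.text  # None: the empty <p /> contains nothing
p_tail = p.tail  # text between <p /> and <b>
b_text = b.text  # text inside <b>...</b>
b_tail = b.tail  # text between </b> and </div>

# itertext() walks .text and .tail in document order.
all_text = "".join(div.itertext())
```

With incremental parsing, the same attributes apply, but an element's .tail
may not be filled in yet when its "end" event is delivered.
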
> Isn't XML supposed to error out when invalid xml is parsed?
It does.
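
As a quick check, feeding the original HTML snippet (with the unclosed <p>)
to the stdlib parser raises an exception instead of returning a tree. This
is a sketch with xml.etree.ElementTree; the names `broken` and `error` are
made up for the example.

```python
import xml.etree.ElementTree as ET

# The HTML from the question: <p> is never closed, so this is not
# well-formed XML.
broken = "<html><div><p> This is example <b>bold</b> text.</div></html>"

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    # The parser reports the mismatched closing tag and its position.
    error = exc
```
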
> I have an XML file which will grow larger than memory on a target machine,
> so here's what I want to do:
>
> Given a source XML file, and a destination file:
> 1) iteratively scan part of the source tree.
> 2) Optionally Modify some of scanned tree.
> 3) Write partial scan/tree out to the destination file.
> 4) Free memory of no-longer needed (partial) source XML.
> 5) continue scanning a new section of the source file... eg: goto step 1
> until source file is exhausted.
>
> But, I don't see a way to write portions of an XML tree, or iteratively
> write a tree to disk.
> How can this be done?
There are several ways to do it. Python has a couple of external libraries
available that are made specifically for generating markup incrementally.
lxml also gained that feature recently. It's not documented yet, but here
are usage examples:
https://github.com/lxml/lxml/blob/master/src/lxml/tests/test_incremental_xmlfile.py
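
For comparison, the five steps from the question can also be sketched with
the stdlib alone. This is not the lxml feature mentioned above, just a rough
illustration using ET.iterparse(). The function name `stream_copy` and its
`modify` callback are hypothetical, and the sketch assumes that the root
element has no attributes or namespaces and that whitespace between its
children does not matter.

```python
import io
import xml.etree.ElementTree as ET


def stream_copy(source, dest, modify=None):
    """Copy an XML document one top-level child at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the root element
    # Assumption: the root has no attributes, so a bare tag is enough.
    dest.write("<%s>" % root.tag)
    depth = 0  # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:  # a direct child of the root is complete
            if modify is not None:
                modify(elem)  # step 2: optionally modify
            elem.tail = None  # inter-child whitespace is not kept
            dest.write(ET.tostring(elem, encoding="unicode"))  # step 3
            root.clear()  # step 4: drop processed children
    dest.write("</%s>" % root.tag)


# Usage with in-memory files; real code would pass open file objects.
src = io.BytesIO(b"<root><item>1</item><item>2</item></root>")
out = io.StringIO()
stream_copy(src, out, modify=lambda e: e.set("seen", "1"))
result = out.getvalue()
```

Memory use is then bounded by the largest direct child of the root rather
than by the whole document.
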
Stefan