xml.dom.minidom.parse() splitting text nodes?
Martin v. Löwis
martin at v.loewis.de
Fri Jan 17 10:09:01 CET 2003
hawkeye.parker at autodesk.com writes:
> has anyone else run across this issue? can you explain it?
The text nodes are created as the underlying parser (Expat) reports
chunks of text data.
Those data are chunked for various reasons:
- if you have character references or entity references, everything
up to the reference will be reported as a chunk, then the referenced
data will be reported as a chunk, and everything after it will be reported
as a chunk.
- Expat buffers the input in blocks. Every time the block is exhausted,
its data is reported as a chunk.
You are likely seeing the second case.
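The first case can be observed by talking to Expat directly through
the standard xml.parsers.expat module. This is only a rough sketch:
collect_chunks is a name made up for illustration, and the exact
chunk boundaries depend on the Expat version and on buffering, so the
only thing that should be relied on is that the chunks concatenate
back to the full text.

```python
import xml.parsers.expat

def collect_chunks(xml_bytes):
    """Return the character-data chunks exactly as Expat reports them."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Make sure pyexpat does not merge consecutive chunks for us.
    parser.buffer_text = False
    # Expat calls this handler once per chunk of text data.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return chunks

# Hypothetical sample input containing an entity reference.
chunks = collect_chunks(b"<a>one &amp; two</a>")
print(chunks)           # usually more than one chunk around the reference
print("".join(chunks))  # the complete text, however it was split
```

A DOM builder that creates one text node per reported chunk ends up
with several adjacent text nodes for what looks like a single string.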
This is, strictly speaking, no bug: the DOM reader is entitled to
represent the document in such a way. The minidom implementation in
PyXML will, however, avoid splitting the text nodes if it can.
In general, this issue is what led to the introduction of the
.normalize method in the DOM; this merges adjacent text nodes
throughout the tree.
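As a small illustration of .normalize, the sketch below builds a
document by hand with two adjacent text nodes (standing in for what a
chunking parser might produce) and then merges them. The element name
and the text are invented for the example.

```python
from xml.dom import minidom

# Build <a> with two adjacent text nodes, as a chunking reader might.
doc = minidom.Document()
root = doc.createElement("a")
doc.appendChild(root)
root.appendChild(doc.createTextNode("one "))
root.appendChild(doc.createTextNode("two"))

before = len(root.childNodes)   # two separate text nodes

# normalize() merges adjacent text nodes in this subtree.
root.normalize()

after = len(root.childNodes)    # a single merged text node
text = root.firstChild.data
print(before, after, repr(text))
```

Calling normalize on the document element after parsing means code
that reads firstChild.data sees the whole text regardless of how the
parser chunked it.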