xml.dom.minidom.parse() splitting text nodes?

Martin v. Löwis martin at v.loewis.de
Fri Jan 17 04:09:01 EST 2003


hawkeye.parker at autodesk.com writes:

> has anyone else run across this issue?  can you explain it?

The text nodes are created as the underlying parser (Expat) reports
chunks of text data.

Those data are chunked for various reasons:
- if you have character references or entity references, everything
  up to the reference will be reported as a chunk, then the referenced
  data will be reported as a chunk, and everything after it will be reported
  as a chunk.
- Expat buffers the input in blocks. Every time the block is exhausted,
  its data is reported as a chunk.

You are likely seeing the second case.
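
To see the chunking for yourself, here is a minimal sketch that talks to
Expat directly through the standard xml.parsers.expat module and records
each call to the character data handler. The sample document and the
name "chunks" are made up for illustration. It shows the first case (a
reference splitting the text); the exact chunk boundaries depend on
Expat, so treat the printed list as indicative only.

```python
import xml.parsers.expat

# Each call Expat makes to the character data handler is one "chunk".
chunks = []

parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = False              # the default: report chunks as Expat delivers them
parser.CharacterDataHandler = chunks.append

# Illustrative document: the &amp; reference sits in the middle of the text.
parser.Parse("<a>one &amp; two</a>", True)

print(chunks)            # typically more than one string
print("".join(chunks))   # the pieces always join back to the full text
```

Whatever the boundaries turn out to be, joining the chunks gives back the
original character data, which is why a reader that creates one text
node per chunk is still representing the same document.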

This is, strictly speaking, not a bug: the DOM reader is entitled to
represent the document in such a way. The minidom implementation in
PyXML will, however, avoid splitting the text nodes if it can.

In general, this issue is what led to the introduction of the
.normalize method in the DOM; this merges adjacent text nodes
throughout the tree.
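
As a small sketch of what .normalize does, the following builds two
adjacent text nodes by hand (so the result does not depend on how the
parser happens to chunk its input) and then merges them. The element
name and the text are made up for illustration.

```python
from xml.dom import minidom

# Start from an empty element and add two adjacent text nodes by hand,
# mimicking what a reader may produce when the parser reports two chunks.
doc = minidom.parseString("<a/>")
root = doc.documentElement
root.appendChild(doc.createTextNode("one "))
root.appendChild(doc.createTextNode("two"))

before = len(root.childNodes)   # two separate text nodes

# normalize() merges adjacent text nodes below the node it is called on.
root.normalize()

after = len(root.childNodes)    # a single merged text node
merged = root.firstChild.data

print(before, after, repr(merged))
```

Calling doc.normalize() (or normalize() on the document element) right
after parse() should give you one text node per run of text, regardless
of how the data was chunked.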

Regards,
Martin




