just a bug (was: xml.dom.minidom: how to preserve CRLF's inside CDATA?)

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Fri May 25 05:45:23 EDT 2007


In <1180082137.329142.45350 at p77g2000hsh.googlegroups.com>, sim.sim wrote:

> Below is the code that tries to parse a well-formed XML document, but
> it fails with the error message:
> "not well-formed (invalid token): line 3, column 85"

How did you verify that it is well-formed?  `xmllint` barfs on it too.

> The "problem" is within the CDATA section: it contains part of a UTF-8
> encoded string which was split (widely used for memory-limited
> devices).
> 
> When minidom parses the XML string, it fails because it tries to convert
> the data within the CDATA section into unicode, instead of just returning
> the value of the section "as is". The conversion contradicts the
> specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect

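An XML document consists of Unicode characters, and so does a CDATA section.
CDATA is not meant for putting arbitrary bytes into a document.  It must
contain only valid characters as defined by the `Char` production,
http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA in
your link above).  A UTF-8 sequence that was cut in the middle does not
decode to any character, so the document is not well-formed.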
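
To illustrate, here is a minimal sketch, not your actual data: the
`fragment` bytes, the `<data>` element and the `parses` helper are made up
for the example, and it assumes Python 3.  It shows the parser rejecting a
truncated UTF-8 sequence inside CDATA, and one possible way around it
(Base64, which is only a suggestion on my part) if you really have to
carry raw bytes in the document.

```python
import base64
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical payload: a UTF-8 string cut in the middle of a two-byte
# character, leaving a dangling lead byte (b'h\xc3').
fragment = "h\u00e9llo".encode("utf-8")[:2]

HEADER = b'<?xml version="1.0" encoding="utf-8"?>'


def parses(document: bytes) -> bool:
    """Return True if minidom accepts the document, False otherwise."""
    try:
        minidom.parseString(document)
        return True
    except ExpatError:
        return False


# Raw bytes inside CDATA: the broken UTF-8 sequence is not a valid Char.
raw_doc = HEADER + b"<data><![CDATA[" + fragment + b"]]></data>"

# Same payload, Base64-encoded: the document now holds only ASCII characters.
b64_doc = HEADER + b"<data>" + base64.b64encode(fragment) + b"</data>"

print(parses(raw_doc))
print(parses(b64_doc))

# The receiver decodes the text node back into the original bytes.
dom = minidom.parseString(b64_doc)
recovered = base64.b64decode(dom.documentElement.firstChild.data)
print(recovered == fragment)
```

The Base64 variant costs about a third more space, but the fragments can
be split at any byte boundary without affecting well-formedness.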

Ciao,
	Marc 'BlackJack' Rintsch