Problem round-tripping with xml.dom.minidom pretty-printer

Robert Bossy Robert.Bossy at jouy.inra.fr
Fri Feb 29 12:50:31 EST 2008


Ben Butler-Cole wrote:
>> An additional thing to keep in mind is that toprettyxml does not print
>> an XML identical to the original DOM tree: it adds newlines and tabs.
>> When parsed again these blank characters are inserted in the DOM tree as
>> character nodes. If you toprettyxml an XML document twice in a row, then
>> the second one will also add newlines and tabs around the newlines and
>> tabs added by the first. Since you call toprettyxml an infinite number
>> of times, it is expected that lots of blank characters appear.
>>     
>
> Right. That's the behaviour I'm asking about, which I consider to be
> problematic. I would expect a module providing a parser and
> pretty-printer (not just for XML parsers) to be able to conservatively
> round-trip.
>
> As far as I can see (and your comments back this up) minidom doesn't
> have this property. Unless anyone knows how to get it to behave that
> way...
>   
minidom (or any DOM parser, for that matter) has no way to know whether a
blank character is a pretty-print artefact or actual blank content from
the original XML.
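
To make the effect concrete, here is a small sketch of my own (the sample
document and the variable names are made up, and it assumes a current
Python 3 minidom). It pretty-prints a tiny document, parses the result
again, and looks at what the root element now contains:

```python
from xml.dom import minidom

# A tiny sample document: <a> has exactly one child, the element <b/>.
original = minidom.parseString("<a><b/></a>")
pretty = original.toprettyxml()

# Parse the pretty-printed text again. The newlines and tabs that
# toprettyxml added come back as ordinary text nodes.
reparsed = minidom.parseString(pretty)
root = reparsed.documentElement

blank_nodes = [
    n for n in root.childNodes
    if n.nodeType == n.TEXT_NODE and not n.data.strip(" \t\r\n")
]

print(len(original.documentElement.childNodes))  # children before
print(len(root.childNodes))                      # children after
print([n.data for n in blank_nodes])             # the added blanks

# Pretty-printing the reparsed tree wraps those blanks in more blanks.
pretty_again = reparsed.toprettyxml()
```

Nothing in the reparsed tree marks those text nodes as coming from the
pretty-printer, which is why the output grows on every round.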

You could write a function that recursively strips all-blank text nodes
from the element tree. Before doing so, I suggest you take a look at
section 2.10 of http://www.w3.org/TR/REC-xml/.
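
A minimal sketch of such a function follows. The name
strip_blank_text_nodes is my own invention, not part of minidom, and the
xml:space handling is a simplification: a subtree marked
xml:space="preserve" is skipped entirely, without looking for a nested
xml:space="default" that would switch stripping back on. It also assumes
that all-blank text nodes are never significant in your documents, which
is exactly the decision section 2.10 asks you to think about.

```python
from xml.dom import minidom, Node

XML_WHITESPACE = " \t\r\n"  # the four characters XML treats as white space


def strip_blank_text_nodes(node):
    """Recursively remove text nodes that contain only XML white space.

    Hypothetical helper, not part of minidom. Modifies the tree in place
    and returns the node it was given.
    """
    # Iterate over a copy, because we remove children while looping.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if not child.data.strip(XML_WHITESPACE):
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Simplification: leave xml:space="preserve" subtrees alone.
            if child.getAttribute("xml:space") != "preserve":
                strip_blank_text_nodes(child)
    return node


doc = minidom.parseString("<a><b>text</b><c/></a>")
once = doc.toprettyxml()

# Strip the pretty-printer's blanks after re-parsing, then print again.
twice = strip_blank_text_nodes(minidom.parseString(once)).toprettyxml()
```

With a current minidom, once and twice should come out identical for a
document like this one. Mixed content (text and elements side by side in
the same parent) is still reformatted by toprettyxml, so this does not
give a conservative round-trip in general.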

RB



