toprettyxml messes up with whitespaces

Jorgen Bodde jorgen.maillist at gmail.com
Wed Oct 3 06:18:45 EDT 2007


Hi Paul,

> This seems like a reasonable explanation without having looked at the
> source code myself.

It's by thorough investigation ;-)

> Which part of the standard is this? Here's the XML 1.0 specification's
> section on whitespace:
>
> http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space

Well, section 2.10, and I quote:

<quote>
Such white space is typically not intended for inclusion in the
delivered version of the document. On the other hand, "significant"
white space that should be preserved in the delivered version is
common, for example in poetry and source code.
</quote>

I interpret "significant" whitespace as the whitespace between words.
If whitespace occurs at the beginning of a line because of an indent, like

<value>
     This is indented text
</value>

we can assume that the spaces in front of the text are not significant
whitespace. If that whitespace is not included when I read the text
node in Python, I see no reason why it should be preserved. And if it
is preserved in the XML DOM, toprettyxml should first check how much
whitespace is already there before adding more to indent the text.
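
A minimal sketch of the situation described above, assuming the
standard-library xml.dom.minidom parser (the sample string and the
variable names are made up for illustration). It shows that the
indentation is kept as part of the text node, so an application that
considers it insignificant has to strip it itself:

```python
from xml.dom import minidom

# Hypothetical sample mirroring the <value> snippet above.
SOURCE = "<value>\n     This is indented text\n</value>"

doc = minidom.parseString(SOURCE)
text_node = doc.documentElement.firstChild

# minidom does not drop the indentation: it is part of the text node.
raw_text = text_node.data

# An application that treats the indent as insignificant strips it itself.
clean_text = raw_text.strip()

print(repr(raw_text))
print(repr(clean_text))
```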

This also happens. At first the nodes are written properly:

<value>
    <a> ... </a>
</value>
<value>
    <a> ... </a>
</value>

When writing back, this sometimes happens (mind the blank lines):

<value>
    <a> ... </a>
</value>

<value>
    <a> ... </a>
</value>

And the next time, the space between the nodes expands again:

<value>
    <a> ... </a>
</value>


<value>
    <a> ... </a>
</value>

(etc.) ... so when reading, modifying, and writing XML files, the
number of blank lines keeps growing with every cycle.
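
The growth can be reproduced with a small sketch, assuming
xml.dom.minidom and a made-up sample document. The helper names
(round_trip, strip_blank_text_nodes, clean_round_trip) are hypothetical
and not part of any library. The second half shows one possible way to
avoid the problem, by removing whitespace-only text nodes before
writing; this is an illustration, not necessarily the approach anyone
in this thread uses:

```python
from xml.dom import minidom


def round_trip(xml_text):
    """Parse xml_text and write it back with toprettyxml."""
    return minidom.parseString(xml_text).toprettyxml(indent="    ")


def strip_blank_text_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_blank_text_nodes(child)


def clean_round_trip(xml_text):
    """Like round_trip, but drop whitespace-only nodes before writing."""
    doc = minidom.parseString(xml_text)
    strip_blank_text_nodes(doc)
    return doc.toprettyxml(indent="    ")


# Hypothetical sample document with no whitespace between the elements.
SOURCE = "<root><value><a>x</a></value><value><a>y</a></value></root>"

first = round_trip(SOURCE)
second = round_trip(first)
third = round_trip(second)
# Each pass re-indents the whitespace left behind by the previous one.
print(len(first.splitlines()), len(second.splitlines()), len(third.splitlines()))

# With the whitespace-only nodes removed, the output stops changing.
stable = clean_round_trip(first)
print(stable == clean_round_trip(stable))
```

Note that stripping every whitespace-only node is only safe when such
nodes carry no meaning for the application.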

> It seems to me that applications (and the libraries which serve them)
> can choose what to do unless xml:space is set to "preserve". It does
> seem odd that the toprettyxml method chooses to respect existing
> whitespace whilst also disrupting it by adding more, however.

I would think (simplistically, I'm sure) that if spaces are that
important, you can always use a CDATA section, which should treat the
text inside as raw data without any formatting or whitespace changes.
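
A short sketch of the CDATA idea, again assuming xml.dom.minidom and a
made-up sample document. As far as I can tell, minidom writes a CDATA
section back without touching its content, although that is a property
of this implementation rather than something the XML specification
promises about whitespace:

```python
from xml.dom import minidom

# Hypothetical sample: the spaces inside the CDATA section matter.
SOURCE = "<root><value><![CDATA[    keep   these   spaces    ]]></value></root>"

doc = minidom.parseString(SOURCE)
pretty = doc.toprettyxml(indent="    ")

# The CDATA section is written back verbatim, spaces included.
print(pretty)
preserved = "<![CDATA[    keep   these   spaces    ]]>" in pretty
print(preserved)
```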

Should I file this as a bug to be solved? I have my workaround now,
but I read online that more people seem to have run into this.

Regards,
- Jorgen


