Hi all
This is not of great importance, but I would like to understand the parser
option 'remove_blank_text' better. I am assuming that 'blank text' refers to
text that contains only whitespace.
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on
win32
>>> from lxml import etree
>>> etree.__version__
'2.3.3'
>>> parser = etree.XMLParser(remove_comments=True, remove_blank_text=True)
>>> test = '<root> <node1> </node1> </root>'
>>> testx = etree.fromstring(test, parser=parser)
>>> etree.tostring(testx)
b'<root><node1> </node1></root>'
>>>
I was expecting to see '<root><node1></node1></root>'.
It seems there is a difference between an element that contains only text
and an element that contains text plus another element. In the first case,
blank text is not discarded; in the second case, it is.
>>> test = '<root><node1> <!-- comment --> </node1></root>'
>>> testx = etree.fromstring(test, parser=parser)
>>> etree.tostring(testx)
b'<root><node1> </node1></root>'
>>>
In this case, it removes the comment, and it removes one of the spaces
around the comment. With further testing, I established that it removes the
space before the comment, but not the one after.
The reason that this cropped up is that I am editing an xml file by hand,
and formatting it in a 'pretty-print' fashion. My program then uses
etree.parse to read in the file and create a tree, using 'remove_comments'
and 'remove_blank_text'. It works fine, but when I printed out a section for
debugging, I was surprised to see that most whitespace is stripped, but this:
<node>
<!-- comment -->
</node>
is stored as '<node>\n </node>', and I was curious to find out why.
Can someone explain the rationale behind these results?
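In the meantime, a post-parse cleanup along the following lines should give
the result I was expecting. It is only a sketch: 'strip_blank_text' is a name
I made up, not part of lxml, and it assumes that comments have already been
removed and that no whitespace-only text in the document is significant
(so it is not safe for mixed content). The fallback import is only there so
that the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this post
except ImportError:
    # Fallback with the same basic Element API, so the sketch still runs.
    import xml.etree.ElementTree as etree


def strip_blank_text(root):
    """Hypothetical helper: drop whitespace-only .text and .tail in place."""
    # iter('*') visits root and every descendant element.
    for el in root.iter('*'):
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root


# Parsed with the default parser; the cleanup does not rely on any
# parser option.
root = strip_blank_text(etree.fromstring('<root> <node1> </node1> </root>'))
print(repr(root[0].text))  # None: the blank text inside node1 is gone
```

This walks the tree once after parsing, so it does not depend on how the
parser itself decides which whitespace counts as blank.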
Thanks
Frank Millman