Trying to understand remove_blank_text
Hi all
This is not of great importance, but I would like to understand the parser option 'remove_blank_text' better. I am assuming that 'blank text' refers to text that contains only whitespace.
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
>>> from lxml import etree
>>> etree.__version__
'2.3.3'
>>> parser = etree.XMLParser(remove_comments=True, remove_blank_text=True)
>>> test = '<root> <node1> </node1> </root>'
>>> testx = etree.fromstring(test, parser=parser)
>>> etree.tostring(testx)
b'<root><node1> </node1></root>'
I was expecting to see '<root><node1></node1></root>' It seems there is a difference between an element that contains only text, and an element that contains text plus another element. In the first case, blank text is not discarded, in the second case it is.
>>> test = '<root><node1> <!-- comment --> </node1></root>'
>>> testx = etree.fromstring(test, parser=parser)
>>> etree.tostring(testx)
b'<root><node1> </node1></root>'
In this case, it removes the comment, and it removes one of the spaces around the comment. With further testing, I established that it removes the space before the comment, but not the one after.
The reason that this cropped up is that I am editing an xml file by hand, and formatting it in a 'pretty-print' fashion. My program then uses etree.parse to read in the file and create a tree, using 'remove_comments' and 'remove_blank_text'. It works fine, but when I printed out a section for debugging, I was surprised to see that most whitespace is stripped, but this -
<node> <!-- comment --> </node>
is stored as '<node>\n </node>', and I was curious to find out why.
Can someone explain the rationale behind these results?
Thanks
Frank Millman
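To see where the surviving whitespace actually lives, it can help to look at each element's .text and .tail instead of the serialised string. The following is only a minimal sketch: `whitespace_report` is a hypothetical helper name, not part of lxml, and the second printout depends on the libxml2 heuristic, so no output is claimed for it here.

```python
from lxml import etree


def whitespace_report(root):
    """Return (tag, text, tail) for each element so blank text shows up in repr()."""
    # tag=etree.Element restricts the iteration to elements (no comments or PIs).
    return [(el.tag, el.text, el.tail) for el in root.iter(tag=etree.Element)]


xml = '<root> <node1> </node1> </root>'

# Default parser: every whitespace run is kept as data.
plain = etree.fromstring(xml)
print(whitespace_report(plain))

# Parser options from the post above; compare which entries are now None.
parser = etree.XMLParser(remove_comments=True, remove_blank_text=True)
stripped = etree.fromstring(xml, parser=parser)
print(whitespace_report(stripped))
```

Comparing the two lists shows which whitespace was dropped (text before a child element, tail after one) and which was kept (the sole text of a leaf element).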
Hi,
Frank Millman, 14.02.2012 10:37:
> This is not of great importance, but I would like to understand the parser option 'remove_blank_text' better. I am assuming that 'blank text' refers to text that contains only whitespace.
Yes.
> Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
> >>> from lxml import etree
> >>> etree.__version__
> '2.3.3'
Note that the lxml version isn't really relevant here. The option is just being passed on as a flag to the parser in libxml2.
> >>> parser = etree.XMLParser(remove_comments=True, remove_blank_text=True)
> >>> test = '<root> <node1> </node1> </root>'
> >>> testx = etree.fromstring(test, parser=parser)
> >>> etree.tostring(testx)
> b'<root><node1> </node1></root>'
> I was expecting to see '<root><node1></node1></root>'
I agree that this may seem surprising. In retrospect, "remove_ignorable_whitespace", although another couple of characters longer, might have been a better name.
> It seems there is a difference between an element that contains only text, and an element that contains text plus another element. In the first case, blank text is not discarded, in the second case it is.
> >>> test = '<root><node1> <!-- comment --> </node1></root>'
> >>> testx = etree.fromstring(test, parser=parser)
> >>> etree.tostring(testx)
> b'<root><node1> </node1></root>'
> In this case, it removes the comment, and it removes one of the spaces around the comment. With further testing, I established that it removes the space before the comment, but not the one after.
Sounds somewhat reasonable.
> The reason that this cropped up is that I am editing an xml file by hand, and formatting it in a 'pretty-print' fashion. My program then uses etree.parse to read in the file and create a tree, using 'remove_comments' and 'remove_blank_text'. It works fine, but when I printed out a section for debugging, I was surprised to see that most whitespace is stripped, but this -
> <node> <!-- comment --> </node>
> is stored as '<node>\n </node>', and I was curious to find out why.
> Can someone explain the rationale behind these results?
The main use case of the option is to remove formatting whitespace, which 'explains' the behaviour in the first case (without the comment). I don't know how the interaction with comments works (see the code for details), but there may be room for improvement there. My guess is that it's some kind of issue with the sequence of SAX events.
For an explanation, the key term here is "ignorable whitespace". There is a certain notion of that in the XML world (e.g. in XSLT); however, the XML specification does not define (or even mention) it, which means that whitespace is always considered data by the XML parsing specification.
Now, when a schema or DTD is involved in the parsing process, it specifically defines which parts of the structure contain data and which do not. This is the place where "ignorable whitespace" becomes well defined.
In your case, you are not using any kind of schema, so the parser can only apply a heuristic to determine which whitespace it can "safely enough" consider "ignorable whitespace" and drop without discarding real data that the user might still be interested in. A heuristic always has the disadvantage of doing the right thing only in some cases and being too conservative (or too aggressive) in others.
Does this help?
Stefan
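If the result Frank expected is what an application needs, one option is to clean up after parsing rather than rely on the parser heuristic. The following is only a sketch under that assumption: `strip_blank_text` and `XML_WHITESPACE` are hypothetical names, not lxml APIs, and the function treats every whitespace-only text or tail as ignorable.

```python
from lxml import etree

XML_WHITESPACE = ' \t\r\n'  # the whitespace characters defined by XML


def strip_blank_text(root):
    """Set whitespace-only .text and .tail to None on every element, in place.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    for el in root.iter(tag=etree.Element):
        if el.text is not None and not el.text.strip(XML_WHITESPACE):
            el.text = None
        if el.tail is not None and not el.tail.strip(XML_WHITESPACE):
            el.tail = None
    return root


parser = etree.XMLParser(remove_comments=True, remove_blank_text=True)
root = etree.fromstring('<root> <node1> </node1> </root>', parser=parser)
# The thread reports b'<root><node1> </node1></root>' at this point.
strip_blank_text(root)
print(etree.tostring(root))
```

After the call, node1 has no text at all and is serialised as an empty element. This is the "too aggressive" end of the trade-off described above: it also discards whitespace that a leaf element may carry as real data, so it only fits documents where blank text never has meaning.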
My curiosity is sufficiently satisfied :-)
Thanks, Stefan
Frank
participants (2)
- Frank Millman
- Stefan Behnel