[lxml-dev] Spacing and the presence of xml:space="preserve"
Hello all,

We are currently using this expression to obtain a plain text version of the content inside a node. For example:

>>> from lxml import etree
>>> string_xpath = etree.XPath("string()")
>>> string_xpath(etree.fromstring("<a> asdf <b/>fdsa </a>"))
' asdf fdsa '

This works great and returns the string as if xml:space="preserve" were set; in other words, spacing is taken verbatim. We work on a file format where some of the spacing is very important (XLIFF). We generate such files with xml:space="preserve" in the necessary places. Not everybody generates such files, unfortunately, so we also need to handle the normalised versions.

If I instead use the XPath function "normalize-space()", I get the normalised spacing:

'asdf fdsa'

but unfortunately it does this even if xml:space="preserve" is set:

>>> string_xpath = etree.XPath("normalize-space()")
>>> string_xpath(etree.fromstring('''<a xml:space="preserve"> asdf <b/>fdsa </a>'''))
'asdf fdsa'
Unfortunately, I don't see a way to get the correct version (normalised by default, but with white-space preserved if xml:space="preserve" is set). Do I have to handle the cases separately, or is there a way for lxml to help me by just doing the right thing? I could special-case on the node, but it would be a bit harder to know whether some xml:space directive was given higher up in the tree.

Or am I missing something in XPath / lxml? Any help would be appreciated.

Friedel Wolff

--
Recently on my blog: http://translate.org.za/blogs/friedel/en/content/video-virtaals-functionalit...
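The special-casing described above (checking the node, then looking higher up in the tree) can be sketched in plain Python. This is only a minimal sketch, not something lxml provides: the helper names effective_xml_space and node_text are hypothetical, and it assumes that the nearest xml:space attribute on the node or one of its ancestors decides the behaviour for the whole node. Descendants that override xml:space further down are not treated separately.

```python
from lxml import etree

# Clark-notation name of the xml:space attribute.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

_verbatim = etree.XPath("string()")
_normalised = etree.XPath("normalize-space()")


def effective_xml_space(node):
    """Return the nearest xml:space value on node or its ancestors.

    Hypothetical helper: falls back to "default" when no ancestor sets it.
    """
    current = node
    while current is not None:
        value = current.get(XML_SPACE)
        if value is not None:
            return value
        current = current.getparent()
    return "default"


def node_text(node):
    """Plain text of node: verbatim under xml:space="preserve", else normalised."""
    if effective_xml_space(node) == "preserve":
        return _verbatim(node)
    return _normalised(node)


# Usage: the directive sits on the parent, the text is taken from the child.
root = etree.fromstring('<r xml:space="preserve"><a> asdf <b/>fdsa </a></r>')
print(repr(node_text(root[0])))
```

Walking up with getparent() keeps the two compiled XPath callables from the original examples and only adds the decision about which one to call.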
F Wolff wrote:
> We are currently using this expression to obtain a plain text version of the content inside a node. For example:
>
> >>> from lxml import etree
> >>> string_xpath = etree.XPath("string()")
> >>> string_xpath(etree.fromstring("<a> asdf <b/>fdsa </a>"))
> ' asdf fdsa '
>
> This works great and returns the string as if xml:space="preserve" were set; in other words, spacing is taken verbatim. We work on a file format where some of the spacing is very important (XLIFF). We generate such files with xml:space="preserve" in the necessary places. Not everybody generates such files, unfortunately, so we also need to handle the normalised versions.
>
> If I instead use the XPath function "normalize-space()", I get the normalised spacing:
>
> 'asdf fdsa'
>
> but unfortunately it does this even if xml:space="preserve" is set:
>
> >>> string_xpath = etree.XPath("normalize-space()")
> >>> string_xpath(etree.fromstring('''<a xml:space="preserve"> asdf <b/>fdsa </a>'''))
> 'asdf fdsa'
>
> Unfortunately, I don't see a way to get the correct version (normalised by default, but with white-space preserved if xml:space="preserve" is set). Do I have to handle the cases separately, or is there a way for lxml to help me by just doing the right thing? I could special-case on the node, but it would be a bit harder to know whether some xml:space directive was given higher up in the tree.
Here is what the XPath 1.0 spec says about normalize-space():

"""
Function: string normalize-space(string?)

The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are the same as those allowed by the S production in XML. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
"""

So there is no reference to "xml:space" that would dictate a specific behaviour, neither for the context node nor for subtrees.

But have you considered writing the required logic in XSLT instead of plain XPath or Python? The "mode" attribute on XSLT's templates should give you all that's needed here, and you'll still end up with a callable that returns a string (built entirely in C space), just a bit smarter this time.

If you do this, please post the stylesheet. I think this might be interesting to others, too.

Stefan
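One possible reading of the "mode" suggestion above, as a rough sketch rather than the stylesheet asked for in the thread. The mode name "preserve", the template layout and the plain_text wrapper are all assumptions made for illustration. The default mode normalises each text node on its own, so spaces at text-node boundaries (for example around an empty child element) are dropped, which is not the same as calling normalize-space() on the whole node.

```python
from lxml import etree

# Hypothetical stylesheet: mode name and template layout are illustrative only.
STYLESHEET = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Default mode: xml:space="preserve" switches to the "preserve" mode. -->
  <xsl:template match="*[@xml:space='preserve']">
    <xsl:apply-templates mode="preserve"/>
  </xsl:template>

  <!-- Default mode: every text node is normalised individually. -->
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>

  <!-- Preserve mode: xml:space="default" switches back to the default mode. -->
  <xsl:template match="*[@xml:space='default']" mode="preserve">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Preserve mode: text is copied verbatim. -->
  <xsl:template match="text()" mode="preserve">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
""")

_transform = etree.XSLT(STYLESHEET)


def plain_text(tree):
    """Run the stylesheet over a whole document and return its text output."""
    return str(_transform(tree))


doc = etree.fromstring(
    '<doc><p>  hello   world  </p>'
    '<p xml:space="preserve">  keep   this  </p></doc>'
)
print(repr(plain_text(doc)))
```

Elements that carry no xml:space attribute fall through to XSLT's built-in rules, which keep processing children in the current mode, so the inherited setting follows the tree downwards without any explicit ancestor lookup.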
participants (2)
- F Wolff
- Stefan Behnel