[lxml-dev] The difference between str(xslt_result) and xslt_result.write()
Hi, is there a difference between the output from lxml.etree._XSLTResultTree.__str__() and lxml.etree._XSLTResultTree.write()? I'm trying to chase down some output whitespace issues, and I'm wondering if there's a difference in how serialisation is handled. Also, str(_XSLTResultTree) is failing for me with errors like "'ascii' codec can't encode character u'\xa9' in position 1608: ordinal not in range(128)" because I have unicode characters in my documents... lxml.etree._XSLTResultTree doesn't define a __unicode__ method so I can't use a unicode coercion to UTF-8 or something like that... All in all though, I'm really enjoying lxml: I spent a long time working with libxml & libxslt's standard python interfaces and they were much more of a pain than lxml! Thanks, Matt -- Matt Patterson | Design & Code <matt at reprocessed org> | http://www.reprocessed.org/
Hi Matt, Matt Patterson wrote:
Is there a difference between the output from lxml.etree._XSLTResultTree.__str__() and lxml.etree._XSLTResultTree.write()?
Yes. str() knows about the output method chosen in the stylesheet (xsl:output), write() doesn't. If you call write(), you will end up with the XML tree serialization you requested in the call arguments. If you call str(), you will get the serialized result you requested in the XSL transform.
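To make the difference concrete, here is a minimal sketch, assuming a recent lxml on Python 3 (where str() returns text rather than the byte string discussed in this thread). The stylesheet, the element names and the variable names are made up for illustration. The stylesheet asks for HTML output, so str() should serialise the line break the HTML way, while write() with its default arguments serialises the same tree as XML.

```python
from io import BytesIO

from lxml import etree

# Illustrative stylesheet: it requests the "html" output method.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <html><body><p>line one<br/>line two</p></body></html>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
result = transform(etree.XML(b"<doc/>"))

# str() honours xsl:output, so the result is serialised as HTML.
via_str = str(result)

# write() ignores xsl:output and serialises the tree as XML,
# using only the arguments passed to the call (defaults here).
buffer = BytesIO()
result.write(buffer)
via_write = buffer.getvalue()

print(via_str)
print(via_write)
```

When chasing whitespace differences, comparing the two outputs side by side like this shows which serialiser is responsible.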
Also, str(_XSLTResultTree) is failing for me with errors like "'ascii' codec can't encode character u'\xa9' in position 1608: ordinal not in range(128)" because I have unicode characters in my documents...
Then you have likely forgotten to set an output encoding in your stylesheet.
lxml.etree._XSLTResultTree doesn't define a __unicode__ method so I can't use a unicode coercion to UTF-8 or something like that...
I don't think __unicode__ would make sense here, given the fact that stylesheets determine the output encoding. Python unicode strings usually have a different encoding than the one you specify in your stylesheet. If you're in doubt, 'UTF-8' is commonly a good choice in lxml, as it's the encoding we use internally.
All in all though, I'm really enjoying lxml: I spent a long time working with libxml & libxslt's standard python interfaces and they were much more of a pain than lxml!
I guess "much more of a pain" is meant in a positive sense here, although it sounds somewhat tainted due to the actual extent to which libxml2's bindings really are a pain... :) Stefan
Hi again, sorry, I was partially mistaken in my last post. You have actually found a bug. Stefan Behnel wrote:
Matt Patterson wrote:
Also, str(_XSLTResultTree) is failing for me with errors like "'ascii' codec can't encode character u'\xa9' in position 1608: ordinal not in range(128)" because I have unicode characters in my documents...
Then you have likely forgotten to set an output encoding in your stylesheet.
Actually, you most likely have /not/ forgotten to do so. lxml was mishandling the case where the output encoding is not compatible with UTF-8. A safe work-around is to always use UTF-8 here, although the bug will be fixed in the next release.
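A minimal sketch of the UTF-8 work-around, assuming a recent lxml on Python 3 (the thread itself predates that, so bytes() here plays the role str() had on Python 2). The stylesheet and the element names are invented for illustration. The stylesheet declares UTF-8 as its output encoding, and the serialised bytes are then decoded by hand.

```python
from lxml import etree

# Illustrative stylesheet: the output encoding is set explicitly to UTF-8.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/">
    <notice><xsl:value-of select="/doc/text()"/></notice>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)

# The source document contains a non-ASCII character (the copyright sign).
source = etree.XML("<doc>\u00a9 2006</doc>")
result = transform(source)

# Serialised result in the encoding requested by the stylesheet.
raw = bytes(result)

# Recoding by hand: decode with the same encoding the stylesheet names.
text = raw.decode("utf-8")

print(raw)
print(text)
```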
lxml.etree._XSLTResultTree doesn't define a __unicode__ method so I can't use a unicode coercion to UTF-8 or something like that...
I don't think __unicode__ would make sense here, given the fact that stylesheets determine the output encoding.
Since this problem is based on a bug, this gets me closer to the point of accepting that __unicode__ makes sense here. Otherwise, there would be no other way to retrieve a unicode string from a stylesheet result - except for recoding by hand after calling str(), which is rather ugly.

The question is how to make this play nicely. We know the requested output encoding from the stylesheet, so when the user calls unicode() on the result, she/he is actually requesting a recoding, which is not always efficient. But then, that's the user's fault.

Another thing is that the serialized result may have an XML encoding declaration. To be correct, we have to remove it in this case, as the encoding information is only correctly provided by the unicode string semantics. This may additionally mean that we have to copy the majority of the string (as unicode objects!).

So, I believe the best solution is to document that UTF-8 is the best choice as an output encoding in that case and otherwise leave it to the Python codecs. If users want to use other encodings that are not supported by Python, they will get a sensible exception automatically.

I changed it on the trunk for now, but if there are any proposals or objections to this, I'd like to hear about them.

Stefan
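The two steps described above (recode the serialised result, then drop the encoding declaration) can be sketched by hand. This is only an illustration, assuming a recent lxml on Python 3. The helper strip_encoding_declaration() and the stylesheet are hypothetical and are not lxml's own implementation.

```python
import re

from lxml import etree


def strip_encoding_declaration(text):
    """Hypothetical helper: remove encoding="..." from a leading XML declaration."""
    return re.sub(
        r'^(<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2',
        r"\1",
        text,
        count=1,
    )


# Illustrative stylesheet with UTF-8 as the requested output encoding.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/">
    <out><xsl:value-of select="/doc/text()"/></out>
  </xsl:template>
</xsl:stylesheet>""")

result = etree.XSLT(stylesheet)(etree.XML("<doc>caf\u00e9</doc>"))

# Step 1: recode, leaving the work to the Python codecs. An encoding
# that Python does not know would raise LookupError at this point.
decoded = bytes(result).decode("utf-8")

# Step 2: the text is no longer in any byte encoding, so the
# encoding declaration is dropped.
text = strip_encoding_declaration(decoded)

print(text)
```

Choosing UTF-8 in the stylesheet keeps step 1 cheap, which is the reason it is recommended as the output encoding here.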
participants (2)
- Matt Patterson
- Stefan Behnel