[lxml-dev] problems with document(''), possibly thread related
I have a stylesheet that uses document('') to reference itself.
The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10
However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it
does not work.
I then wrote a simple test case (thinking.. aha, I'll report this
error), but of course the test case functions correctly.
I've spent 4 hours working on this tonight, I'm pooped, and going nuts.
basically given an xml document whose root element is "<root />"
and a stylesheet that has:
Brad Clements wrote:
I have a stylesheet that uses document('') to reference itself.
The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10
However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it does not work.
Now that I've had some sleep and another hour of google time, I have
been able to recreate the problem in a test program.
The big clue came from this old thread from 2006:
http://article.gmane.org/gmane.comp.python.lxml.devel/1083/match=document
Basically that post makes me think that the document('') problem is
related to base_url passed to fromstring()
In that, when document('') is processed, the base_url is used to look up
the stylesheet's canonical "URL", and then that URL is used to retrieve
the xml document tree that represents the stylesheet.
The problem here is that base_url could be wrong.. It could be the same
value as some other document. In fact, I can recreate the problem by
setting base_url to the same value for both the xml source and the
stylesheet source.
My understanding of the reason for base_url was just so that resolvers
would have a basis for resolving relative lookups. That is certainly how
I use base_url ... as the only mechanism to set the URL that is passed
to my custom resolver.
It seems to me that after spending more than 5 hours trying to
troubleshoot this "problem" with document(''), I'm going to say that
this is a design flaw in lxml. I'm thinking that using base_url as a way
to get back the original stylesheet XML was convenient for the lxml
developers, but has left a big undocumented pitfall for lxml users.
The only documentation I could find on the website about base_url is on
http://codespeak.net/lxml/parsing.html#parsers where no mention is made
about the requirement to NOT use the same base_url for different documents.
Of course, I could be wrong here and I don't want to get anyone upset by
making invalid claims. My test case program is shown below.
When base_url is the same value for both the stylesheet and the xml
document, document('') fails in the stylesheet.
If base_url is different, it works.
--------------- test.py -----------
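# Demonstrate the problem with a self-referencing stylesheet in lxml.
# The problem occurs when base_url is the same for both the stylesheet and
# the xml document.
from lxml import etree


class Resolver(etree.Resolver):
    def __init__(self):
        # Name this class (not etree.Resolver) so the base initializer runs.
        super(Resolver, self).__init__()

    def resolve(self, URL, ID, ctxt):
        # Log every URL that lxml asks this resolver for.
        print("RESOLVE URL %r" % (URL, ))
        # Returning None declines the request and defers to default handling.
        return None
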
stylesheet_src = """<?xml version="1.0"?>
Hi,
Brad Clements wrote:
when document('') is processed, the base_url is used to look up the stylesheet's canonical "URL", and then that URL is used to retrieve the xml document tree that represents the stylesheet.
Yes, it's common to look up a document by its URL. That's an optimisation used by libxslt, too, so if you assign the same URL to different documents, you will run into problems, whether lxml does this or not.
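To make the point about URL-based lookup concrete, here is a toy sketch of a document registry keyed by URL. It is not libxslt's or lxml's actual code; the class `UrlKeyedRegistry`, the helper `lookup_stylesheet`, and the `app://` URLs are all hypothetical and exist only to show why two documents sharing one URL cannot be told apart.

```python
# Toy model only: a document registry keyed by URL.
# This is NOT how libxslt or lxml is implemented; all names are hypothetical.


class UrlKeyedRegistry:
    def __init__(self):
        self._docs = {}

    def register(self, url, doc):
        # The first document registered under a URL wins; later ones are hidden.
        self._docs.setdefault(url, doc)

    def lookup(self, url):
        return self._docs.get(url)


def lookup_stylesheet(source_url, stylesheet_url):
    """Register a source document and a stylesheet, then ask for the stylesheet."""
    registry = UrlKeyedRegistry()
    registry.register(source_url, "<root/>")                # document being transformed
    registry.register(stylesheet_url, "<xsl:stylesheet/>")  # the stylesheet
    return registry.lookup(stylesheet_url)


# Same URL for both documents: the lookup hands back the wrong document.
clash = lookup_stylesheet("app://doc", "app://doc")

# Distinct URLs: the lookup returns the stylesheet as intended.
distinct = lookup_stylesheet("app://doc", "app://style")
```

In this model the clash case returns the source document rather than the stylesheet, which mirrors the symptom described in the thread.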
The problem here is that base_url could be wrong.. It could be the same value as some other document. In fact, I can recreate the problem by setting base_url to the same value for both the xml source and the stylesheet source.
You are deliberately lying to lxml, and still expect it to be so kind to do the right thing regardless?
My understanding of the reason for base_url was just so that resolvers would have a basis for resolving relative lookups. That is certainly how I use base_url ... as the only mechanism to set the URL that is passed to my custom resolver.
Yes, that's one way of using it. Others may use it differently.
this is a design flaw in lxml. I'm thinking that using base_url as a way to get back the original stylesheet XML was convenient for the lxml developers, but has left a big undocumented pitfall for lxml users.
And it's easy to work around by providing unique URLs for each document. If you think the documentation should be improved, please submit a patch.
The only documentation I could find on the website about base_url is on http://codespeak.net/lxml/parsing.html#parsers where no mention is made about the requirement to NOT use the same base_url for different documents.
It sounds to me like the misunderstanding here is largely based on what the "base URL" of a document is. It's the URL that defines the origin of the document. Assuming that you will get the same document when you re-read its URL is not such a stupid idea, IMHO. Otherwise, the XSLT processor would have to re-parse a document each time it encounters a document() reference. That would really hurt performance.
My test case program is shown below, when base_url is the same value for both the stylesheet and the xml document, then document('') fails in the stylesheet. If base_url is different, it works.
I agree that separate documentation paragraphs in the parser documentation, the resolver documentation, and the XSLT documentation would help here. Maybe you can write up something?
Stefan
Stefan Behnel wrote:
You are deliberately lying to lxml, and still expect it to be so kind to do the right thing regardless?
Well, I didn't realize I was lying.. :-(
It sounds to me like the misunderstanding here is largely based on what the "base URL" of a document is. It's the URL that defines the origin of the document. Assuming that you will get the same document when you re-read its URL is not such a stupid idea, IMHO. Otherwise, the XSLT processor would have to re-parse a document each time it encounters a document() reference. That would really hurt performance.
I agree with what you say. However it's a "surprise" to find that document('') is affected this way. document('') is "expected" to always mean "the current stylesheet" no matter what URL you named the stylesheet with.
Could this be improved by having etree.XSLT attach the stylesheet doc to the returned stylesheet object, or is this too hard and tangled up inside libxslt?
Is there any documentation on the internal URL caching mechanism? Is the "cache" shared between parsers? Between threads?
If I use fromstring(base_url="xyz") somewhere, then from a different parser have a stylesheet that does document('xyz'), will my resolver get called, or will the document that was generated by fromstring() be used instead?
How long are documents and their URLs "cached"?
My WSGI code is generating stylesheets "on the fly" based on web requests, so I need to know more about the implementation details of the URL/document caching mechanism.
Thanks
--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM: BKClements
Hi,
Brad Clements wrote:
document('') is "expected" to always mean "the current stylesheet" no matter what URL you named the stylesheet with. Could this be improved by having etree.XSLT attach the stylesheet doc to the returned stylesheet object, or is this too hard and tangled up inside libxslt?
The thing is that when a stylesheet says document(''), libxslt will resolve that URL relative to the stylesheet URL (i.e. replace it with that URL) and then ask lxml about that URL. So the only way to see that the stylesheet was meant is to compare the requested URL to that of the stylesheet. That is identical to the case where you write document("the stylesheet url"). lxml handles this directly without calling a user-provided resolver.
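The resolution step described here can be sketched with Python's standard `urljoin`, used only as a stand-in for the URI resolution libxslt performs (an assumption; libxslt uses its own routine). The stylesheet URL below is hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical base URL assigned to the stylesheet.
stylesheet_url = "http://example.com/styles/main.xsl"

# An empty reference resolves to the base URL itself, so document('')
# ends up as a request for the stylesheet's own URL.
self_ref = urljoin(stylesheet_url, "")

# A relative reference resolves against the same base.
sibling = urljoin(stylesheet_url, "other.xml")
```

Because the empty reference collapses to the stylesheet's URL, a request for document('') is indistinguishable from a request that names that URL explicitly.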
Is there any documentation on the internal URL caching mechanism? Is the "cache" shared between parsers? Between threads?
It's local to a single XSLT call. As long as all documents that participate in your XSL transformation (including the stylesheet itself) have unique URLs, you will be safe.
If I use fromstring(base_url="xyz") somewhere, then from a different parser have a stylesheet that does document('xyz'), will my resolver get called, or will the document that was generated by fromstring() be used instead?
The only document URLs that will not be requested through your resolver are those of the stylesheet and of the document that is being transformed. Everything else will be requested before it is added to the cache.
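A small sketch of that behaviour, assuming a current lxml release: a resolver that records every URL it is asked for, attached to the parser that parses the stylesheet. The `file:///virtual/...` URLs, the `my:` namespace, and the class name `RecordingResolver` are hypothetical; no file is read from disk because the resolver supplies the content as a string.

```python
from lxml import etree


class RecordingResolver(etree.Resolver):
    """Hypothetical resolver that logs requested URLs and serves canned XML."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def resolve(self, system_url, public_id, context):
        self.seen.append(system_url)
        return self.resolve_string("<extra>from resolver</extra>", context)


STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="urn:example:my" exclude-result-prefixes="my">
  <my:label>from stylesheet</my:label>
  <xsl:template match="/">
    <out>
      <own><xsl:value-of select="document('')/*/my:label"/></own>
      <other><xsl:value-of select="document('extra.xml')/extra"/></other>
    </out>
  </xsl:template>
</xsl:stylesheet>"""

style_url = "file:///virtual/styles/main.xsl"   # hypothetical, unique
source_url = "file:///virtual/data/input.xml"   # hypothetical, unique

resolver = RecordingResolver()
parser = etree.XMLParser()
parser.resolvers.add(resolver)

# The resolvers of the stylesheet's parser are the ones consulted by document().
style_root = etree.fromstring(STYLESHEET, parser, base_url=style_url)
source_root = etree.fromstring(b"<root/>", base_url=source_url)

result = etree.XSLT(style_root)(source_root)
own = result.getroot().findtext("own")
other = result.getroot().findtext("other")
```

After the transform, `resolver.seen` should hold the resolved URL for `extra.xml` but not the stylesheet's own URL, since the self-reference is answered without consulting the resolver.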
My WSGI code is generating stylesheets "on the fly" based on web requests, so I need to know more about the implementation details of the URL/document caching mechanism.
Giving each of them a unique base URL should work in any case.
Stefan
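For stylesheets generated on the fly, one way to follow this advice is to mint a fresh base URL per document. The sketch below assumes a current lxml release; the helper `unique_base_url`, the `file:///generated/...` URL layout, and the `my:` namespace are hypothetical choices, not anything lxml prescribes.

```python
import uuid

from lxml import etree


def unique_base_url(kind):
    # Hypothetical helper: a random component keeps URLs from colliding.
    return "file:///generated/%s/%s.xml" % (kind, uuid.uuid4().hex)


STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="urn:example:my" exclude-result-prefixes="my">
  <my:data>hello</my:data>
  <xsl:template match="/">
    <out><xsl:value-of select="document('')/*/my:data"/></out>
  </xsl:template>
</xsl:stylesheet>"""


def run_transform(xml_bytes):
    # The stylesheet and the source each get their own base URL, so
    # document('') can only match the stylesheet.
    style_root = etree.fromstring(
        STYLESHEET, base_url=unique_base_url("stylesheet"))
    source_root = etree.fromstring(
        xml_bytes, base_url=unique_base_url("source"))
    return etree.XSLT(style_root)(source_root)


result = run_transform(b"<root/>")
greeting = result.getroot().text
```

Any scheme that yields distinct URLs would do equally well, for example a per-request counter or a hash of the generated stylesheet text.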
participants (2): Brad Clements, Stefan Behnel