Mailman 3 [lxml-dev] problems with document(''), possibly thread related - lxml - The Python XML Toolkit

[lxml-dev] problems with document(''), possibly thread related

Brad Clements

13 Aug 2008

5:29 a.m.

I have a stylesheet that uses document('') to reference itself. The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10. However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it does not work.

I then wrote a simple test case (thinking.. aha, I'll report this error), but of course the test case functions correctly. I've spent 4 hours working on this tonight, I'm pooped, and going nuts.

basically given an xml document whose root element is "<root />" and a stylesheet that has: From within the threaded wsgi app, the output I get from this is "root", but from the test case and from xsltproc, I get "xsl:stylesheet"

My code is more or less like this:

    ss_parser = etree.XMLParser(load_dtd=True)
    ss_parser.resolvers.add(Resolver())
    stylesheet_doc = etree.fromstring(stylesheet_src, ss_parser, base_url='http://mystylesheet.xsl')
    stylesheet = etree.XSLT(stylesheet_doc)

    doc_parser = etree.XMLParser(load_dtd=True)
    doc_parser.resolvers.add(Resolver())
    xml_doc = etree.fromstring(xml_src, doc_parser, base_url='http://myfile.xml')

however base_url is some real value when called from wsgi, it's threaded, and my Resolver.resolve method does get called in the wsgi app, but not from the test app.

Before I give up, can someone suggest ways in which using lxml from within a threaded app might somehow "break" resolving document(''), but non-threaded it works ok?

I don't think I'm using the same parser object for the stylesheet and xml document, the real wsgi code is a tad complicated. However the stylesheet and xml document should be parsed and used within the same thread (which just happens to not be the main thread).

I believe this works ok on lxml 1.1.2, but I've already updated my code to use 'base_url' and so forth and I'm too worn out to change all that code just to test a theory.

So .. any ideas on what could cause this? thanks for any suggestions..

--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM: BKClements

Brad Clements

14 Aug 2008

2:20 a.m.

New subject: [lxml-dev] problems with document(''), possibly thread related - LXML 'BUG'

Brad Clements wrote:

...

I have a stylesheet that uses document('') to reference itself.

The stylesheet works with xsltproc and xmlstarlet on ubuntu 7.10

However when I use it in a threaded wsgi app with lxml 2.11 or 2.0, it does not work.

Now that I've had some sleep and another hour of google time, I have been able to recreate the problem in a test program. The big clue came from this old thread from 2006: http://article.gmane.org/gmane.comp.python.lxml.devel/1083/match=document

Basically that post makes me think that the document('') problem is related to base_url passed to fromstring(), in that, when document('') is processed, the base_url is used to look up the stylesheet's canonical "URL", and then that URL is used to retrieve the xml document tree that represents the stylesheet.

The problem here is that base_url could be wrong.. It could be the same value as some other document. In fact, I can recreate the problem by setting base_url to the same value for both the xml source and the stylesheet source.

My understanding of the reason for base_url was just so that resolvers would have a basis for resolving relative lookups. That is certainly how I use base_url ... as the only mechanism to set the URL that is passed to my custom resolver.

It seems to me that after spending more than 5 hours trying to troubleshoot this "problem" with document(''), I'm going to say that this is a design flaw in lxml. I'm thinking that using base_url as a way to get back the original stylesheet XML was convenient for the lxml developers, but has left a big undocumented pitfall for lxml users.

The only documentation I could find on the website about base_url is on http://codespeak.net/lxml/parsing.html#parsers where no mention is made about the requirement to NOT use the same base_url for different documents.

Of course, I could be wrong here and I don't want to get anyone upset by making invalid claims.

My test case program is shown below, when base_url is the same value for both the stylesheet and the xml document, then document('') fails in the stylesheet. If base_url is different, it works.
--------------- test.py -----------

    # demonstrate problem with self-reference stylesheet in lxml
    # problem occurs when base_url is the same for both the stylesheet and
    # the xml document.

    from lxml import etree


    class Resolver(etree.Resolver):
        def __init__(self):
            # pass this subclass (not etree.Resolver) to super()
            super(Resolver, self).__init__()

        def resolve(self, URL, ID, ctxt):
            # log every URL lxml asks for, then fall back to default handling
            print("RESOLVE URL %r" % (URL, ))
            return None


    stylesheet_src = """<?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xf="http://www.w3.org/2002/xforms"
        xmlns:const="const.uri"
        version="1.0"
        exclude-result-prefixes="const">
    <data>
      <test>Hi!</test>
    </data>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head> </head>
      <body>
        <div>
          xf model id: <br />
          expected value is: location-selector-model
        </div>
      </body>
    </html>
    """

    xml_src = """<?xml version="1.0"?>
    <root />"""


    def test():
        ss_parser = etree.XMLParser(load_dtd=True)
        ss_parser.resolvers.add(Resolver())
        # same base_url as the xml document below: this triggers the problem
        stylesheet_doc = etree.fromstring(stylesheet_src, ss_parser, base_url='http://myfile.xml')
        stylesheet = etree.XSLT(stylesheet_doc)

        doc_parser = etree.XMLParser(load_dtd=True)
        doc_parser.resolvers.add(Resolver())
        xml_doc = etree.fromstring(xml_src, doc_parser, base_url='http://myfile.xml')

        print("%s" % stylesheet(xml_doc))


    if __name__ == "__main__":
        test()

--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM: BKClements

Stefan Behnel

6:11 p.m.

Hi,

Brad Clements wrote:

...

when document('') is processed, the base_url is used to look up the stylesheet's canonical "URL", and then that URL is used to retrieve the xml document tree that represents the stylesheet.

Yes, it's common to look up a document by its URL. That's an optimisation used by libxslt, too, so if you assign the same URL to different documents, you will run into problems, whether lxml does this or not.

...

The problem here is that base_url could be wrong.. It could be the same value as some other document. In fact, I can recreate the problem by setting base_url to the same value for both the xml source and the stylesheet source.

You are deliberately lying to lxml, and still expect it to be so kind as to do the right thing regardless?

...

My understanding of the reason for base_url was just so that resolvers would have a basis for resolving relative lookups. That is certainly how I use base_url ... as the only mechanism to set the URL that is passed to my custom resolver.

Yes, that's one way of using it. Others may use it differently.

...

this is a design flaw in lxml. I'm thinking that using base_url as a way to get back the original stylesheet XML was convenient for the lxml developers, but has left a big undocumented pitfall for lxml users.

And it's easy to work around by providing unique URLs for each document. If you think the documentation should be improved, please submit a patch.

...

The only documentation I could find on the website about base_url is on http://codespeak.net/lxml/parsing.html#parsers where no mention is made about the requirement to NOT use the same base_url for different documents.

It sounds to me like the misunderstanding here is largely based on what the "base URL" of a document is. It's the URL that defines the origin of the document. Assuming that you will get the same document when you re-read its URL is not that stupid an idea, IMHO. Otherwise, the XSLT processor would have to re-parse a document each time it encounters a document() reference. That would really hurt performance.

...

My test case program is shown below, when base_url is the same value for both the stylesheet and the xml document, then document('') fails in the stylesheet. If base_url is different, it works.

I agree that separate documentation paragraphs in the parser documentation, the resolver documentation, and the XSLT documentation would help here. Maybe you can write up something?

Stefan

Brad Clements

7:07 p.m.

Stefan Behnel wrote:

...

You are deliberately lying to lxml, and still expect it to be so kind as to do the right thing regardless?

Well, I didn't realize I was lying.. :-(

...

It sounds to me like the misunderstanding here is largely based on what the "base URL" of a document is. It's the URL that defines the origin of the document. Assuming that you will get the same document when you re-read its URL is not that stupid an idea, IMHO. Otherwise, the XSLT processor would have to re-parse a document each time it encounters a document() reference. That would really hurt performance.

I agree with what you say. However it's a "surprise" to find that document('') is affected this way. document('') is "expected" to always mean "the current stylesheet" no matter what URL you named the stylesheet with.

Could this be improved by having etree.XSLT attach the stylesheet doc to the returned stylesheet object, or is this too hard and tangled up inside libxslt?

Is there any documentation on the internal URL caching mechanism? Is the "cache" shared between parsers? Between threads?

If I use fromstring(base_url="xyz") somewhere, then from a different parser have a stylesheet that does document('xyz'), will my resolver get called, or will the document that was generated by fromstring be used instead?

How long are documents and their URLs "cached"? My WSGI code is generating stylesheets "on the fly" based on web requests, so I need to know more about the implementation details of the URL/document caching mechanism.

Thanks

--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM: BKClements

Stefan Behnel

15 Aug 2008

6:52 a.m.

Hi,

Brad Clements wrote:

...

document('') is "expected" to always mean "the current stylesheet" no matter what URL you named the stylesheet with. Could this be improved by having etree.XSLT attach the stylesheet doc to the returned stylesheet object, or is this too hard and tangled up inside libxslt?

The thing is that when a stylesheet says document(''), libxslt will resolve that URL relative to the stylesheet URL (i.e. replace it with that URL) and then ask lxml about that URL. So the only way to see that the stylesheet was meant is to compare the requested URL to the one of the stylesheet. That is identical to the case that you say document("the stylesheet url"). lxml handles this directly without calling a user provided resolver.
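The resolution step described above can be illustrated with a small sketch. It uses Python's urllib.parse.urljoin as a stand-in for libxslt's own URI resolution, which is an assumption: the two are only expected to agree for simple cases like this one. The helper name resolve_document_arg is made up for illustration, and the URL is the stylesheet base_url from the first post.

```python
from urllib.parse import urljoin

# base_url given to the stylesheet in the original post
STYLESHEET_URL = "http://mystylesheet.xsl"


def resolve_document_arg(href, stylesheet_url):
    """Resolve the argument of document() against the stylesheet's URL.

    Hypothetical helper: urljoin stands in for the XSLT processor's
    relative-URI resolution.
    """
    return urljoin(stylesheet_url, href)


# document('') : the empty reference resolves to the stylesheet URL itself
self_ref = resolve_document_arg("", STYLESHEET_URL)

# document('http://mystylesheet.xsl') : the explicit form of the same request
explicit = resolve_document_arg("http://mystylesheet.xsl", STYLESHEET_URL)

print(self_ref)
print(explicit)
```

Both calls end up with the same string, so the requested URL alone cannot tell document('') apart from an explicit reference to the stylesheet URL. Any other document carrying that URL is indistinguishable as well.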

...

Is there any documentation on the internal URL caching mechanism? Is the "cache" shared between parsers? Between threads?

It's local to a single XSLT call. As long as all documents that participate in your XSL transformation (including the stylesheet itself) have unique URLs, you will be safe.

...

If I use fromstring(base_url="xyz") somewhere, then from a different parser have a stylesheet that does document('xyz'), will my resolver get called, or will the document that was generated by fromstring be used instead?

The only document URLs that will not be requested through your resolver are the one of the stylesheet and the one of the document that is being transformed. Everything else will be requested before it is added to the cache.
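A toy model may help to picture the lookup rules described in this thread. It is not lxml's or libxslt's implementation; the class name, the fake resolver, and the rule that the input document wins on a URL clash are all assumptions, the last one chosen only because it matches the "root" output reported in the first post.

```python
class ToyTransformContext:
    """Toy model of per-transformation document lookup by URL.

    NOT lxml's or libxslt's code, only a sketch of the behaviour
    described in this thread.
    """

    def __init__(self, style_url, style_doc, input_url, input_doc, resolver):
        # A fresh table is built for every transformation, so nothing is
        # carried over between XSLT calls.
        self._docs = {style_url: style_doc}
        # Assumption: on a URL clash the input document wins, which matches
        # the "root" output reported earlier in the thread.
        self._docs[input_url] = input_doc
        self._resolver = resolver
        self.resolver_calls = []

    def document(self, url):
        """Return the document for url, asking the resolver only if unknown."""
        if url not in self._docs:
            self.resolver_calls.append(url)
            self._docs[url] = self._resolver(url)
        return self._docs[url]


def fake_resolver(url):
    # Hypothetical resolver: pretends to load a document from the URL.
    return "<loaded from %s>" % url


# Unique URLs: the stylesheet URL leads back to the stylesheet.
ok = ToyTransformContext("http://mystylesheet.xsl", "xsl:stylesheet",
                         "http://myfile.xml", "root", fake_resolver)

# Shared URL: the same kind of lookup now finds the input document instead.
clash = ToyTransformContext("http://myfile.xml", "xsl:stylesheet",
                            "http://myfile.xml", "root", fake_resolver)

print(ok.document("http://mystylesheet.xsl"))
print(clash.document("http://myfile.xml"))
```

In this model the resolver is never consulted for the stylesheet URL or the input document URL, and any other URL is requested once and then remembered for the rest of that transformation.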

...

My WSGI code is generating stylesheets "on the fly" based on web requests, so I need to know more about the implementation details of the URL/document caching mechanism.

Giving each of them a unique base URL should work in any case.

Stefan
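For stylesheets generated on the fly, one way to follow this advice is to mint a fresh base URL for every parsed document. The following is a minimal sketch, not a recommended implementation: the helper names unique_base_url and transform and the host generated.invalid are made up for illustration, and only etree.fromstring(..., base_url=...) and etree.XSLT come from the code earlier in the thread.

```python
import uuid


def unique_base_url(kind):
    """Return a base URL that no other document in this process will share.

    'generated.invalid' is a placeholder host (an assumption of this
    sketch); '.invalid' is a reserved TLD, so it cannot name a real resource.
    """
    return "http://generated.invalid/%s/%s.xml" % (kind, uuid.uuid4().hex)


def transform(stylesheet_src, xml_src):
    """Parse both sources with distinct base URLs and run the transformation."""
    # Imported here so that unique_base_url() can be used without lxml.
    from lxml import etree

    stylesheet_doc = etree.fromstring(
        stylesheet_src, base_url=unique_base_url("stylesheet"))
    xml_doc = etree.fromstring(
        xml_src, base_url=unique_base_url("document"))
    return etree.XSLT(stylesheet_doc)(xml_doc)


print(unique_base_url("stylesheet"))
```

If a custom resolver relies on base_url to resolve relative references, as in the first post, a made-up URL like this will not do; in that case the uniqueness would have to be built into the real URL scheme instead.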
