[lxml-dev] lxml 2.0.5 released

Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series.

Have fun,
Stefan

2.0.5 (2008-05-01)

Bugs fixed

* Resolving to a filename in custom resolvers didn't work.
* lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently.
* Memory leak in Schematron with libxml2 >= 2.6.31.

Greetings,

I've been using 2.0 for a while, and today I decided to upgrade to the most recent release, 2.0.7. I ran into a problem and, by binary search (based on the change log) :), traced it to 2.0.5: it is the local file DTD resolver.

This issue originates in http://article.gmane.org/gmane.comp.python.lxml.devel/3499

In some specific cases I have to load the DTD for parsing. Even if I load it from local disk and cache it, parsing takes up to 10 times longer (40ms instead of 4ms). So I came up with the following (ugly) solution:

    class LocalDTDResolver(etree.Resolver):
        def __init__(self, conf):
            self.conf = conf
            self.cached = None

        def resolve(self, url, id, context):
            # resolve the DTD filename once and reuse the result
            if not self.cached:
                self.cached = self.resolve_filename(
                    self.conf + '/vxml.dtd', context)
            return self.cached

    class LxmlUser(...):  # just the relevant snippets
        def __init__(...):
            self.xmlParser = etree.XMLParser(
                no_network=True, resolve_entities=False, load_dtd=False)
            self.resolvingParser = etree.XMLParser(
                no_network=False, resolve_entities=False, load_dtd=True)
            self.resolvingParser.resolvers.add(LocalDTDResolver(local_path))

        def call_parser(self, replies):
            for data in replies:
                if need_resolve:
                    parser = self.resolvingParser
                else:
                    parser = self.xmlParser
                xmlres = etree.parse(StringIO.StringIO(data), parser)

Systems are FreeBSD 6.2/7.0:

    lxml.etree:      (2, 0, 5, 0)
    libxml used:     (2, 6, 30)
    libxml compiled: (2, 6, 30)
    libxslt used:     (1, 1, 22)
    libxslt compiled: (1, 1, 22)

This code runs within mod_python3/apache2.2.8.

Before 2.0.5 I had no problem when the resolvingParser was called. Since 2.0.5 I get this:

    # no call of resolving parser
    [root@machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles
    kern.openfiles: 377

    # after a single (!) call of resolving parser
    [root@machine ~/trunk/fb-ports/py-lxml]$ sysctl kern.openfiles
    kern.openfiles: 11439

And my local DTD file is opened about 11000 times (according to fstat and find -inode).

Am I doing something wrong by coding it this way, or is it a bug?

Cheers,
Dmitri

Hi,

Dmitri Fedoruk wrote:
> I got a problem, and, by binary search (based on change log) :) I found it in 2.0.5 first - it is the local file DTD resolver.

I'll take a look.
Not that ugly, but not very helpful either. You are caching the filename, not the content. Check docloader.pxi to see how simple the machinery is here. There isn't currently a way to return a parsed document from a resolver (and I don't think libxml2 supports that), so I think the best you can do is to return the content as a cached string, thus avoiding I/O but not the parse overhead.
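The suggestion of returning the DTD content as a cached string could look roughly like the sketch below. It assumes Python 3 and a current lxml; `CachedDTDResolver`, `disk_reads`, and `demo` are hypothetical names used for illustration only, and the one-line DTD is a stand-in for the real vxml.dtd. `resolve_string` is the `etree.Resolver` method for handing back in-memory content. This saves the file I/O, but libxml2 still parses the DTD text on every lookup.

```python
import io
import os
import tempfile

from lxml import etree


class CachedDTDResolver(etree.Resolver):
    """Answer every external lookup with one DTD kept in memory."""

    def __init__(self, dtd_path):
        self.dtd_path = dtd_path
        self._content = None  # raw DTD bytes, filled on first use
        self.disk_reads = 0   # bookkeeping for the demo only

    def resolve(self, system_url, public_id, context):
        if self._content is None:
            with open(self.dtd_path, "rb") as dtd_file:
                self._content = dtd_file.read()
            self.disk_reads += 1
        # Hand libxml2 the cached bytes; it still parses them each time.
        return self.resolve_string(self._content, context)


def demo():
    """Parse three documents and report how often the DTD file was read."""
    with tempfile.TemporaryDirectory() as conf_dir:
        # Stand-in DTD (assumption: the real vxml.dtd is not available here).
        dtd_path = os.path.join(conf_dir, "vxml.dtd")
        with open(dtd_path, "w") as dtd_file:
            dtd_file.write("<!ELEMENT vxml (#PCDATA)>")

        resolver = CachedDTDResolver(dtd_path)
        parser = etree.XMLParser(load_dtd=True, resolve_entities=False)
        parser.resolvers.add(resolver)

        doc = b'<!DOCTYPE vxml SYSTEM "vxml.dtd"><vxml>hi</vxml>'
        tags = [etree.parse(io.BytesIO(doc), parser).getroot().tag
                for _ in range(3)]
        return resolver.disk_reads, tags
```

Calling `demo()` parses the same document three times with one parser; `disk_reads` should stay at 1 because only the first lookup touches the file. Like the resolver in the original post, this one answers every lookup with the same DTD, so it only suits documents that reference a single external resource.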
Now that you mention it: are you using the single interpreter option in mod_python or does it work without? I fixed a couple of threading things in 2.0.6, so that should now work without that work-around. But it's still untested due to lack of feedback.
If you are really using the above code then it means that libxml2 is reading the DTD internally. Maybe there's something more we have to clean up, or maybe it's really a leak in libxml2. But the numbers you post here look very unrealistic to me.
> And my local DTD file is opened about 11000 times (according to fstat and find -inode).
If you parse it once, libxml2 should open the DTD file once, and not more. I'll look into that. Stefan

Hi, Dmitri Fedoruk wrote:
When I run the "resolve_filename_dtd" test in test_etree.py and print "lsof | wc -l" directly before and after parsing, I get the same number each time. So I can't see your leak here. Stefan
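A per-process variant of this before/after check can be sketched in Python with the standard library alone. The helper names `open_fd_count` and `fd_delta` are hypothetical, and the sketch assumes a descriptor directory such as `/proc/self/fd` (Linux) or a mounted `/dev/fd` (fdescfs on FreeBSD) is available. Unlike `sysctl kern.openfiles`, which is system-wide, this counts only the descriptors of the current process.

```python
import os


def open_fd_count():
    """Count the descriptors currently open in this process."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # listdir itself holds one descriptor while it runs; that
            # offset is the same on every call, so differences stay valid.
            return len(os.listdir(fd_dir))
    raise OSError("no per-process descriptor directory found")


def fd_delta(action):
    """Run action() and return how many descriptors it left open."""
    before = open_fd_count()
    action()
    return open_fd_count() - before
```

Wrapping the parse call, for example `fd_delta(lambda: etree.parse(source, parser))` with your own `source` and `parser`, should give 0 when nothing leaks; a large positive number would point to the kind of descriptor leak reported in this thread.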

participants (3)
- andersenlabvb@gmail.com
- Dmitri Fedoruk
- Stefan Behnel