Mailman 3 March 2006 - lxml - The Python XML Toolkit

[lxml-dev] 0.9.1 - bug fix release
by Stefan Behnel 31 Mar '06

Hello everyone,

I just released 0.9.1, mainly as a bug fix release. Cheeseshop has the source:

http://cheeseshop.python.org/pypi/lxml

Features added:

* lxml.sax.ElementTreeContentHandler checks closing elements and raises
  SaxError on mismatch
* lxml.sax.ElementTreeContentHandler now supports namespace-less SAX events
  (startElement, endElement) and defaults to empty attributes (keyword
  argument)
* zip_safe flag allows setuptools to install lxml as zipped egg
* Speedup for repeatedly accessing element tag names
* Minor API performance improvements

Bugs fixed:

* Memory deallocation bug: crash when using XSLT output method "html"
* sax.py was handling UTF-8 encoded tag names where it shouldn't
* lxml.tests package will no longer be installed (is still in source tar)

Martijn and I were very happy with the eggs we received for 0.9, so we kindly hope for a similarly overwhelming response to 0.9.1. :)

Have fun,
Stefan
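The two SAX changes in the feature list can be tried out roughly as follows. This is a minimal sketch that assumes a current lxml install in which `lxml.sax` exposes `ElementTreeContentHandler` and `SaxError` under the names used in the announcement; details of the 0.9.1 release itself may differ.

```python
from lxml.sax import ElementTreeContentHandler, SaxError

# Namespace-less events: startElement() is called without an attributes
# argument, relying on the empty default mentioned in the announcement.
handler = ElementTreeContentHandler()
handler.startElement("root")
handler.startElement("child")
handler.characters("text")
handler.endElement("child")
handler.endElement("root")
root = handler.etree.getroot()
print(root.tag, root[0].tag, root[0].text)

# Closing an element other than the one opened last should be rejected.
broken = ElementTreeContentHandler()
broken.startElement("a")
try:
    broken.endElement("b")
except SaxError as exc:
    mismatch_detected = True
    print("SaxError:", exc)
else:
    mismatch_detected = False
```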

Re: [lxml-dev] malloc issues
by Stefan Behnel 29 Mar '06

whit wrote:
> the malloc error returns when I call the my function repeated times in a
> doctest. only one warning this time.

Sorry, I can't reproduce that. Could you send the code that triggers the warning to the list so that I can check it? Or, even better, could you try to cut it down to a simpler test case that shows the same problems?

Stefan

[lxml-dev] malloc issues
by whit 29 Mar '06

after installing the latest egg, I have been having issues with seg faults, bus error and been getting lots of errors like these:
>
> python(300) malloc: *** Deallocation of a pointer not malloced:
> 0x628d30; This could be a double free(), or free() called with the
> middle of an allocated block; Try setting environment variable
> MallocHelp to see tools to help debug
> python(300) malloc: *** error for object 0x629910: double free
> python(300) malloc: *** set a breakpoint in szone_error to debug
>
> below are the style sheet and function that expose the problem.
>
> I'm using:
>
> libxslt 1.1.15
> libxml 2.6.22
> lxml 0.9(trunk)
>
> gcc-4.0 (osx, tiger)
> pyrex (svn from codespeak)
>
> possibly diagnostic and extremely irritating is that I can't back out
> to my previous version of lxml.
>
> -w
>
> ------------------------------------------------------------------------
>
> # ganked from z0pt and sfive
> import os
> from StringIO import StringIO
>
> slug = """ <div><some tag="true">
> <other /> </some>
> </div>
> """
>
> def xstrip(text):
>     """
>     strip out whitespace
>     >>> print xstrip(slug)
>     <div><some tag="true"><other></other></some></div>
>     ...
>     """
>     if not text:
>         return ''
>     from lxml import etree
>     xsltfile = os.path.join(os.path.dirname(__file__), 'strip.xsl')
>     xslt = open(xsltfile)
>     xslt_doc = etree.parse(xslt)
>     style = etree.XSLT(xslt_doc)
>     xslt.close()
>     doc = etree.fromstring(text)
>     result = style(doc)
>     return str(result)
>
> import unittest
> from zope.testing import doctest
> optionflags = doctest.REPORT_ONLY_FIRST_FAILURE | doctest.ELLIPSIS
> def test_suite():
>
>     return unittest.TestSuite((
>         doctest.DocTestSuite('xml', optionflags=optionflags)
>         ))
>
> if __name__=="__main__":
>     unittest.TextTestRunner().run(test_suite())
>
> ------------------------------------------------------------------------
>
> <xsl:stylesheet version='1.0'
>     xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
>   <xsl:output method="html" indent="yes"/>
>   <xsl:strip-space elements="*"/>
>   <xsl:template match="@*|node()">
>     <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
>   </xsl:template>
> </xsl:stylesheet>
>

--
| david "whit" morriss
|
| contact :: http://public.xdi.org/=whit

"If you don't know where you are, you don't know anything at all"
Dr. Edgar Spencer, Ph.D., 1995

"I like to write code like other ppl like to tune their cars or 10kW hifi equipment..."
Christian Heimes, 2004
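One way to cut the report above down to a single self-contained file is to inline the stylesheet and call the function in a plain loop instead of a doctest. The following is only a sketch under assumptions: it targets a current lxml and Python 3 rather than the 0.9 trunk from the report, the names `STRIP_XSL` and `SLUG` are introduced here for illustration, and `indent="yes"` is left out to keep the output simple. On the affected setup the repeated calls were what triggered the malloc warnings; on a fixed build the loop should simply finish.

```python
from lxml import etree

# The stylesheet from the report, inlined so no strip.xsl file is needed.
STRIP_XSL = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

SLUG = b""" <div><some tag="true">
 <other /> </some>
 </div>
"""

def xstrip(text):
    """Strip whitespace-only text nodes by running the identity transform."""
    if not text:
        return ""
    style = etree.XSLT(etree.fromstring(STRIP_XSL))
    doc = etree.fromstring(text)
    return str(style(doc))

# Repeated calls mirror "call my function repeated times" from the thread.
results = [xstrip(SLUG) for _ in range(100)]
print(results[0])
```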

[lxml-dev] HTML parser support
by Stefan Behnel 29 Mar '06

Hi,

I created a branch "htmlparser" (as opposed to the previous "htmlparse") and used it to rewrite the current parser to support both the XML and HTML parser API of libxml2 (file src/lxml/parser.pxi).

Problem: It doesn't work (yet), it crashes.

I cut down the problem to find that it is a problem with the deallocation code. Deallocation of HTML trees (or at least "something" in their representation) seems to be different in libxml2 than for XML. The result is a double free of the document or its nodes - once when releasing an element (attemptDeallocation) and again when releasing the document. This is difficult to debug from Python as both usually happen in one step, when the last element is refcounted. And I still haven't found the actual reason for this.

However, I found that removing the call to "attemptDeallocation" from _NodeBase.__dealloc__ for HTML trees solves it. So, I'm not sure how to handle this. It may mean that we have to handle object deallocation differently depending on the initial parser - which would be very unfortunate. There may also be an additional tweak to be done at parse time, but I wouldn't know what else to try. (Kasimier?)

Anyway, whoever wants to try it, just go ahead. Maybe someone else finds a way to get this to work. For testing, there are a few test cases in test_htmlparser.py. Note that they will crash, so I can't add them to the automated test suite. You have to run them manually:

PYTHONPATH=src python src/lxml/tests/test_htmlparser.py

I left a few debug prints in the source, so don't wonder where the output comes from.

Any input on this is appreciated.

Stefan

[lxml-dev] Callgrind tests
by Stefan Behnel 23 Mar '06

Hello everyone,

another one for the archives. I did a few tests with Callgrind and KCachegrind (if you don't know kcachegrind, install it, you'll love it), as I was suspecting the XPath wrapper to have become slow due to the global function registries.

What I found was:

1) libxml2 performance is heavily bound by malloc calls (not sure if callgrind influences this). The XPath implementation is so incredibly fast that the registration of the /builtin/ XPath functions (xmlXPathRegisterAllFunctions) and the related hash table creation (two xmlHashCreate's per XPath context) were the major bottlenecks in my tests. The overhead added by lxml itself was negligible.

2) string formatting in Python was the other problem. The major bottleneck in tree setup in bench.py was the python function that builds the element names based on loop variables (PyString_Format). Meaning, the bottleneck was /outside/ the tested code this time.

So, the major result is that, for the tested parts, lxml's performance is mainly bound by two factors: Python and libxml2. I guess I can safely assume that the code parts that I checked are pretty much too small an issue to merit any further optimization efforts.

Have fun,
Stefan
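The effect described in point 2 can be illustrated without a profiler by timing the name formatting separately from the tree construction. This is a rough sketch, not the code from bench.py (which is not shown in the thread): `make_names` and `build_tree` are hypothetical helpers, and the standard library's `xml.etree.ElementTree` stands in for lxml so that the example runs anywhere.

```python
import timeit
import xml.etree.ElementTree as ET  # stand-in; the thread benchmarks lxml

COUNT = 500    # elements per tree (arbitrary)
REPEAT = 50    # timing repetitions (arbitrary)

def make_names(count):
    # Benchmark setup: element names built with string formatting.
    return ["tag%03d" % i for i in range(count)]

def build_tree(names):
    # The code that the benchmark is actually meant to measure.
    root = ET.Element("root")
    for name in names:
        ET.SubElement(root, name)
    return root

names = make_names(COUNT)

format_time = timeit.timeit(lambda: make_names(COUNT), number=REPEAT)
build_time = timeit.timeit(lambda: build_tree(names), number=REPEAT)
combined_time = timeit.timeit(
    lambda: build_tree(make_names(COUNT)), number=REPEAT)

print("formatting only: %.4fs" % format_time)
print("tree build only: %.4fs" % build_time)
print("both together:   %.4fs" % combined_time)
```

If the first number is a large share of the third, the benchmark is mostly measuring its own setup, and precomputing the names outside the timed region gives a fairer picture of the library.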

[lxml-dev] Files missing from lxml 0.9 win32
by Pete Forman 23 Mar '06

I downloaded http://carcass.dhs.org/lxml-0.9.win32-py2.4.exe and ran some of its tests. It is missing some files. So far I've individually downloaded test1.rng and test2.rng. test_broken.xml and test_xinclude.xml are next.

They seem to be missing from http://cheeseshop.python.org/packages/2.4/l/lxml/lxml-0.9-py2.4-win32.egg as well. The tgz has the files, I might try installing from that.

--
Pete Forman                   -./\.-  Disclaimer: This post is originated
WesternGeco                   -./\.-  by myself and does not represent
pete.forman(a)westerngeco.com -./\.-  opinion of Schlumberger, Baker
http://petef.port5.com        -./\.-  Hughes or their divisions.
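Since a zipped egg is an ordinary zip archive, a check like the one described above can be scripted. The sketch below is only an illustration: `missing_files` is a hypothetical helper, the file names are the four mentioned in the message, and the member paths in the demo archive are made up because the real layout of the egg is not given in the thread.

```python
import io
import posixpath
import zipfile

# The four data files named in the message above.
WANTED = ("test1.rng", "test2.rng", "test_broken.xml", "test_xinclude.xml")

def missing_files(archive, wanted=WANTED):
    """Return the names in `wanted` that match no member of the archive.

    `archive` may be a path to an .egg file or a file-like object.
    Members are compared by base name only.
    """
    with zipfile.ZipFile(archive) as zf:
        present = {posixpath.basename(name) for name in zf.namelist()}
    return [name for name in wanted if name not in present]

# Demo: an in-memory archive stands in for a downloaded egg.
# The member paths below are assumptions made for this example.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("lxml/tests/test_etree.py", "")
    zf.writestr("lxml/tests/test1.rng", "<grammar/>")
demo.seek(0)

missing = missing_files(demo)
print(missing)
```

For a real download, pass the path of the `.egg` file instead of the in-memory buffer.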

Re: [lxml-dev] Callgrind tests
by Stefan Behnel 23 Mar '06

Hi Steve,

Steve Howe wrote:
> Wednesday, March 22, 2006, 1:38:50 PM, you wrote:
>> 2) string formatting in Python was the other problem. The major bottleneck in
>> tree setup in bench.py was the python function that builds the element names
>> based on loop variables (PyString_Format). Meaning, the bottleneck was
>> /outside/ the tested code this time.
>
> I wonder if running the same tests on cElementTree would point similar
> results in what concerns to the Python function calls.

Go ahead, try, using KCachegrind is pure fun! :)

> Do you have any results (or impressions) on this ?

I didn't check, but I don't think it suffers so much from Python performance. As Fredrik said, cElementTree builds Python objects on the way in, so all you should see when /accessing/ data is Python's call overhead rather than any substantial calculations.

I think that's totally the right optimization, but it is difficult to do something similar in lxml, since we also get entire trees from the parser. It wouldn't be a good idea to traverse them to build Python objects - we don't even know if they would be used.

All we could do is cache Python objects once they were built. The Proxy mechanism would be the right place to keep references to text and tag objects. Also, you could change the current way Python element proxies are deallocated to keep them alive as long as any of them is really used. But that's non-trivial.

Anyway, to make me implement that, I would really have to be convinced that it's worth it - and I absolutely don't see enough of a speed-up behind these optimizations to encourage such a huge effort. Especially the text and tag properties are bound by call overhead, not by object creation time.

Stefan
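The caching idea discussed above (build the Python object on first access, then keep a reference on the proxy) can be sketched in plain Python. This is purely illustrative and not how lxml is implemented: `CNode` and `ElementProxy` are hypothetical stand-ins for a C-level node and its Python proxy, and invalidating the cache when the tag is changed is left out.

```python
class CNode:
    """Stand-in for a C-level tree node that stores its name as bytes."""

    def __init__(self, raw_name):
        self.raw_name = raw_name
        self.decode_count = 0  # how often a Python string had to be built

    def decode_name(self):
        self.decode_count += 1
        return self.raw_name.decode("utf-8")


class ElementProxy:
    """Python-side proxy that caches the tag string once it has been built."""

    def __init__(self, node):
        self._node = node
        self._tag = None  # nothing is built until the tag is first read

    @property
    def tag(self):
        if self._tag is None:
            self._tag = self._node.decode_name()
        return self._tag


node = CNode(b"item")
proxy = ElementProxy(node)
tags = [proxy.tag for _ in range(1000)]
print(tags[0], node.decode_count)
```

The string is created only on the first access; every later access still pays for the property call, which matches the remark that the tag and text properties are bound by call overhead rather than by object creation.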

[lxml-dev] eggs
by Martijn Faassen 22 Mar '06

Hi there,

I've uploaded lxml 0.9 eggs for both Windows (thanks Steve Howe) and Mac OS X (thanks Georges Racinet) to the Python cheeseshop. The source is there now too:

http://cheeseshop.python.org/pypi/lxml/0.9

Thanks everybody!

Oh, we should update INSTALL.txt to have a link to the cheeseshop as well.

Regards,
Martijn

[lxml-dev] lxml 0.9 on MacOSX
by Georges Racinet 21 Mar '06

Hi,

I just built lxml 0.9 on my OS X machine and made an egg:

total 712
-rw-r--r--  1 gracinet  wheel  363037 Mar 21 12:48 lxml-0.9-py2.4-macosx-10.4-ppc.egg

How do you want me to send it?

I didn't really try the package yet, but I ran the tests:

$ make test
python242 setup.py build_ext -i
running build_ext
python242 test.py -p -v
230/302 ( 76.2%): Doctest: extensions.txt
----------------------------------------------------------------------
Ran 230 tests in 1.594s

OK

Is it really normal to run only 230 of them?

Additional info: this is a G5 machine running OS X.4, with a fink (http://fink.sourceforge.net) install on top of the base system

$ xslt-config --version
1.1.14

I wonder if the egg could be used on a vanilla OSX machine (dynamic libs?). Here's what I've got from fink:

$ ls /sw/lib/libxml2.*
/sw/lib/libxml2.2.6.20.dylib  /sw/lib/libxml2.dylib
/sw/lib/libxml2.2.dylib       /sw/lib/libxml2.la
/sw/lib/libxml2.a

---------
Georges Racinet
Nuxeo SAS
gracinet(a)nuxeo.com
http://nuxeo.com
Tel: +33 (0) 1 40 33 71 73

[lxml-dev] lxml 0.9 Win32 build
by Steve Howe 21 Mar '06

Hello all,

Here is a contribution of a Win32 lxml 0.9 binary build for Python 2.4:

http://carcass.dhs.org/lxml-0.9.win32-py2.4.exe

There are *no* libxml/libxslt dlls on purpose. Those who need these libraries, please refer to:

http://www.zlatkovic.com/libxml.en.html

Thanks Martijn, Stefan and all involved in the development of lxml.

--
Best regards,
Steve
mailto:howe@carcass.dhs.org
