[lxml-dev] lxml 2.0 beta1 released
Hi all,

I finally managed to push lxml 2.0beta1 over to PyPI. This release marks the
end of the four month alpha cycle of lxml 2.0. The last stable release series,
lxml 1.3, saw the light of day more than six months ago.

http://codespeak.net/lxml/dev/
http://pypi.python.org/pypi/lxml/2.0beta1

The complete changelog for beta1 and the 2.0 alpha series follows below. Apart
from a number of important fixes and enhancements, this beta release also
finalises the major API changes that make the difference between 1.x and 2.x.
Incompatible changes after this release will require a very good motivation.
As usual, compatible enhancements will always be embraced - as will be
updates, clarifications and fixes for the documentation! When in doubt, just
ask.

I expect beta1 to also be the last beta release before lxml 2.0 final
(hopefully not in the sense that alpha4/5/6 were), so please test as much as
you can to spot any remaining bugs and problems.

Note that this release depends on a bug fix in Cython that will hopefully be
released as Cython 0.9.6.11 in a couple of days. I attached the necessary
patch for those who want to work on the sources.

Another thing: there was a security advisory on the libxml2 mailing list. To
prevent DoS attacks, systems that parse XML from untrusted sources should be
updated to libxml2 2.6.31 (or should apply the patch that is referenced in
Daniel's post below).

http://mail.gnome.org/archives/xml/2008-January/msg00036.html

Sidnei, when you build the Windows binaries, could you please wait for
libxml2 2.6.31 to become available as binaries as well? Hopefully, that won't
take too long...

Have fun,
Stefan


2.0beta1 (2008-01-11)
=====================

Features added
--------------

* Parse-time XML schema validation (``schema`` parser keyword).

* XPath string results of the ``text()`` function and attribute selection
  make their Element container accessible through a ``getparent()`` method.
  As a side-effect, they are now always unicode objects (even ASCII strings).
* ``XSLT`` objects are usable in any thread - at the cost of a deep copy if
  they were not created in that thread.

* Invalid entity names and character references will be rejected by the
  ``Entity()`` factory.

* ``entity.text`` returns the textual representation of the entity,
  e.g. ``&amp;``.

Bugs fixed
----------

* XPath on ElementTrees could crash when selecting the virtual root node of
  the ElementTree.

* Compilation ``--without-threading`` was buggy in alpha5/6.

Other changes
-------------

* Minor performance tweaks for Element instantiation and subelement creation


2.0alpha6 (2007-12-19)
======================

Features added
--------------

* New properties ``position`` and ``code`` on ParseError exception (as in
  ET 1.3)

Bugs fixed
----------

* Memory leak in the ``parse()`` function.

* Minor bugs in XSLT error message formatting.

* Result document memory leak in target parser.

Other changes
-------------

* Various places in the XPath, XSLT and iteration APIs now require
  keyword-only arguments.

* The argument order in ``element.itersiblings()`` was changed to match the
  order used in all other iteration methods. The second argument
  ('preceding') is now a keyword-only argument.

* The ``getiterator()`` method on Elements and ElementTrees was reverted to
  return an iterator as it did in lxml 1.x. The ET API specification allows
  it to return either a sequence or an iterator, and it traditionally
  returned a sequence in ET and an iterator in lxml. However, it is now
  deprecated in favour of the ``iter()`` method, which should be used in new
  code wherever possible.

* The 'pretty printed' serialisation of ElementTree objects now inserts
  newlines at the root level between processing instructions, comments and
  the root tag.

* A 'pretty printed' serialisation is now terminated with a newline.

* Second argument to ``lxml.etree.Extension()`` helper is no longer
  required, third argument is now a keyword-only argument ``ns``.

* ``lxml.html.tostring`` takes an ``encoding`` argument.
2.0alpha5 (2007-11-24)
======================

Features added
--------------

* Rich comparison of ``element.attrib`` proxies.

* ElementTree compatible TreeBuilder class.

* Use default prefixes for some common XML namespaces.

* ``lxml.html.clean.Cleaner`` now allows for a ``host_whitelist``, and two
  overridable methods: ``allow_embedded_url(el, url)`` and the more general
  ``allow_element(el)``.

* Extended slicing of Elements as in ``element[1:-1:2]``, both in etree and
  in objectify.

* Resolvers can now provide a ``base_url`` keyword argument when resolving a
  document as string data.

* When using ``lxml.doctestcompare`` you can give the doctest option
  ``NOPARSE_MARKUP`` (like ``# doctest: +NOPARSE_MARKUP``) to suppress the
  special checking for one test.

Bugs fixed
----------

* Target parser failed to report comments.

* In the ``lxml.html`` ``iter_links`` method, links in ``<object>`` tags
  weren't recognized. (Note: plugin-specific link parameters still aren't
  recognized.) Also, the ``<embed>`` tag, though not standard, is now
  included in ``lxml.html.defs.special_inline_tags``.

* Using custom resolvers on XSLT stylesheets parsed from a string could
  request ill-formed URLs.

* With ``lxml.doctestcompare`` if you do ``<tag xmlns="...">`` in your
  output, it will then be namespace-neutral (before the ellipsis was treated
  as a real namespace).

Other changes
-------------

* The module source files were renamed to "lxml.*.pyx", such as
  "lxml.etree.pyx". This was changed for consistency with the way Pyrex
  commonly handles package imports. The main effect is that classes now know
  about their fully qualified class name, including the package name of
  their module.

* Keyword-only arguments in some API functions, especially in the parsers
  and serialisers.
2.0alpha4 (2007-10-07)
======================

Features added
--------------

Bugs fixed
----------

* AttributeError in feed parser on parse errors

Other changes
-------------

* Tag name validation in lxml.etree (and lxml.html) now distinguishes
  between HTML tags and XML tags based on the parser that was used to parse
  or create them. HTML tags no longer reject any non-ASCII characters in tag
  names but only spaces and the special characters ``<>&/"'``.


2.0alpha3 (2007-09-26)
======================

Features added
--------------

* Separate ``feed_error_log`` property for the feed parser interface. The
  normal parser interface and ``iterparse`` continue to use ``error_log``.

* The normal parsers and the feed parser interface are now separated and can
  be used concurrently on the same parser instance.

* ``fromstringlist()`` and ``tostringlist()`` functions as in ElementTree 1.3

* ``iterparse()`` accepts an ``html`` boolean keyword argument for parsing
  with the HTML parser (note that this interface may be subject to change)

* Parsers accept an ``encoding`` keyword argument that overrides the
  encoding of the parsed documents.

* New C-API function ``hasChild()`` to test for children

* ``annotate()`` function in objectify can annotate with Python types and
  XSI types in one step. Accompanied by ``xsiannotate()`` and
  ``pyannotate()``.

Bugs fixed
----------

* XML feed parser setup problem

* Type annotation for unicode strings in ``DataElement()``

Other changes
-------------

* lxml.etree now emits a warning if you use XPath with libxml2 2.6.27 (which
  can crash on certain XPath errors)

* Type annotation in objectify now preserves the already annotated type by
  default to prevent losing type information that is already there.
2.0alpha2 (2007-09-15)
======================

Features added
--------------

* ``ET.write()``, ``tostring()`` and ``tounicode()`` now accept a keyword
  argument ``method`` that can be one of 'xml' (or None), 'html' or 'text'
  to serialise as XML, HTML or plain text content.

* ``iterfind()`` method on Elements returns an iterator equivalent to
  ``findall()``

* ``itertext()`` method on Elements

* Setting a QName object as value of the .text property or as an attribute
  will resolve its prefix in the respective context

* ElementTree-like parser target interface as described in
  http://effbot.org/elementtree/elementtree-xmlparser.htm

* ElementTree-like feed parser interface on XMLParser and HTMLParser
  (``feed()`` and ``close()`` methods)

Bugs fixed
----------

* lxml failed to serialise namespace declarations of elements other than the
  root node of a tree

* Race condition in XSLT where the resolver context leaked between
  concurrent XSLT calls

Other changes
-------------

* ``element.getiterator()`` returns a list, use ``element.iter()`` to
  retrieve an iterator (ElementTree 1.3 compatible behaviour)


2.0alpha1 (2007-09-02)
======================

Features added
--------------

* Reimplemented ``objectify.E`` for better performance and improved
  integration with objectify. Provides extended type support based on
  registered PyTypes.

* XSLT objects now support deep copying

* New ``makeSubElement()`` C-API function that allows creating a new
  subelement straight with text, tail and attributes.

* XPath extension functions can now access the current context node
  (``context.context_node``) and use a context dictionary
  (``context.eval_context``) from the context provided in their first
  parameter

* HTML tag soup parser based on BeautifulSoup in ``lxml.html.ElementSoup``

* New module ``lxml.doctestcompare`` by Ian Bicking for writing simplified
  doctests based on XML/HTML output. Use by importing ``lxml.usedoctest`` or
  ``lxml.html.usedoctest`` from within a doctest.
* New module ``lxml.cssselect`` by Ian Bicking for selecting Elements with
  CSS selectors.

* New package ``lxml.html`` written by Ian Bicking for advanced HTML
  treatment.

* Namespace class setup is now local to the ``ElementNamespaceClassLookup``
  instance and no longer global.

* Schematron validation (incomplete in libxml2)

* Additional ``stringify`` argument to ``objectify.PyType()`` takes a
  conversion function to strings to support setting text values from
  arbitrary types.

* Entity support through an ``Entity`` factory and element classes. XML
  parsers now have a ``resolve_entities`` keyword argument that can be set
  to False to keep entities in the document.

* ``column`` field on error log entries to accompany the ``line`` field

* Error specific messages in XPath parsing and evaluation. NOTE: for
  evaluation errors, you will now get an XPathEvalError instead of an
  XPathSyntaxError. To catch both, you can except on ``XPathError``

* The regular expression functions in XPath now support passing a node-set
  instead of a string

* Extended type annotation in objectify: new ``xsiannotate()`` function

* EXSLT RegExp support in standard XPath (not only XSLT)

Bugs fixed
----------

* lxml.etree did not check tag/attribute names

* The XML parser did not report undefined entities as error

* The text in exceptions raised by XML parsers, validators and XPath
  evaluators now reports the first error that occurred instead of the last

* Passing '' as XPath namespace prefix did not raise an error

* Thread safety in XPath evaluators

Other changes
-------------

* objectify.PyType for None is now called "NoneType"

* ``el.getiterator()`` renamed to ``el.iter()``, following ElementTree 1.3 -
  original name is still available as alias

* In the public C-API, ``findOrBuildNodeNs()`` was replaced by the more
  generic ``findOrBuildNodeNsPrefix()``

* Major refactoring in XPath/XSLT extension function code

* Network access in parsers disabled by default
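To make the 2.0beta1 XPath change concrete: string results of ``text()`` and
attribute selection are now "smart strings" that remember their Element
container. A minimal sketch (document contents invented for illustration):

```python
from lxml import etree

root = etree.XML('<doc><title lang="en">lxml</title></doc>')

# text() results are unicode strings that carry a getparent() method
title = root.xpath('//title/text()')[0]
assert title == 'lxml'
assert title.getparent().tag == 'title'

# the same applies to attribute selections
lang = root.xpath('//title/@lang')[0]
assert lang == 'en'
assert lang.getparent().tag == 'title'
```

Since these results compare equal to plain strings, existing code keeps
working; the extra method only matters when you need to find your way back
into the tree.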
On Jan 11, 2008 1:42 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Sidnei, when you build the Windows binaries, could you please wait for libxml2 2.6.31 to become available as binaries as well? Hopefully, that won't take too long...
Note taken!

--
Sidnei da Silva
Enfold Systems          http://enfoldsystems.com
Fax +1 832 201 8856
Office +1 713 942 2377 Ext 214
Hello,
I finally managed to push lxml 2.0beta1 over to PyPI. This release marks the end of the four month alpha cycle of lxml 2.0. The last stable release series, lxml 1.3, saw the light of day more than six months ago.
That's great!

Speaking about your last proposal about the apache crash, where you suggested
getting the svn snapshot - is the needed code in this build already? I've
applied the patch for xslt (which is obviously covered by the "``XSLT``
objects are usable in any thread" feature) to the 1.3.5 version of the
library. I'll upgrade all my machines to the new build and see what happens :)

Cheers,
Dmitri
Hi list,

In the documentation it says that lxml can automatically detect and process
gzipped xml (.gz). I'm sure (though I haven't tried) that this works when
parsing from a file with the appropriate extension, but is it possible from
an in-memory string?

My situation: I have a berkeley db based storage system which maintains
gzipped xml. I currently just use python's gzip module to uncompress before
sending to lxml, but if I could skip this step I'm sure there'd be good
performance benefits. I've looked for how to do this but had no luck. Perhaps
a parser option?

Thanks!
Rob
Hi, Dr R. Sanderson wrote:
In the documentation it says that lxml can automatically detect and process gzipped xml (.gz). Which I'm sure (but haven't tried) works when it's parsing from a file with the appropriate extension, but is this possible from an in memory string?
My situation: I have a berkeley db based storage system which maintains gzipped xml. I currently just use python's gzip module to uncompress before sending to lxml, but if I could skip this step I'm sure there'd be good performance benefits.
Yes, I recently thought about that, too, mainly in the context of pickling.

http://comments.gmane.org/gmane.comp.gnome.lib.xml.general/14465

It would be something to implement, though, as the support in libxml2 is
restricted to files. Supporting this for in-memory data isn't that hard, but
it would require writing a callback-driven filter for a libxml2 I/O output
buffer: buffer what gets written, compress it, write it out to the next
output buffer. Not hard, but not entirely trivial either.

Stefan
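Until something like that exists, the workaround Rob describes stays a
two-liner; here is a minimal sketch in modern Python (``gzip.compress`` /
``gzip.decompress`` are the current stdlib spelling; 2008-era code would go
through ``StringIO`` and ``gzip.GzipFile``), with invented sample data:

```python
import gzip

from lxml import etree

# simulate a gzipped XML blob as it might come out of the storage layer
raw = '<records><rec id="1">hello</rec></records>'
blob = gzip.compress(raw.encode('utf-8'))

# decompress in memory, then parse -- this is the extra step that
# direct gzip support in the parser would make unnecessary
data = gzip.decompress(blob)
root = etree.fromstring(data)
assert root[0].get('id') == '1'
```

The decompression itself is cheap; the cost is mostly the extra copy of the
uncompressed document in Python memory before libxml2 ever sees it.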
Back in May, Stefan wrote:
[but yes, there will be lxml for Python 3, and pretty soon]
Any news on the Py3k front? (I'm in the process of scoping out just how hard
it's going to be to update our code to Python 3000, starting with the
dependencies)

Many Thanks!
Rob
Hi, Dr R. Sanderson wrote:
Back in May, Stefan wrote:
[but yes, there will be lxml for Python 3, and pretty soon]
Any news on the Py3k front?
It's there in general, so you can compile lxml under Py3 and run your code
against it for pure testing purposes.

However, due to changes in Py3.0 beta2, you can get crashes in the exception
handling code that Cython generates. There seem to be slight changes in the
way exceptions interact with the frame cleanup in Py3 now. And Cython does
not use frames at all but emulates them, apparently not well enough for the
latest Py3 beta... I'm working on fixing this, but I don't know when this
will be done. It may take a couple of weeks, and will require a new source
release of 2.1.x.

Stefan
[but yes, there will be lxml for Python 3, and pretty soon] Any news on the Py3k front?
It's there in general, so you can compile lxml under Py3 and run your code against it for pure testing purposes.
Fantastic :) And the thinko that was causing my problem is that fromstring()
is all lowercase, not fromString(). Duh. I haven't run into any of the
crashes yet.
However, due to changes in Py3.0 beta2, you can get crashes in the exception [...] I'm working on fixing this, but I don't know when this will be done. It may take a couple of weeks, and will require a new source release of 2.1.x.
No problem! Many thanks for the prompt reply, Rob
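For anyone else tripping over the same thinko: the ElementTree-compatible API
is all lowercase, as a quick check confirms:

```python
from lxml import etree

# the correct, all-lowercase spelling
root = etree.fromstring('<greeting who="world"/>')
assert root.tag == 'greeting'
assert root.get('who') == 'world'

# the camel-cased name does not exist in the module
assert not hasattr(etree, 'fromString')
```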
Hi all,

I'm working on a script to replicate the problem, but in short: using 2.1.2
or more recent results in no memory being freed when parsing multiple
documents in quick succession. The changelog says there was a memory issue
fixed, so perhaps this introduced the bug at the same time?

I've seen (but not consistently) the "lxml memory allocation failed: growing
buffer" message. Normally it just runs my machine out of memory.

Rob
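A stripped-down, self-contained version of the pattern (parse a document,
run an XPath, drop everything, repeat) can be sketched without any storage
layer; the document shape, sizes and counts below are invented purely to
exercise the loop:

```python
import gc

from lxml import etree

# a synthetic document as a stand-in for the records Rob mentions
doc = b'<issuemap>' + b'<rec><year>2008</year></rec>' * 5000 + b'</issuemap>'

parsed = 0
for _ in range(100):
    tree = etree.XML(doc)
    years = tree.xpath('/issuemap/rec/year/text()')
    del tree, years
    parsed += 1

# with the reported leak, process RSS would keep growing even after this
gc.collect()
assert parsed == 100
```

Watching the process's RSS (e.g. via ``ps``) across iterations of a loop
like this is one way to separate a real leak in the library from Python-level
references keeping trees alive.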
The actual code is below, but I've got it so that it inflates Very Quickly...

[cheshire@edhellond jstor]$ ./memory.py
UID        PID  PPID  C     SZ     RSS PSR STIME TTY          TIME CMD
cheshire  1778  1154  0   5861   14204   1 14:25 pts/2    00:00:00 /home/cheshire/install/bin/python -i ./memory.py
0
cheshire  1778  1154 99  20753   73820   1 14:25 pts/2    00:00:01 /home/cheshire/install/bin/python -i ./memory.py
238
cheshire  1778  1154 99 140239  551556   1 14:25 pts/2    00:00:08 /home/cheshire/install/bin/python -i ./memory.py
483
cheshire  1778  1154 99 245972  974616   1 14:25 pts/2    00:00:14 /home/cheshire/install/bin/python -i ./memory.py
734
cheshire  1778  1154 99 319488 1268656   1 14:25 pts/2    00:00:24 /home/cheshire/install/bin/python -i ./memory.py
1269

E.g., after parsing 1269 documents (on average 250k each) it's using a total
of 1.5 gigabytes of memory. This also happens in 2.1.1.

I've used guppy/hpy to check that it's not python level code. Putting an
hp.heap() call in the loop shows the only difference to be the for loop's
frame, per iteration. The actual production code works in 2.1.1, but has a
lot more xpaths and then a serialization phase in the loop as well.
Code, with comments:
----------------------------
def build_journal(jrnl):
    global nparse
    # Search for journal descriptions
    q = parse('c3.idx-id-journal exact "%s"' % jrnl)
    rs = db.search(session, q)
    # step through matches
    for rsi in rs:
        nparse += 1
        # fetch record out of storage, use etree.XML(data) to parse
        rec = rsi.fetch_record(session)
        # process_xpath passes through directly to node.xpath()
        try:
            year = rec.process_xpath(session, '/issuemap/issue-meta/numerations/pub-date/year/text()')[0]
            month = rec.process_xpath(session, '/issuemap/issue-meta/numerations/pub-date/month/text()')[0]
            day = rec.process_xpath(session, '/issuemap/issue-meta/numerations/pub-date/day/text()')[0]
        except:
            rsi._ymd = (0, 0, 0)
            del rec
            continue
        rsi._ymd = (year, month, day)
        del rec
    # sort list based on date
    rs._list.sort(key=lambda x: x._ymd)
    del rs

nparse = 0
# scan through all journal identifiers
q = parse('c3.idx-id-journal exact ""')
jids = db.scan(session, q, 1000000)
# get OS memory usage stats
pid = os.getpid()
cmd = "ps -F -p %s" % pid
print commands.getoutput(cmd)
print nparse
# and try to build
for j in jids[100:]:
    build_journal(j[0])
    print commands.getoutput(cmd).split('\n')[1]
    print nparse
----------------------------------------

Help?
Rob

On Mon, 22 Dec 2008, Dr R. Sanderson wrote:
Hi all,
I'm working on a script to replicate it, but using 2.1.2 or more recent results in not freeing any memory when parsing multiple documents in quick succession. The changelog says there was a memory issue fixed, so perhaps this introduced the bug at the same time?
I've seen (but not consistently) the lxml memory allocation failed: growing buffer message. Normally it just runs my machine out of memory.
Rob

_______________________________________________
lxml-dev mailing list
lxml-dev@codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
participants (4)

- Dmitri Fedoruk
- Dr R. Sanderson
- Sidnei da Silva
- Stefan Behnel