Mailman 3 May 2008 - lxml - The Python XML Toolkit

[lxml-dev] lxml has its page on launchpad
by Stefan Behnel 11 Apr '23

Hi all,

I added the lxml project to Launchpad, the Ubuntu bug tracker. It also has a FAQ engine and a couple of other goodies.

https://launchpad.net/lxml

It's easy to sign up for Launchpad, BTW, no 90%-footnotes contract.

Have fun,
Stefan

[lxml-dev] Reparenting a node
by Lawrence Oluyede 30 Jan '23

I have a doc A and a doc B. I'd like to put a node extracted from A into document B, but I always get a ValueError:

ValueError: Element is not a child of this node.

I didn't find any "setparent" in the API. How can I do this?

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair
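For readers with the same question, here is a minimal sketch of one way to move a node between documents. It relies on the lxml.etree behaviour that appending an element which already has a parent moves it instead of copying it. The sample documents and variable names are made up for illustration, and the sketch does not try to reproduce the call that raised the ValueError above.

```python
import copy

from lxml import etree

# Two independent documents (illustrative content only).
doc_a = etree.fromstring("<a><item>payload</item><other/></a>")
doc_b = etree.fromstring("<b><slot/></b>")

# append() moves the element: it is unlinked from doc A and linked into doc B.
node = doc_a.find("item")
doc_b.append(node)
moved_parent_tag = node.getparent().tag
remaining_in_a = [child.tag for child in doc_a]

# To leave doc A untouched, append a deep copy instead of the element itself.
clone = copy.deepcopy(doc_a.find("other"))
doc_b.find("slot").append(clone)
```

No explicit "setparent" call is needed in this sketch, since the parent follows from where the element is appended or inserted.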

[lxml-dev] lxml 2.0.5 released
by Stefan Behnel 11 Jan '23

Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series.

Have fun,
Stefan

2.0.5 (2008-05-01)
==================

Bugs fixed
----------

* Resolving to a filename in custom resolvers didn't work.
* lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently.
* Memory leak in Schematron with libxml2 >= 2.6.31.

[lxml-dev] Building LXML Trunk
by Sidnei da Silva 31 Aug '22

Hi,

I've tried to build lxml from trunk today, on Win32. Got the following error:

src\lxml\etree.c(880) : error C2059: syntax error : ')'
src\lxml\etree.c(881) : error C2059: syntax error : ')'
src\lxml\etree.c(882) : error C2059: syntax error : ')'
src\lxml\etree.c(883) : error C2059: syntax error : ')'

Any clue? Smells like a Pyrex issue?

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

[lxml-dev] lxml 2.0 beta1 released
by Stefan Behnel 22 Dec '08

Hi all,

I finally managed to push lxml 2.0beta1 over to PyPI. This release marks the end of the four month alpha cycle of lxml 2.0. The last stable release series, lxml 1.3, saw the light of day more than six months ago.

http://codespeak.net/lxml/dev/
http://pypi.python.org/pypi/lxml/2.0beta1

The complete changelog for beta1 and the 2.0 alpha series follows below. Apart from a number of important fixes and enhancements, this beta release also finalises the major API changes that make the difference between 1.x and 2.x. Incompatible changes after this release will require a very good motivation. As usual, compatible enhancements will always be embraced - as will be updates, clarifications and fixes for the documentation! Asking back helps.

I expect beta1 to also be the last beta release before lxml 2.0 final (hopefully not in the sense that alpha4/5/6 were), so please test as much as you can to spot any remaining bugs and problems.

Note that this release depends on a bug fix in Cython that will hopefully be released as Cython 0.9.6.11 in a couple of days. I attached the necessary patch for those who want to work on the sources.

Another thing: there was a security advisory on the libxml2 mailing list. To prevent DoS attacks, systems that parse XML from untrusted sources should be updated to libxml2 2.6.31 (or should apply the patch that is referenced in Daniel's post below).

http://mail.gnome.org/archives/xml/2008-January/msg00036.html

Sidnei, when you build the Windows binaries, could you please wait for libxml2 2.6.31 to become available as binaries as well? Hopefully, that won't take too long...

Have fun,
Stefan

2.0beta1 (2008-01-11)
=====================

Features added
--------------

* Parse-time XML schema validation (``schema`` parser keyword).
* XPath string results of the ``text()`` function and attribute selection make their Element container accessible through a ``getparent()`` method. As a side-effect, they are now always unicode objects (even ASCII strings).
* ``XSLT`` objects are usable in any thread - at the cost of a deep copy if they were not created in that thread.
* Invalid entity names and character references will be rejected by the ``Entity()`` factory.
* ``entity.text`` returns the textual representation of the entity, e.g. ``&amp;``.

Bugs fixed
----------

* XPath on ElementTrees could crash when selecting the virtual root node of the ElementTree.
* Compilation ``--without-threading`` was buggy in alpha5/6.

Other changes
-------------

* Minor performance tweaks for Element instantiation and subelement creation

2.0alpha6 (2007-12-19)
======================

Features added
--------------

* New properties ``position`` and ``code`` on ParseError exception (as in ET 1.3)

Bugs fixed
----------

* Memory leak in the ``parse()`` function.
* Minor bugs in XSLT error message formatting.
* Result document memory leak in target parser.

Other changes
-------------

* Various places in the XPath, XSLT and iteration APIs now require keyword-only arguments.
* The argument order in ``element.itersiblings()`` was changed to match the order used in all other iteration methods. The second argument ('preceding') is now a keyword-only argument.
* The ``getiterator()`` method on Elements and ElementTrees was reverted to return an iterator as it did in lxml 1.x. The ET API specification allows it to return either a sequence or an iterator, and it traditionally returned a sequence in ET and an iterator in lxml. However, it is now deprecated in favour of the ``iter()`` method, which should be used in new code wherever possible.
* The 'pretty printed' serialisation of ElementTree objects now inserts newlines at the root level between processing instructions, comments and the root tag.
* A 'pretty printed' serialisation is now terminated with a newline.
* Second argument to ``lxml.etree.Extension()`` helper is no longer required, third argument is now a keyword-only argument ``ns``.
* ``lxml.html.tostring`` takes an ``encoding`` argument.

2.0alpha5 (2007-11-24)
======================

Features added
--------------

* Rich comparison of ``element.attrib`` proxies.
* ElementTree compatible TreeBuilder class.
* Use default prefixes for some common XML namespaces.
* ``lxml.html.clean.Cleaner`` now allows for a ``host_whitelist``, and two overridable methods: ``allow_embedded_url(el, url)`` and the more general ``allow_element(el)``.
* Extended slicing of Elements as in ``element[1:-1:2]``, both in etree and in objectify
* Resolvers can now provide a ``base_url`` keyword argument when resolving a document as string data.
* When using ``lxml.doctestcompare`` you can give the doctest option ``NOPARSE_MARKUP`` (like ``# doctest: +NOPARSE_MARKUP``) to suppress the special checking for one test.

Bugs fixed
----------

* Target parser failed to report comments.
* In the ``lxml.html`` ``iter_links`` method, links in ``<object>`` tags weren't recognized. (Note: plugin-specific link parameters still aren't recognized.) Also, the ``<embed>`` tag, though not standard, is now included in ``lxml.html.defs.special_inline_tags``.
* Using custom resolvers on XSLT stylesheets parsed from a string could request ill-formed URLs.
* With ``lxml.doctestcompare`` if you do ``<tag xmlns="...">`` in your output, it will then be namespace-neutral (before the ellipsis was treated as a real namespace).

Other changes
-------------

* The module source files were renamed to "lxml.*.pyx", such as "lxml.etree.pyx". This was changed for consistency with the way Pyrex commonly handles package imports. The main effect is that classes now know about their fully qualified class name, including the package name of their module.
* Keyword-only arguments in some API functions, especially in the parsers and serialisers.

2.0alpha4 (2007-10-07)
======================

Features added
--------------

Bugs fixed
----------

* AttributeError in feed parser on parse errors

Other changes
-------------

* Tag name validation in lxml.etree (and lxml.html) now distinguishes between HTML tags and XML tags based on the parser that was used to parse or create them. HTML tags no longer reject any non-ASCII characters in tag names but only spaces and the special characters ``<>&/"'``.

2.0alpha3 (2007-09-26)
======================

Features added
--------------

* Separate ``feed_error_log`` property for the feed parser interface. The normal parser interface and ``iterparse`` continue to use ``error_log``.
* The normal parsers and the feed parser interface are now separated and can be used concurrently on the same parser instance.
* ``fromstringlist()`` and ``tostringlist()`` functions as in ElementTree 1.3
* ``iterparse()`` accepts an ``html`` boolean keyword argument for parsing with the HTML parser (note that this interface may be subject to change)
* Parsers accept an ``encoding`` keyword argument that overrides the encoding of the parsed documents.
* New C-API function ``hasChild()`` to test for children
* ``annotate()`` function in objectify can annotate with Python types and XSI types in one step. Accompanied by ``xsiannotate()`` and ``pyannotate()``.

Bugs fixed
----------

* XML feed parser setup problem
* Type annotation for unicode strings in ``DataElement()``

Other changes
-------------

* lxml.etree now emits a warning if you use XPath with libxml2 2.6.27 (which can crash on certain XPath errors)
* Type annotation in objectify now preserves the already annotated type by default to prevent losing type information that is already there.

2.0alpha2 (2007-09-15)
======================

Features added
--------------

* ``ET.write()``, ``tostring()`` and ``tounicode()`` now accept a keyword argument ``method`` that can be one of 'xml' (or None), 'html' or 'text' to serialise as XML, HTML or plain text content.
* ``iterfind()`` method on Elements returns an iterator equivalent to ``findall()``
* ``itertext()`` method on Elements
* Setting a QName object as value of the .text property or as an attribute will resolve its prefix in the respective context
* ElementTree-like parser target interface as described in http://effbot.org/elementtree/elementtree-xmlparser.htm
* ElementTree-like feed parser interface on XMLParser and HTMLParser (``feed()`` and ``close()`` methods)

Bugs fixed
----------

* lxml failed to serialise namespace declarations of elements other than the root node of a tree
* Race condition in XSLT where the resolver context leaked between concurrent XSLT calls

Other changes
-------------

* ``element.getiterator()`` returns a list, use ``element.iter()`` to retrieve an iterator (ElementTree 1.3 compatible behaviour)

2.0alpha1 (2007-09-02)
======================

Features added
--------------

* Reimplemented ``objectify.E`` for better performance and improved integration with objectify. Provides extended type support based on registered PyTypes.
* XSLT objects now support deep copying
* New ``makeSubElement()`` C-API function that allows creating a new subelement straight with text, tail and attributes.
* XPath extension functions can now access the current context node (``context.context_node``) and use a context dictionary (``context.eval_context``) from the context provided in their first parameter
* HTML tag soup parser based on BeautifulSoup in ``lxml.html.ElementSoup``
* New module ``lxml.doctestcompare`` by Ian Bicking for writing simplified doctests based on XML/HTML output. Use by importing ``lxml.usedoctest`` or ``lxml.html.usedoctest`` from within a doctest.
* New module ``lxml.cssselect`` by Ian Bicking for selecting Elements with CSS selectors.
* New package ``lxml.html`` written by Ian Bicking for advanced HTML treatment.
* Namespace class setup is now local to the ``ElementNamespaceClassLookup`` instance and no longer global.
* Schematron validation (incomplete in libxml2)
* Additional ``stringify`` argument to ``objectify.PyType()`` takes a conversion function to strings to support setting text values from arbitrary types.
* Entity support through an ``Entity`` factory and element classes. XML parsers now have a ``resolve_entities`` keyword argument that can be set to False to keep entities in the document.
* ``column`` field on error log entries to accompany the ``line`` field
* Error specific messages in XPath parsing and evaluation
  NOTE: for evaluation errors, you will now get an XPathEvalError instead of an XPathSyntaxError. To catch both, you can except on ``XPathError``
* The regular expression functions in XPath now support passing a node-set instead of a string
* Extended type annotation in objectify: new ``xsiannotate()`` function
* EXSLT RegExp support in standard XPath (not only XSLT)

Bugs fixed
----------

* lxml.etree did not check tag/attribute names
* The XML parser did not report undefined entities as error
* The text in exceptions raised by XML parsers, validators and XPath evaluators now reports the first error that occurred instead of the last
* Passing '' as XPath namespace prefix did not raise an error
* Thread safety in XPath evaluators

Other changes
-------------

* objectify.PyType for None is now called "NoneType"
* ``el.getiterator()`` renamed to ``el.iter()``, following ElementTree 1.3 - original name is still available as alias
* In the public C-API, ``findOrBuildNodeNs()`` was replaced by the more generic ``findOrBuildNodeNsPrefix``
* Major refactoring in XPath/XSLT extension function code
* Network access in parsers disabled by default
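A few of the 2.0 API items listed in the changelog can be tried out with a short sketch like the one below. It assumes a current lxml on Python 3; the sample document and variable names are illustrative only. It touches ``getparent()`` on XPath string results, ``iter()``, the keyword-only ``preceding`` argument of ``itersiblings()``, the ``method`` keyword of ``tostring()`` and the ``feed()``/``close()`` parser interface.

```python
from lxml import etree

root = etree.fromstring("<root><a>one</a><b>two</b><c>three</c></root>")

# XPath string results remember the element they came from.
texts = root.xpath("//b/text()")
parent_tag = texts[0].getparent().tag

# iter() walks the tree in document order, starting with the element itself.
tags = [el.tag for el in root.iter()]

# 'preceding' has to be passed by keyword.
b = root.find("b")
following = [el.tag for el in b.itersiblings()]
preceding = [el.tag for el in b.itersiblings(preceding=True)]

# Serialise only the text content.
text_only = etree.tostring(root, method="text")

# Feed parser interface: push chunks, then close() to get the root element.
parser = etree.XMLParser()
parser.feed("<root><a/>")
parser.feed("</root>")
fed_root = parser.close()
```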

[lxml-dev] validation with multiple XSD files
by Arye 07 Jul '08

Hello all,

I would like to do some schema validation and started with the instructions in:

http://codespeak.net/lxml/dev/validation.html#xmlschema

This all works great. Now I would like to extend this to an XSD file that includes many other files. In other words, I have a directory of XSD files that I would like to use. The include statement looks like this (the included file is referenced by its name):

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified">
  <xsd:include schemaLocation="base.xsd"/>
  <xsd:element name="Price">
  ...
  ... some types defined in "base.xsd" are used here

I am new to lxml, so sorry in advance if the question does not make sense.

Regards,
Arye.
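One possible approach, sketched under assumptions: if the main schema is parsed from a file path rather than from a string, the document has a base URL and a relative ``schemaLocation`` such as ``base.xsd`` can be resolved next to it. The schema contents, the ``PriceType`` type, the file name ``main.xsd`` and the helper ``build_schema`` are all invented for this example; only ``base.xsd`` and the ``Price`` element come from the post.

```python
import os
import tempfile

from lxml import etree

# Hypothetical included schema: defines a type used by the main schema.
BASE_XSD = """<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified">
  <xsd:simpleType name="PriceType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="0"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
"""

# Hypothetical main schema: includes base.xsd by its relative name.
MAIN_XSD = """<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified">
  <xsd:include schemaLocation="base.xsd"/>
  <xsd:element name="Price" type="PriceType"/>
</xsd:schema>
"""


def build_schema(directory):
    """Write both schema files into `directory` and compile the main one."""
    for name, content in (("base.xsd", BASE_XSD), ("main.xsd", MAIN_XSD)):
        with open(os.path.join(directory, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    # Parsing from a path gives the schema document a base URL, so the
    # relative include is looked up in the same directory.
    return etree.XMLSchema(etree.parse(os.path.join(directory, "main.xsd")))


with tempfile.TemporaryDirectory() as tmp:
    schema = build_schema(tmp)
    valid = schema.validate(etree.fromstring("<Price>9.99</Price>"))
    invalid = schema.validate(etree.fromstring("<Price>-1</Price>"))
```

With an existing directory of XSD files, the same idea would reduce to passing the path of the top-level schema file to ``etree.parse()``.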

[lxml-dev] Pickling objectified trees
by Christian Zagrodnick 09 Jun '08

Hi,

the other day I had to pickle objectified trees. I just thought to share my findings.

Pickling is about serialization. IMHO the natural serialization of an objectified tree is its XML representation. So the following basically does that:

--------------------------
import copy_reg  # Python 2 module name; it is called copyreg in Python 3
import lxml.etree
import lxml.objectify


def treeFactory(state):
    """Un-Pickle factory."""
    return lxml.objectify.fromstring(state)

# Declare treeFactory as a valid unpickling constructor.
copy_reg.constructor(treeFactory)


def reduceObjectifiedElement(element):
    """Reduce function for lxml.objectify trees.

    See http://docs.python.org/lib/pickle-protocol.html for details.
    """
    # The pickled state is simply the serialised XML of the element.
    return (treeFactory, (lxml.etree.tostring(element), ))

copy_reg.pickle(lxml.objectify.ObjectifiedElement,
                reduceObjectifiedElement, treeFactory)
-----------------------------------------

You might consider just registering the reduce function in lxml itself. Shouldn't hurt, should it?

--
Christian Zagrodnick
gocept gmbh & co. kg · forsterstrasse 29 · 06112 halle/saale
www.gocept.com · fon. +49 345 12298894 · fax. +49 345 12298891
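As a hedged round-trip sketch of the same idea on Python 3, where the module is named ``copyreg``: the reduce function below returns ``objectify.fromstring`` directly instead of a separate factory, which is a simplification of the recipe above and not the poster's exact code. The sample document is made up.

```python
import copyreg
import pickle

from lxml import etree, objectify


def reduce_objectified_element(element):
    """Describe an objectified element by its serialised XML."""
    # pickle will call objectify.fromstring(<xml bytes>) when loading.
    return (objectify.fromstring, (etree.tostring(element),))


copyreg.pickle(objectify.ObjectifiedElement, reduce_objectified_element)

root = objectify.fromstring("<root><a>1</a><b>text</b></root>")
restored = pickle.loads(pickle.dumps(root))
```

The restored object is a new tree parsed from the XML, so it is equal in content to the original but not the same object.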

[lxml-dev] first lessons learned while porting lxml to Py3
by Stefan Behnel 31 May '08

Hi,

since we had a lengthy discussion on whether or not non-prefixed byte strings should automatically mutate into unicode strings when compiled for Py3, here are some initial lessons from my first attempt to port lxml.

My first approach was (obviously) to import unicode_literals from __future__. This failed miserably, and even showed a couple of further bugs in Cython. :)

I then chose the route to explicitly prepend unicode strings with 'u', as I wanted to keep my source compilable with older Cython versions that do not support the 'b' prefix. Currently, I have changed about 700 lines this way in a quick walk-through, and now I'm searching the places where this was the wrong thing to do. :)

Most important evidence found: it's definitely non-trivial in a lot of places to decide what has to be unicode and what doesn't. It's non-trivial for me, and definitely not easier for Cython.

One important place where I ended up with a lot of trivial changes is docstrings. Here, I would give an almost 100% chance that the user meant a unicode string if it's not prefixed. The remaining cases, e.g. where some external tool may require binary data for some kind of configuration or analysis, are rare enough to just ignore them. For exactly this reason (I think), the doctest module in Py3 ignores docstrings that are not unicode. This might be a place where an automatic conversion might make sense (although, if it's the only place, that would be some funny string semantics...)

Another important place is exception messages. Here, I'd give a real 100% for string literals, as their only purpose is to be human readable.

A field where I really had to take care is when working with byte sequences. For example, lxml has a couple of places where strings are converted into UTF-8 and then passed into re.findall() or re.sub(). When substituting, the replacement string obviously has to be a byte string, too. I also found a bug in the Py3 re module when working with byte strings in one specific case.

There are actually quite a number of places where strings are built as byte strings by combining and formatting literals, and then converted to a char*. Another place where automatic conversion must not happen.

So, while still on the way, my first real-world impression matches my original opinion. There are definitely a lot of unprefixed strings in my own code that are meant to be unicode strings. Simply switching their type in Py3 will fix a lot of them, but at the same time break many others. The things that it fixes are the trivial parts: docstrings and exceptions. Almost everything else really were byte strings, and some were non-trivial things that need real work.

If I can choose, I opt for going through this once and then having code that correctly distinguishes between byte strings and unicode strings in *both* Py2 and Py3, instead of additionally having to deal with changing string semantics for identical code in different environments.

We might think about a way to simplify the transition from unprefixed docstrings and exception messages to unicode strings. As it currently stands, everything else is definitely out of scope for any automatism.

Stefan
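The point about byte sequences and the re module can be illustrated with a small standard-library sketch for Python 3. The helper names and the regular expression are hypothetical and are not taken from the lxml sources; the sketch only shows that pattern, replacement and subject must all be byte strings once the data has been encoded to UTF-8.

```python
import re


def strip_xml_declaration(text):
    """Encode text as UTF-8 and remove a leading XML declaration (illustrative)."""
    data = text.encode("utf-8")
    # Pattern and replacement are byte strings, matching the encoded data.
    return re.sub(rb"^<\?xml[^>]*\?>\s*", b"", data)


def mixing_types_fails():
    """Return True if re refuses a unicode pattern on byte string data."""
    try:
        re.sub("a", "b", b"abc")
    except TypeError:
        return True
    return False


cleaned = strip_xml_declaration('<?xml version="1.0"?>\n<r>\u00e9</r>')
```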

[lxml-dev] lxml 2.0.6 released
by Stefan Behnel 31 May '08

Hi,

lxml 2.0.6 is on PyPI. This is a bug-fix-only release for the stable 2.0 series.

As a long-standing threading problem was solved, updating is generally recommended, although it should not affect currently working code. It should, however, make it possible to run lxml threaded under mod_python and friends. Feedback is welcome.

This release should also make life easier for MacOS-X users.

Have fun,
Stefan

2.0.6 (2008-05-31)
==================

Features added
--------------

Bugs fixed
----------

* Incorrect evaluation of el.find("tag[child]").
* Windows build was broken.
* Moving a subtree from a document created in one thread into a document of another thread could crash when the rest of the source document is deleted while the subtree is still in use.
* Rare crash when serialising to a file object with certain encodings.

Other changes
-------------

* lxml should now build without problems on MacOS-X.

[lxml-dev] Python 3 changes in lxml 2.1
by Stefan Behnel 31 May '08

Hi,

as it currently seems, lxml 2.1 will support Python 2.6 and Python 3 out of the box. While fixing up lxml 2.1beta to make this work, I found a couple of things that I needed to change. Here's an (incomplete) list, so that people can start shouting at me for breaking their code. ;)

One major thing that changed is that the API will now always return unicode strings for non byte stream data (.text, .tag, namespaces, ...), whereas it continues to return a byte string for plain ASCII data in Py2.

Two things have become a bit quirky now. We currently return a subclass of ElementTree from XSLT, and you can call str(tree) on it to get the result. Returning a byte string here raises an exception in Py3, so that str(result) now behaves as unicode(result) did before, i.e. it returns a Python unicode string. To get the expected result as a byte string, people will have to use the new buffer protocol instead (memoryview&friends). This also means that bytes(xslt_result) will work as expected. Sadly, this means that there isn't a way to get the result in a portable way. I'm thinking about adding a .tobytes() method, but I'm not sure this is really helpful.

The second quirk is serialisation to a unicode string. Instead of

    tostring(root, encoding=unicode)

you now have to write

    tostring(root, encoding=str)

so this requires source adaptation. Then again, this is (hopefully) a rare usage anyway and most Python code will require Py3 changes anyway. Haven't checked, but the 2to3 tool should normally take care of this.

The ugliest problem I found so far is with doctests. There just isn't a way to write a Py2/Py3 portable doctest that accepts exactly a byte string or unicode strings as output, as both look different in Py2 and Py3. Also, exception names are now fully qualified, so that tracebacks look different. Tons of failing tests for nothing...

Stefan
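The two quirks can be sketched as follows on Python 3 with a current lxml. The document and the stylesheet are invented for illustration, and the behaviour of ``bytes(result)`` is assumed to follow the buffer protocol support described in the post.

```python
from lxml import etree

root = etree.fromstring("<root><child>text</child></root>")

# Serialisation: encoding=str yields a unicode string, the default yields bytes.
as_text = etree.tostring(root, encoding=str)
as_bytes = etree.tostring(root)

# A tiny stylesheet that outputs the text of <child> (illustrative only).
stylesheet = etree.XML(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:output method="text"/>'
    '<xsl:template match="/"><xsl:value-of select="root/child"/></xsl:template>'
    '</xsl:stylesheet>'
)
transform = etree.XSLT(stylesheet)
result = transform(root)

result_text = str(result)     # unicode string in Py3
result_bytes = bytes(result)  # byte string via the buffer protocol
```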
