lxml 2.0 released
stefan_ml at behnel.de
Fri Feb 1 19:43:49 CET 2008
I'm very happy to announce the official release of lxml 2.0!
** Install it with
$ easy_install lxml==2.0
** What is lxml?
In short: lxml is the most feature-rich and easy-to-use library for working
with XML and HTML in the Python language.
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique
in that it combines the speed and feature completeness of these libraries with
the simplicity of a native Python API.
This release marks the end of a development effort of more than 6 months,
starting with the release of the last stable series lxml 1.3. The major
differences are explained on this page:
lxml 2.0 is not a revolution but a gradual move towards a cleaner API, with
more things working together as expected. It nevertheless comes with a lot of
new tools and features that make your XML life easier, and your HTML life even
more so. A couple of minor things were also deprecated and will be removed in
lxml 2.1. See the above link for details.
The new release has already adopted a lot of changes from the upcoming
ElementTree 1.3 library, and implements a much broader set of compatible
features, such as the TreeBuilder interface for parser targets.
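As a rough illustration of the parser target interface mentioned above, here
is a minimal sketch. The ``EventCollector`` class and the sample document are
made up for this example, and the fallback import of the standard library's
ElementTree is only an assumption so that the snippet also runs where lxml is
not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree


class EventCollector(object):
    """Hypothetical parser target that records tag events and text."""

    def __init__(self):
        self.events = []
        self.text = []

    def start(self, tag, attrib):
        # called for every opening tag
        self.events.append(("start", tag))

    def end(self, tag):
        # called for every closing tag
        self.events.append(("end", tag))

    def data(self, data):
        # called for text content, possibly in several chunks
        self.text.append(data)

    def close(self):
        # whatever close() returns becomes the result of the parse
        return self.events, "".join(self.text)


parser = etree.XMLParser(target=EventCollector())
events, text = etree.fromstring("<root><a/><b>some text</b></root>", parser)
```

With a target set, the parser does not build a tree; ``fromstring()`` hands
back the return value of the target's ``close()`` method instead.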
The complete changelog follows.
* Passing the ``unicode`` type as ``encoding`` to ``tostring()`` will
serialise to unicode. The ``tounicode()`` function is now officially
* ``XMLSchema()`` and ``RelaxNG()`` can parse from StringIO.
* ``makeparser()`` function in ``lxml.objectify`` to create a new
parser with the usual objectify setup.
* Plain ASCII XPath string results are no longer forced into unicode
objects as in 2.0beta1, but are returned as plain strings as before.
* All XPath string results are 'smart' objects that have a
``getparent()`` method to retrieve their parent Element.
* ``with_tail`` option in serialiser functions.
* More accurate exception messages in validator creation.
* Missing import in ``lxml.html.clean``.
* Some Python 2.4-isms prevented lxml from building/running under
* Exceptions carry only the part of the error log that is related to
the operation that caused the error.
* ``XMLSchema()`` and ``RelaxNG()`` now enforce passing the source
file/filename through the ``file`` keyword argument.
* The test suite now skips most doctests under Python 2.3.
* ``make clean`` no longer removes the .c files (use ``make
* Parse-time XML schema validation (``schema`` parser keyword).
* XPath string results of the ``text()`` function and attribute
selection make their Element container accessible through a
``getparent()`` method. As a side-effect, they are now always
unicode objects (even ASCII strings).
* ``XSLT`` objects are usable in any thread - at the cost of a deep
copy if they were not created in that thread.
* Invalid entity names and character references will be rejected by
the ``Entity()`` factory.
* ``entity.text`` returns the textual representation of the entity,
* XPath on ElementTrees could crash when selecting the virtual root
node of the ElementTree.
* Compilation ``--without-threading`` was buggy in alpha5/6.
* Minor performance tweaks for Element instantiation and subelement
* New properties ``position`` and ``code`` on ParseError exception (as
in ET 1.3)
* Memory leak in the ``parse()`` function.
* Minor bugs in XSLT error message formatting.
* Result document memory leak in target parser.
* Various places in the XPath, XSLT and iteration APIs now require
* The argument order in ``element.itersiblings()`` was changed to
match the order used in all other iteration methods. The second
argument ('preceding') is now a keyword-only argument.
* The ``getiterator()`` method on Elements and ElementTrees was
reverted to return an iterator as it did in lxml 1.x. The ET API
specification allows it to return either a sequence or an iterator,
and it traditionally returned a sequence in ET and an iterator in
lxml. However, it is now deprecated in favour of the ``iter()``
method, which should be used in new code wherever possible.
* The 'pretty printed' serialisation of ElementTree objects now
inserts newlines at the root level between processing instructions,
comments and the root tag.
* A 'pretty printed' serialisation is now terminated with a newline.
* Second argument to ``lxml.etree.Extension()`` helper is no longer
required, third argument is now a keyword-only argument ``ns``.
* ``lxml.html.tostring`` takes an ``encoding`` argument.
* Rich comparison of ``element.attrib`` proxies.
* ElementTree compatible TreeBuilder class.
* Use default prefixes for some common XML namespaces.
* ``lxml.html.clean.Cleaner`` now allows for a ``host_whitelist``, and
two overridable methods: ``allow_embedded_url(el, url)`` and the
more general ``allow_element(el)``.
* Extended slicing of Elements as in ``element[1:-1:2]``, both in
etree and in objectify
* Resolvers can now provide a ``base_url`` keyword argument when
resolving a document as string data.
* When using ``lxml.doctestcompare`` you can give the doctest option
``NOPARSE_MARKUP`` (like ``# doctest: +NOPARSE_MARKUP``) to suppress
the special checking for one test.
* Target parser failed to report comments.
* In the ``lxml.html`` ``iter_links`` method, links in ``<object>``
tags weren't recognized. (Note: plugin-specific link parameters
still aren't recognized.) Also, the ``<embed>`` tag, though not
standard, is now included in ``lxml.html.defs.special_inline_tags``.
* Using custom resolvers on XSLT stylesheets parsed from a string
could request ill-formed URLs.
* With ``lxml.doctestcompare`` if you do ``<tag xmlns="...">`` in your
output, it will then be namespace-neutral (before the ellipsis was
treated as a real namespace).
* The module source files were renamed to "lxml.*.pyx", such as
"lxml.etree.pyx". This was changed for consistency with the way
Pyrex commonly handles package imports. The main effect is that
classes now know about their fully qualified class name, including
the package name of their module.
* Keyword-only arguments in some API functions, especially in the
parsers and serialisers.
* AttributeError in feed parser on parse errors
* Tag name validation in lxml.etree (and lxml.html) now distinguishes
between HTML tags and XML tags based on the parser that was used to
parse or create them. HTML tags no longer reject any non-ASCII
characters in tag names but only spaces and the special characters
* Separate ``feed_error_log`` property for the feed parser interface.
The normal parser interface and ``iterparse`` continue to use
* The normal parsers and the feed parser interface are now separated
and can be used concurrently on the same parser instance.
* ``fromstringlist()`` and ``tostringlist()`` functions as in
* ``iterparse()`` accepts an ``html`` boolean keyword argument for
parsing with the HTML parser (note that this interface may be
subject to change)
* Parsers accept an ``encoding`` keyword argument that overrides the encoding
of the parsed documents.
* New C-API function ``hasChild()`` to test for children
* ``annotate()`` function in objectify can annotate with Python types and XSI
types in one step. Accompanied by ``xsiannotate()`` and ``pyannotate()``.
* XML feed parser setup problem
* Type annotation for unicode strings in ``DataElement()``
* lxml.etree now emits a warning if you use XPath with libxml2 2.6.27
(which can crash on certain XPath errors)
* Type annotation in objectify now preserves the already annotated type by
default to prevent losing type information that is already there.
* ``ET.write()``, ``tostring()`` and ``tounicode()`` now accept a keyword
argument ``method`` that can be one of 'xml' (or None), 'html' or 'text' to
serialise as XML, HTML or plain text content.
* ``iterfind()`` method on Elements returns an iterator equivalent to
* ``itertext()`` method on Elements
* Setting a QName object as value of the .text property or as an attribute
will resolve its prefix in the respective context
* ElementTree-like parser target interface as described in
* ElementTree-like feed parser interface on XMLParser and HTMLParser
(``feed()`` and ``close()`` methods)
* lxml failed to serialise namespace declarations of elements other than the
root node of a tree
* Race condition in XSLT where the resolver context leaked between concurrent
* ``element.getiterator()`` returns a list, use ``element.iter()`` to retrieve
an iterator (ElementTree 1.3 compatible behaviour)
* Reimplemented ``objectify.E`` for better performance and improved
integration with objectify. Provides extended type support based on
* XSLT objects now support deep copying
* New ``makeSubElement()`` C-API function that allows creating a new
subelement straight with text, tail and attributes.
* XPath extension functions can now access the current context node
(``context.context_node``) and use a context dictionary
(``context.eval_context``) from the context provided in their first
* HTML tag soup parser based on BeautifulSoup in ``lxml.html.ElementSoup``
* New module ``lxml.doctestcompare`` by Ian Bicking for writing simplified
doctests based on XML/HTML output. Use by importing ``lxml.usedoctest`` or
``lxml.html.usedoctest`` from within a doctest.
* New module ``lxml.cssselect`` by Ian Bicking for selecting Elements with CSS
* New package ``lxml.html`` written by Ian Bicking for advanced HTML
* Namespace class setup is now local to the ``ElementNamespaceClassLookup``
instance and no longer global.
* Schematron validation (incomplete in libxml2)
* Additional ``stringify`` argument to ``objectify.PyType()`` takes a
conversion function to strings to support setting text values from arbitrary
* Entity support through an ``Entity`` factory and element classes. XML
parsers now have a ``resolve_entities`` keyword argument that can be set to
False to keep entities in the document.
* ``column`` field on error log entries to accompany the ``line`` field
* Error specific messages in XPath parsing and evaluation
NOTE: for evaluation errors, you will now get an XPathEvalError instead of
an XPathSyntaxError. To handle both, catch their common base class
``XPathError``.
* The regular expression functions in XPath now support passing a node-set
instead of a string
* Extended type annotation in objectify: new ``xsiannotate()`` function
* EXSLT RegExp support in standard XPath (not only XSLT)
* lxml.etree did not check tag/attribute names
* The XML parser did not report undefined entities as errors
* The text in exceptions raised by XML parsers, validators and XPath
evaluators now reports the first error that occurred instead of the last
* Passing '' as XPath namespace prefix did not raise an error
* Thread safety in XPath evaluators
* objectify.PyType for None is now called "NoneType"
* ``el.getiterator()`` renamed to ``el.iter()``, following ElementTree 1.3 -
original name is still available as alias
* In the public C-API, ``findOrBuildNodeNs()`` was replaced by the more
* Major refactoring in XPath/XSLT extension function code
* Network access in parsers disabled by default
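
Several entries in the changelog above follow ElementTree 1.3: ``iter()``,
``iterfind()``, ``itertext()``, the ``method`` keyword of ``tostring()``, and
``fromstringlist()`` / ``tostringlist()``. The following is a small sketch of
how they might be used together. The sample document is made up, and the
fallback import is only an assumption so that the snippet also runs without
lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# fromstringlist() parses a document that arrives as a sequence of fragments
root = etree.fromstringlist([
    "<doc><p>Hello <b>XML</b>",
    " world</p><p>Bye</p></doc>",
])

# iter() walks the whole tree in document order
tags = [el.tag for el in root.iter()]

# iterfind() is the lazy counterpart of findall()
paragraph_count = len(list(root.iterfind("p")))

# itertext() yields the text content of an element and its descendants
first_paragraph = "".join(root[0].itertext())

# method="text" serialises only the text content, dropping all markup
plain = etree.tostring(root, method="text").decode("ascii")

# tostringlist() returns the serialised document as a list of byte strings
chunks = etree.tostringlist(root)
```

Here ``tags`` lists the elements in document order, ``first_paragraph`` joins
the text of the first ``<p>`` including its nested ``<b>``, and ``plain`` holds
the text of the whole document without any tags.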