[XML-SIG] lxml 2.0alpha1 released
stefan_ml at behnel.de
Mon Sep 3 09:29:43 CEST 2007
Gloria W wrote:
> Stefan, congratulations. This is definitely useful.
> Please talk a bit about the API, and how it differs/varies from
> or link to some examples.
The docs are full of doctest examples. However, as lxml.html is still pretty
new, its docs are not as comprehensive as those for lxml.etree yet.
> For example, the node nesting,
> the usage of a 'tail' for trailing text. I wonder if lxml offers more of
> a DOM compliant node nesting, or if it conforms to the
> conventions/oddities of ElementTree.
lxml.etree aims for ElementTree compatibility, so these things work alike. The
above link describes the differences that we either cannot work around for
technical reasons (or performance reasons) or that are deliberate decisions
where we think ElementTree is wrong.
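
To make the 'tail' convention concrete, here is a minimal sketch. The sample
document is made up for illustration, and the import fallback assumes that the
stdlib ElementTree is an acceptable stand-in when lxml is not installed (which
is the point of the compatibility goal):

```python
try:
    from lxml import etree  # used if lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib; same API for this example

# Mixed content: text before, inside, and after the <b> child.
root = etree.fromstring("<p>Hello <b>big</b> world</p>")
b = root[0]

print(repr(root.text))  # 'Hello '  (text before the first child)
print(repr(b.text))     # 'big'     (text inside <b>)
print(repr(b.tail))     # ' world'  (text following </b>, stored on <b>)
print(len(root))        # 1: only the <b> element is a child, no text nodes
```

The trailing text hangs off the element it follows, so the tree contains
elements only.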
Note that the ElementTree API is more and more becoming a basis for other APIs
in lxml. There is lxml.objectify, which replaces a lot of this API with
something that works more like Python objects themselves (a data binding
approach). lxml.html extends the API with a bunch of helper methods for link
handling and also changes the way the serialisation works to better adapt it
to HTML. There's also MathDOM, a MathML implementation, which was the original
reason for making lxml extensible at the Element level, back in the days of
lxml 0.7. The original idea was actually 'stolen' from Xist, although lxml has
definitely found its own way of dealing with it.
The one thing I like most about lxml is the tool integration. For example, you
can use the Element API in lxml.etree or lxml.objectify or lxml.html, with any
of the five path languages: ElementPath, ETXPath, XPath, CSS-Selectors or
I think this is a trend that should continue. Most XML/HTML formats can
benefit from specialised Element classes with specially adapted or added
methods, properties and even different tree behaviour, while still taking
advantage of all the other tools that lxml provides. The possibilities that
lxml offers here are close to unlimited (both at the Python level and at the C
level) - even with the 'oddities' (as you called them) of ElementTree. I
personally believe that .tail attributes are actually a big advantage, as the
absence of separate text nodes simplifies the tree model considerably (well,
the public one, not necessarily the internal one...)
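
As a rough illustration of mixing path languages on one tree, the sketch below
runs an ElementPath query and an XPath query against the same Element. The
sample data and the variable names are invented for the example. The XPath
part only runs when lxml is importable, since xpath() is an lxml extension and
not part of ElementTree:

```python
try:
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree
    HAVE_LXML = False

root = etree.fromstring(
    '<list><item id="1">a</item><item id="2">b</item></list>'
)

# ElementPath: the ElementTree path subset, via find()/findall().
via_elementpath = [el.text for el in root.findall(".//item")]
print(via_elementpath)

# Full XPath is an lxml extension, so only run it when lxml is present.
via_xpath = None
if HAVE_LXML:
    via_xpath = root.xpath("//item[@id='2']/text()")
    print(via_xpath)
```

Both queries start from the same Element object, so there is no conversion
step between the two tools.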
> Also show us how it differs from BeautifulSoup, which has extremely
> robust unicode handling and mangled XML/HTML tag completion, but may
> benchmark a bit slower.
libxml2 does not have as robust support for HTML-like tag soup as
BeautifulSoup, but it does a pretty good job anyway. In lxml 2.0, lxml.html
comes with BeautifulSoup integration (as ElementTree does), so now you can
have both: a tag soup parser and all the features of lxml.
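
A small sketch of what the libxml2-based HTML parser does with unclosed tags,
assuming lxml is installed (the snippet does nothing otherwise). The broken
input is made up for illustration. For markup that libxml2 cannot handle, the
BeautifulSoup-backed parser is exposed as lxml.html.soupparser and needs
BeautifulSoup installed separately, so it is not used here:

```python
broken = "<p>Hello <b>world"  # neither <b> nor <p> is closed

try:
    import lxml.html as lhtml
except ImportError:
    lhtml = None  # lxml not installed; nothing to demonstrate

repaired = None
if lhtml is not None:
    doc = lhtml.fromstring(broken)  # parse with libxml2's HTML parser
    repaired = lhtml.tostring(doc, encoding="unicode")
    print(repaired)  # serialised tree, with the missing end tags added
```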