[XML-SIG] libxml support for Python
Thu, 24 Jan 2002 17:23:55 -0800
Here are some notes I've written about supporting Python with
libxml. (It's at http://xmlsoft.org.)
Any suggestions and ideas will be appreciated.
libxml Support for Python
I'm interested in providing support for Python built on libxml and
libxslt. Since this is the libxml list, I'll concentrate on libxml,
although it is important to be able to use libxslt from Python, too.
We should consider providing support in the following areas:
* Support for the DOM interface built on libxml (or gdome?).
* Support for the SAX interface built on libxml.
* Support for XSLT built on libxslt. (Possibly a discussion for the
Support for DOM
I've built DOM support for Python by hand, i.e. manually written
wrapper functions, types, etc. that expose libxml's DOM support.
(Available at http://www.rexx.com/~dkuhlman.) But it's weak.
I've also used SWIG to generate wrappers for the libxml DOM support.
Basically, I generated wrappers for the stuff in include/libxml/tree.h
and include/libxml/parser.h. It works "pretty" well. A bit of
evaluation:
* Because I used SWIG's shadow classes, the doc, nodes, and
attributes look, from Python's point of view, like instances of
classes. So, walking the DOM tree is very easy and natural.
* One benefit of doing this -- The Python objects (xmlDoc, xmlNode,
xmlAttr) are proxies for the "real", underlying (libxml) C objects
and the linkages between objects are in the underlying C objects.
Therefore, this implementation does not suffer from the problems
caused by circular references in Python objects. (Note that I
believe that I also solved this problem in my hand-written
* Moreover, the Python objects are created and destroyed on the fly
and only on request. For example,
node = node.children
node = node.next
This code creates two Python node objects. Furthermore, when the
value of variable 'node' is overwritten (and if there is no other
reference to that value), the Python object is destroyed. (For
non-Python people who are still reading, Python uses a reference
counting strategy for managing memory.) The upshot is that this
implementation (and my hand-written one as well) enables Python
scripts to load and use large DOM trees with very little memory
overhead above that used by the libxml C objects.
* One qualification is that the interface is at the level of
libxml, so it's a bit low level. For example, a long-running
application would have to call a 'free' function, e.g. xmlFreeDoc,
which is not something a Python programmer would expect to have to
do.
* Another qualification is that this implementation needs some
fix-up, because there are some kinds of nodes in the tree that can
cause segmentation faults.
* And, the generated code is a bit large. I'm not sure that this is
a concern in a world where disk space is so cheap. It's possible
that we will want to trim and not generate code for a few things.
On the other hand, there may be additional libxml DOM related
capabilities that we would also want to expose (and which would
make it even larger). Catalogs (catalog.h)? Entities (entities.h)?
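The proxy idea described in the list above can be illustrated with a small, pure-Python sketch. This is not the SWIG-generated code: the class _CNode is a hypothetical stand-in for libxml's C node structure, and NodeProxy, build_tree and child_names are names made up for this example. It only shows the pattern in which the tree links live in the underlying objects while thin proxies are created on request and dropped by reference counting (the immediate clean-up relies on CPython's behavior).

```python
class _CNode:
    """Hypothetical stand-in for a C-level node; it owns the tree links."""

    def __init__(self, name):
        self.name = name
        self.children = None  # first child, or None
        self.next = None      # next sibling, or None


class NodeProxy:
    """Thin proxy created on request; it stores no tree links itself."""

    live = 0  # number of proxy objects currently alive

    def __init__(self, cnode):
        self._c = cnode
        NodeProxy.live += 1

    def __del__(self):
        NodeProxy.live -= 1

    @property
    def name(self):
        return self._c.name

    @property
    def children(self):
        c = self._c.children
        return NodeProxy(c) if c is not None else None

    @property
    def next(self):
        c = self._c.next
        return NodeProxy(c) if c is not None else None


def build_tree(root_name, names):
    """Build a root with one level of children in the simulated C layer."""
    root = _CNode(root_name)
    prev = None
    for n in names:
        node = _CNode(n)
        if prev is None:
            root.children = node
        else:
            prev.next = node
        prev = node
    return root


def child_names(root):
    """Walk the children; each overwritten proxy is released as we go."""
    names = []
    node = root.children
    while node is not None:
        names.append(node.name)
        node = node.next
    return names


root = NodeProxy(build_tree("doc", ["a", "b", "c"]))
print(child_names(root))
print(NodeProxy.live)
```

After the walk, only the proxy for the root is still alive, no matter how many children were visited, which is the low-overhead property claimed above.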
gdome -- Whoa. I thought the DOM support was in libxml. I'll have to
look into gdome. Can someone enlighten me on the relationship between
gdome and libxml DOM support? Does gdome support a newer version of
the DOM spec? Should we build DOM support for Python on top of libxml
or on top of gdome?
Summary -- I'll continue to work on the SWIG wrappers for the libxml
DOM interface. I'll try to fix a few problems that I've found and will
look into generating support for encodings, catalogs, and entities.
I'll also try to learn a bit more about gdome.
Support for SAX
I've built Python wrappers for the libxml SAX support by hand (i.e.
not generated by SWIG). (Available at http://www.rexx.com/~dkuhlman.)
A bit of evaluation:
* Ease of use -- I've used it quite a bit and it seems quite easy to
use. It's a trivial task to create a Python handler
class with methods like 'startElement', 'endElement', 'characters',
etc. and then run the parse to catch those events.
* Efficiency -- The wrapper C code checks, at the beginning of the
parse, to determine which event handler methods are defined in the
handler class. Then, during the parse, the C code does not call
any Python code (or do look-ups) for those event handlers that are
_not_ defined. For example, if the method 'characters' is not
defined in the handler class, then the C code will not call any
Python code for the characters event. So, processing should be quite
efficient when a minimum of work is done in Python. Perhaps
another way to say this is that there will be Python overhead
only where that overhead is needed.
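The efficiency point above can be sketched in pure Python. The real wrapper does this in C; here bind_handlers, dispatch and the event list are hypothetical names invented for illustration, and the events are simulated rather than produced by libxml. The point is only that handler look-up happens once, before the parse, and that events without a handler never reach Python handler code.

```python
EVENT_NAMES = ("startElement", "endElement", "characters")


def bind_handlers(handler):
    """Look up each event method once, before the parse starts."""
    bound = {}
    for name in EVENT_NAMES:
        method = getattr(handler, name, None)
        if callable(method):
            bound[name] = method
    return bound


def dispatch(events, handler):
    """Deliver simulated parse events; return how many calls were made."""
    bound = bind_handlers(handler)
    calls = 0
    for name, args in events:
        method = bound.get(name)
        if method is None:
            continue  # no handler defined: nothing is called for this event
        method(*args)
        calls += 1
    return calls


class ElementCollector:
    """Handler that defines only startElement."""

    def __init__(self):
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)


# Simulated event stream for <doc>hello<item/></doc>.
EVENTS = [
    ("startElement", ("doc", {})),
    ("characters", ("hello",)),
    ("startElement", ("item", {})),
    ("endElement", ("item",)),
    ("endElement", ("doc",)),
]

collector = ElementCollector()
print(dispatch(EVENTS, collector), collector.started)
```

Of the five events, only the two startElement events result in a Python call, because the handler class defines nothing else.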
Creating a parser driver for PyXML built on libxml seems like a very
good idea. There are several benefits to be gained from doing so:
* It would be fast, because most of the work would be done in C.
* It would provide a validating parser.
* It would be both fast and validating. This is something that PyXML
(to my understanding) does not currently have. pyexpat and sgmlop
are fast (because they are implemented in C), and xmlproc is a
validating parser. But no current driver is both.
Here are a couple of issues that we should keep in mind:
* Building libxml is a reasonable amount of work which not every
user of PyXML is likely to want to do. Therefore, we will most
likely want to package a libxml parser driver for PyXML as an
add-on, i.e. as something that can be built and installed after
PyXML has been installed.
* For speed it would be advantageous to not execute call-backs (or
do call-back look-up) for event handler methods not defined in the
handler class.
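One way an add-on driver could fit in, sketched with the standard xml.sax module rather than with PyXML itself: xml.sax.make_parser() accepts a list of driver module names to try first and falls back to the default driver when they cannot be imported. The module name "hypothetical_libxml_driver" below is made up for this example; no such driver is assumed to exist. Scripts written this way would keep working whether or not the add-on has been built and installed.

```python
import io
import xml.sax

# Hypothetical module name for an add-on driver built on libxml. If it is
# not installed, make_parser() falls back to the default driver.
PREFERRED_DRIVERS = ["hypothetical_libxml_driver"]


class NameCollector(xml.sax.ContentHandler):
    """Records element names; other events are left to the base class."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_names(data):
    """Parse XML bytes and return element names in document order."""
    parser = xml.sax.make_parser(PREFERRED_DRIVERS)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.names


print(parse_names(b"<doc><item/><item/></doc>"))
```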
Summary -- I'll start looking into and working on a parser driver (the
equivalent of pyexpat or sgmlop) for PyXML built on top of libxml.
Support for XSLT
I've built Python wrappers for libxslt. (Available at
http://www.rexx.com/~dkuhlman.) I've had one user report a bug, which
I fixed. I've used it a reasonable amount. It's very easy to use.
One of my goals when I started my work in exposing libxml and libxslt
to Python was to provide an alternative source of XML support for
Python. My belief is that it adds credibility to the Python project to
have more than one source of support for something as important as
XML. So, I feel that it is important that we both support the PyXML
effort and that I provide independent support built on top of