[Python-Dev] RE: Python-Dev digest, Vol 1 #1637 - 11 msgs

Andy Robinson andy@reportlab.com
Tue, 2 Oct 2001 06:53:40 +0100


> If you want a very fast validating XML parser, RXP would also
> be a good choice -- AFAIK, the RXP folks would allow us to
> ship RXP under a different license than GPL which is then
> bound to Python.
> 
> Given the many alternatives, I am not sure whether going with
> expat is the right path... may be wrong though.
> 

Lucky I tuned in.  Reportlab has had great success with RXP.
We have a Python wrapper, pyRXP, with binaries available for
several platforms.

RXP is GPLed at present.  Its authors wish to keep the GPL just in
case someone big comes along and wants their code for ten million
set-top boxes or something.  However, I persuaded them to grant
a license letting it be used through the Python binding under
Python-like terms, as long as we draft the wording and save
them from wasting time on lawyers.  They would even be
happy for it to go into the Python distribution.  And
we're happy to maintain the wrapper and binaries for
several platforms, which we have to do for our customers
anyway.

If one of the core Python team, who I know have long and 
painful experience of this stuff, would like to drop me a line, 
we can probably sort this out in a night.  

The other thing we found very useful was our representation.
We make reports, and XML is a common data source; so our goal
is typically to slurp XML into memory as fast as possible,
with validation.  We eventually hit on a 'tuple tree':
each tag is represented as 
  (tagname, attrs, list-of-children, spare)
We get there about 6x faster than the fastest alternative
parser we know, because all the work is done in C; with
typical use of other parsers you call back into Python
on every tag.  The tree structure is a fraction of the
size in memory of what gets created by models using
objects for every node.  It would be very easy to add
this as an alternative interface to expat as well.  So
then Python users could have a choice of tree or events,
and validating or non-validating, all done in C and
in the standard distribution.
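
To make the shape of the tuple tree concrete, here is a minimal sketch in
pure Python that builds one on top of the standard expat binding.  It is
an illustration only, not pyRXP's code: the function name
parse_to_tuple_tree is made up, and storing text as plain strings in the
child list, an empty dict for "no attributes", and None in the spare slot
are all assumptions.  It also calls back into Python on every tag, so it
shows the data structure but none of the speed of doing the work in C.

```python
import xml.parsers.expat


def parse_to_tuple_tree(text):
    """Parse XML text into nested (tagname, attrs, children, spare) tuples.

    Hypothetical helper for illustration; conventions here are assumptions.
    """
    top_level = []          # receives the root element's tuple
    stack = [top_level]     # child lists of the currently open elements

    def start(name, attrs):
        children = []
        # The spare slot is left as None in this sketch.
        stack[-1].append((name, attrs, children, None))
        stack.append(children)

    def end(name):
        stack.pop()

    def chars(data):
        children = stack[-1]
        # expat may deliver text in several chunks; merge adjacent ones.
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(text, True)
    return top_level[0]


tree = parse_to_tuple_tree('<a x="1"><b>hi</b><c/></a>')
tagname, attrs, children, spare = tree
print(tagname, attrs, [child[0] for child in children])
```

Every node is an ordinary tuple and every child list an ordinary list, so
walking the tree needs only indexing and unpacking, with no per-node
object instances.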

Andy Robinson
CEO/Chief Architect, Reportlab Inc.