[XML-SIG] PyXPath 1.1
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Thu, 14 Dec 2000 12:09:43 +0100
As promised earlier, I tried to use another alternative parser toolkit
for parsing XPath LocationPath expressions. Since Uche proposed to use
Spark, that's what I did.
As I result, I can now announce PyXPath 1.1, which is available from
http://www.informatik.hu-berlin.de/~loewis/xml/PyXPath-1.1.tgz
The major change over the previous version is the xpathspark module,
which requires spark.py from Spark 0.6.1 (I decided not to include
spark; let me know if you think I should).
As with the YAPPS parser, the Spark parser generates an ad-hoc syntax
tree, namely a nested list consisting of the rhs tupel for each
production that was applied. Only in trivial cases, I modified the
list, such as unwrapping a list of a single item.
With the three parsers which I now have (the YAPPS parser, 4XPath, and
the Spark parser), I performed some measurements. I took the list of
the LocationPath examples from the recommendation and asked each
parser to parse each expression 10 respectively 100 times. On a AMD K6
with 350 MHz, using Linux 2.4t7, glibc 2.2, and Python 2.0, I got the
following results:
10 iterations:
4XPath 1.58s
YAPPS 1.43s
YAPPS with pre 2.31s
Spark 12.58s
100 iterations:
4XPath 5.16s
YAPPS 12.35s
YAPPS with pre 22.54s
Spark 124.92s
In these numbers, "pre" is the PCRE regex module of 1.5.2, but still
executed in Python 2; the default is sre.
=46rom these numbers, I conclude:
- sre is significantly faster than pre, so Python 2.0 is better for
processing regular expressions than 1.5.2. Even when parsing from a
Unicode string, the parser does not get much slower (numbers not
shown here).
- Spark is an order of magnitude slower than YAPPS. The Spark
documentation suggests that the parsing algorithm used in Spark is
quite general, but also quite slow. YAPPS used a recursive-descent
LL(1) parsing, which seems to win easily.
- The pure Python solution takes twice as much time as bison/flex
solution. Note that for parsing a "small" number of expressions
(300), the startup time of 4XPath overweights the parsing time, so
the YAPPS parser is actually faster here. That may change once the
YAPPS parser generates the same structure as 4XPath. IMO, this
overhead is a fair price to pay for the increased portability, the
Unicode support and the thread-safety of the Python solution.
Unless somebody can suggest more parser generators to try (*), I'd now
proceed with making the YAPPS parser 4XPath compatible.
Regards,
Martin
(*) Be aware that any alternative parser generator should support:
- tokenization of Unicode strings, either via an external lexer, or on
its own using the re module
- support for LL(1) or LALR(1) grammars.
- ideally be pure Python, although an addition C module is acceptable
as long as the resulting parser is still thread-safe.