[XML-SIG] PyXPath 1.1

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 14 Dec 2000 12:09:43 +0100


As promised earlier, I tried to use another alternative parser toolkit
for parsing XPath LocationPath expressions. Since Uche proposed to use
Spark, that's what I did.

As a result, I can now announce PyXPath 1.1, which is available from

http://www.informatik.hu-berlin.de/~loewis/xml/PyXPath-1.1.tgz

The major change over the previous version is the xpathspark module,
which requires spark.py from Spark 0.6.1 (I decided not to include
spark; let me know if you think I should).

As with the YAPPS parser, the Spark parser generates an ad-hoc syntax
tree, namely a nested list consisting of the rhs tuple for each
production that was applied. Only in trivial cases did I modify the
list, e.g. to unwrap a list containing a single item.
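
For illustration, grammar rules in Spark 0.6's GenericParser style look
roughly like the following; this is only a sketch based on the Spark
documentation, and the production names are not the actual xpathspark
rules:

    from spark import GenericParser

    class SketchParser(GenericParser):
        def __init__(self, start='location_path'):
            GenericParser.__init__(self, start)

        def p_location_path(self, args):
            ' location_path ::= relative_location_path '
            # trivial case: unwrap the single-item right-hand side
            return args[0]

        def p_step(self, args):
            ' step ::= axis_specifier node_test '
            # general case: keep the rhs of the applied production
            return list(args)

Each rule's docstring carries the production; the method receives the
right-hand-side values and returns the value for the left-hand side, so
nesting these lists gives the kind of ad-hoc tree described above.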

With the three parsers that I now have (the YAPPS parser, 4XPath, and
the Spark parser), I performed some measurements. I took the list of
LocationPath examples from the recommendation and asked each parser to
parse each expression 10 and 100 times, respectively. On an AMD K6 at
350 MHz, running Linux 2.4t7, glibc 2.2, and Python 2.0, I got the
following results:

10 iterations:
4XPath                 1.58s
YAPPS                  1.43s
YAPPS with pre         2.31s
Spark                 12.58s

100 iterations:
4XPath                 5.16s
YAPPS                 12.35s
YAPPS with pre        22.54s
Spark                124.92s

In these numbers, "pre" is the PCRE-based regex module from Python
1.5.2, still executed under Python 2.0; the default regex engine in
Python 2.0 is sre.
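
A measurement of this kind can be reproduced with a loop along the
following lines (a sketch with hypothetical names, not the actual
benchmark script):

    import time

    def benchmark(parse, expressions, iterations):
        # parse is one of the parser entry points; expressions is the
        # list of LocationPath examples from the recommendation
        start = time.clock()
        for i in range(iterations):
            for expr in expressions:
                parse(expr)
        return time.clock() - start

    # e.g. benchmark(yapps_parse, examples, 10)
    #      benchmark(yapps_parse, examples, 100)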

From these numbers, I conclude:

- sre is significantly faster than pre, so Python 2.0 is better at
  processing regular expressions than 1.5.2 was. Even when parsing from
  a Unicode string, the parser does not get much slower (numbers not
  shown here).

- Spark is an order of magnitude slower than YAPPS. The Spark
  documentation suggests that the parsing algorithm used in Spark is
  quite general, but also quite slow. YAPPS uses recursive-descent
  LL(1) parsing, which seems to win easily here.

- The pure Python solution takes twice as much time as the bison/flex
  solution. Note that for parsing a "small" number of expressions
  (300), the startup time of 4XPath outweighs the parsing time, so the
  YAPPS parser is actually faster here. That may change once the YAPPS
  parser generates the same structure as 4XPath. IMO, this overhead is
  a fair price to pay for the increased portability, the Unicode
  support, and the thread-safety of the pure Python solution.

Unless somebody can suggest more parser generators to try (*), I'd now
proceed with making the YAPPS parser 4XPath-compatible.

Regards,
Martin

(*) Be aware that any alternative parser generator should:
- tokenize Unicode strings, either via an external lexer or on its own
  using the re module (see the sketch below);
- support LL(1) or LALR(1) grammars;
- ideally be pure Python, although an additional C module is acceptable
  as long as the resulting parser is still thread-safe.
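
As an illustration of the first point, a minimal re-based tokenizer
that works on byte and Unicode strings alike might look like this; the
token set is incomplete and the names are purely illustrative:

    import re

    _token = re.compile(
        r'(?P<name>[a-zA-Z_][\w.-]*)'              # names (ASCII only here)
        r'|(?P<number>\d+(?:\.\d*)?)'              # numbers
        r'|(?P<punct>::|//|/|\[|\]|@|\(|\)|,|\*)'  # operators, punctuation
        r'|(?P<ws>\s+)')                           # whitespace (skipped)

    def tokenize(expr):
        tokens, pos = [], 0
        while pos < len(expr):
            m = _token.match(expr, pos)
            if m is None:
                raise ValueError('lexical error at position %d' % pos)
            if m.lastgroup != 'ws':
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens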