[XML-SIG] SAX prettyprinter V2 and SGMLOP

Christian Tismer tismer@appliedbiometrics.com
Mon, 25 Jan 1999 22:58:04 +0100


Lars Marius Garshol wrote:
> 
> * Christian Tismer
> |
> | [About ot.xml]
> |
> | Interesting. I tested my Indenter with this file (what a nice
> | example).
> 
> A rather misleading one, I'm afraid, since it doesn't use entities,
> comments, PIs, marked sections or attributes, only elements and
> PCDATA.

Right, very simple.

> | It takes 11.75 seconds to indent this through SAX, using sgmlop.
> | With xmlproc, it takes 30.87 seconds.
> 
> Interesting. (And pleasing. :)

And then I wrote a simple plain-vanilla indenter in
pure Python which does the same in 5 seconds.
It just splits the text apart, finds the tags correctly,
counts nesting levels, and does nothing else at all.

I don't think this will get much faster by using sgmlop,
so the test you mentioned a while ago is obsolete.
Five seconds is what the indentation itself needs; the
rest is gymnastics which is useless in this case.
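The mail does not show the indenter itself, so the following is only a hypothetical sketch of what such a "split, find tags, count levels" approach could look like; the function name and the handling of self-closing and declaration tags are my assumptions, not Tismer's code.

```python
# Hypothetical sketch of a plain-vanilla indenter: the actual code is
# not shown in the mail, so this only illustrates the idea of splitting
# the text on '<', tracking the nesting level, and doing nothing else.

def indent_xml(text, step="  "):
    """Re-indent simple XML (elements and PCDATA only, as in ot.xml)."""
    out = []
    level = 0
    for chunk in text.split("<"):
        if not chunk:
            continue
        tag, _, trailing = chunk.partition(">")
        if tag.startswith("/"):            # closing tag: dedent first
            level -= 1
        out.append(step * level + "<" + tag + ">")
        if not (tag.startswith(("/", "?", "!")) or tag.endswith("/")):
            level += 1                     # opening tag: indent what follows
        if trailing.strip():               # PCDATA between tags
            out.append(step * level + trailing.strip())
    return "\n".join(out)
```

This is exactly the kind of shortcut that only works on a document like ot.xml: it would be confused by attributes containing '<' or '>', CDATA sections, and comments, which a real parser handles for you.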

> | Running the whole text through sgmlop without any associated events
> | ran in below one second.
> 
> It's worth noting that this is just the time for the raw parse. As far
> as I know, sgmlop will not call handlers if there aren't any and so
> this entire second will be spent in C source.

Right, this is the "naked" time.
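sgmlop itself may not be at hand today, but the effect described here can be reproduced with the standard-library expat binding: with no handlers registered, the parse stays almost entirely in C, while a trivial Python callback per element forces a C-to-Python transition on every tag. This is only an analogous illustration, not a reconstruction of the original sgmlop measurement.

```python
# Hedged illustration of the "naked" parse time: when no handlers are
# registered, the parser never calls back into Python, so the whole
# parse runs in C.  Uses stdlib expat in place of sgmlop.
import time
import xml.parsers.expat

doc = "<root>" + "<item>text</item>" * 10000 + "</root>"

def timed_parse(with_handlers):
    p = xml.parsers.expat.ParserCreate()
    if with_handlers:
        # One Python callback per start tag: 10001 C -> Python calls.
        p.StartElementHandler = lambda name, attrs: None
    t0 = time.perf_counter()
    p.Parse(doc, True)
    return time.perf_counter() - t0

raw = timed_parse(False)
handled = timed_parse(True)
print("raw: %.4fs  with handlers: %.4fs" % (raw, handled))
```

On any interpreter the handler-driven run pays per-event callback overhead on top of the raw parse, which is why the event-free sgmlop run finished in under a second while the full SAX pipeline took over eleven.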

> | I want to validate small amounts of newly added data "records" which
> | are in XML format, but then kept in a special repository, and I want
> | to be able to re-import large amounts of XML which were exported by
> | my app before. This means I need a validating parser of acceptable
> | speed, and I think xmlproc is a good candidate?
> 
> I think the Java parsers are probably faster, but xmlproc should be
> acceptable, yes.
> 
> When I release 0.60 the DTD parser and DTD objects are separated from
> the XML parser. This means that provided you can get the external and
> internal DTD subsets from expat it's possible to build an expat-based
> validator using the xmlproc sources. This will require a bit of work,
> though.
> 
> With DTD caching (scheduled for 0.61 in my current plans) you won't
> have to keep reparsing the DTD for each document either, thus saving
> even more speed. (Parse times for large DTDs such as TEI-XML take
> substantial amounts of time.)

I'm happy to hear this.
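The DTD caching Garshol describes amounts to memoizing the parsed DTD object by its system identifier, so that a large DTD such as TEI-XML is parsed once rather than per document. A generic sketch of the idea follows; `parse_dtd` is a stand-in for the expensive parse, not xmlproc's real API.

```python
# Generic sketch of DTD caching: parse each DTD only once, keyed by its
# system identifier, and reuse the parsed object for later documents.
# parse_dtd is a placeholder for the real (expensive) DTD parse.
_dtd_cache = {}

def parse_dtd(sysid):
    # Stand-in for parsing the external DTD subset (e.g. TEI-XML).
    return {"sysid": sysid, "elements": {}}

def get_dtd(sysid):
    """Return a cached DTD object, parsing only on first request."""
    if sysid not in _dtd_cache:
        _dtd_cache[sysid] = parse_dtd(sysid)
    return _dtd_cache[sysid]
```

Since validating many small "records" against the same DTD was exactly the use case above, this is where the caching pays off most: the per-document cost drops to the instance parse alone.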

cheers - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.skyport.net
10553 Berlin                 :     PGP key -> http://pgp.ai.mit.edu/
     we're tired of banana software - shipped green, ripens at home