[XML-SIG] parsers and XML

travish travish@realtime.net
Thu, 10 Aug 2000 14:40:08 -0500 (CDT)

> | a) most of the XML "parsers" appear to be lexers
> You mean, since they don't build complete document trees?

I mean since they appear to be lexers:

lexer -->
lexical analyser
<language> (Or "scanner") The initial input stage of a language
processor (e.g. a compiler), the part that performs lexical analysis.

lexical analysis
<programming> (Or "linear analysis", "scanning") The first stage
of processing a language. The stream of characters making up the
source program or other input is read one at a time and grouped
into lexemes (or "tokens") - word-like pieces such as keywords,
identifiers, literals and punctuation. The lexemes are then passed
to the parser.

["Compilers - Principles, Techniques and Tools", by Alfred V. Aho,
Ravi Sethi and Jeffrey D. Ullman, pp. 4-5]

> This is so
> because XML has a much simpler structure (and potentially much greater
> sizes) than what parsers traditionally have parsed.

I'm not so sure; I've compiled very large C files before.

> This makes an event-based API very useful.

The "event-based API" bears a striking resemblance to a lexer, and is
usually only useful if you do a certain amount of state-tracking yourself.
(e.g. how many levels of tags deep am I, and which tags are they?)
That is the traditional role of a parser, and the "event-driven API" apparently
does none of it.
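
To make the bookkeeping concrete, here is a minimal sketch, assuming the
xml.sax module from the Python standard library. PathTracker and tag_paths
are names made up for illustration, not part of any package. Note that the
handler has to maintain the stack of open tags itself:

```python
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Hypothetical handler: records the full tag path at each start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # path observed at every start-tag event

    def startElement(self, name, attrs):
        # The event only carries the tag name; depth and ancestry are ours to track.
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()


def tag_paths(xml_bytes):
    """Parse xml_bytes and return the path seen at each start tag."""
    handler = PathTracker()
    xml.sax.parseString(xml_bytes, handler)
    return handler.paths


print(tag_paths(b"<a><b><c/></b><b/></a>"))
# prints ['a', 'a/b', 'a/b/c', 'a/b']
```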

> In Python we have so far chosen to make tree building separate utilities.

And reasonably so.

> If you want a document tree, look at 4DOM or qp_xml.

Actually, I want something between the two APIs that appear to be present
(lexing and generating an AST).  For example, in the reduce phase
of a shift-reduce parser like yacc (which corresponds to a close-tag
event from an "event driven API"), one is given the ability to
'condense' all of the subtrees of this particular node, requiring
neither a full AST nor manual tracking of the stack of nested tags
you are currently inside.  This would be extremely handy
for (e.g.) converting XML to nested data structures.
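
Roughly what I have in mind, as a sketch only: it assumes the standard
xml.parsers.expat module, and reduce_parse and to_nested are hypothetical
names invented here. The helper keeps the stack internally and calls a
user-supplied function at each close tag with the already-reduced children,
much like a yacc action sees the values of its right-hand side:

```python
import xml.parsers.expat


def reduce_parse(xml_text, reduce_node):
    """Call reduce_node(name, attrs, children, text) at every close tag.

    'children' holds whatever reduce_node returned for the child elements,
    so the caller never sees the tag stack or a complete tree.
    """
    top = []                           # receives the reduced root element
    stack = [(None, None, top, [])]    # frames: (name, attrs, children, text)

    def start(name, attrs):
        stack.append((name, attrs, [], []))

    def chars(data):
        stack[-1][3].append(data)

    def end(name):
        tag, attrs, children, text = stack.pop()
        value = reduce_node(tag, attrs, children, "".join(text).strip())
        stack[-1][2].append(value)     # hand the result up to the parent

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return top[0]


def to_nested(name, attrs, children, text):
    """Example reduction: leaves become text, other nodes become dicts."""
    if not children:
        return (name, text)
    return (name, dict(children))


print(reduce_parse("<cfg><host>h</host><port>80</port></cfg>", to_nested))
# prints ('cfg', {'host': 'h', 'port': '80'})
```

This example reduction ignores attributes and lets repeated child names
overwrite one another; a real one would have to decide what to do with both.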

> | b) none of the examples are of sufficient/substantial complexity
> |    (e.g. recursive nesting, deep/complex hierarchy)
> | 
> |    If anyone has suggestions on what kind of parser to use as a back
> |    end (yapps?  kjParsing?  etc.) I'd be interested to hear it.
> I don't understand this question.

Meaning, how does one utilize the existing "real" parsers to quickly and
robustly do the work which seems to be required by the "event-driven API",
namely keeping track of which tags one is in, and correlating those to
actions to take.  This is a solved problem, and has been so for decades.

All of the examples I've seen have a fixed, shallow tag hierarchy and so
are toy problems which don't encounter these complexities.

> The diffs seem to be for the pyexpat driver. This has nothing to do
> with sgmlop or xmllib. 

Perhaps you should look a little more carefully before sending back such
a pointed response.

> What is the problem with the description?

For one thing, it appears that the character accumulation callback has
a different signature than the other parsers, passing only one argument
instead of three (charstr, start, len).  If so, that hardly makes sgmlop
replace the other parsers invisibly.
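
If that reading is right, client code would need a shim along these lines.
This is a sketch under assumptions: the one-argument callback is called
handle_data here (the xmllib-style name), the three-argument one is called
characters, and both classes are hypothetical stand-ins rather than
anything shipped with sgmlop:

```python
class CharacterCallbackAdapter:
    """Hypothetical shim: one-argument data callback in, three-argument out."""

    def __init__(self, handler):
        self.handler = handler

    def handle_data(self, data):
        # Forward the whole string with an explicit start offset and length.
        self.handler.characters(data, 0, len(data))


class Recorder:
    """Stand-in for a handler written against the (charstr, start, len) form."""

    def __init__(self):
        self.chunks = []

    def characters(self, charstr, start, length):
        self.chunks.append(charstr[start:start + length])


recorder = Recorder()
CharacterCallbackAdapter(recorder).handle_data("hello")
print(recorder.chunks)
# prints ['hello']
```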

Those who will not reason, are bigots, those who cannot,
    are fools, and those who dare not, are slaves.
       - George Gordon Noel Byron (1788-1824)