[XML-SIG] parsers and XML
travish
travish@realtime.net
Thu, 10 Aug 2000 14:40:08 -0500 (CDT)
> | a) most of the XML "parsers" appear to be lexers
>
> You mean, since they don't build complete document trees?
I mean since they appear to be lexers:
http://nightflight.com/cgi-bin/foldoc.cgi?query=lexer
lexer -->
lexical analyser
<language> (Or "scanner") The initial input stage of a language
processor (e.g. a compiler), the part that performs lexical analysis.
http://nightflight.com/cgi-bin/foldoc.cgi?lexical+analysis
lexical analysis
<programming> (Or "linear analysis", "scanning") The first stage
of processing a language. The stream of characters making up the
source program or other input is read one at a time and grouped
into lexemes (or "tokens") - word-like pieces such as keywords,
identifiers, literals and punctuation. The lexemes are then passed
to the parser.
["Compilers - Principles, Techniques and Tools", by Alfred V. Aho,
Ravi Sethi and Jeffrey D. Ullman, pp. 4-5]
> This is so
> because XML has a much simpler structure (and potentially much greater
> sizes) than what parsers traditionally have parsed.
I'm not so sure; I've compiled very large C files before.
> This makes an event-based API very useful.
The "event-based API" bears a striking resemblance to a lexer, and is
usually only useful if you do a certain amount of state-tracking yourself.
(e.g. how many levels of tags deep am I, and which tags are they?)
That is the traditional role of a parser, and the "event-driven API" apparently
does none of it.
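To make the bookkeeping concrete, here is a minimal sketch of the
state-tracking a caller ends up writing on top of an event-based API.
It assumes the standard-library xml.parsers.expat module; trace_paths
and its variables are names made up for this illustration.

```python
import xml.parsers.expat

def trace_paths(xml_text):
    """Return the path of open tags seen at each start-tag event."""
    stack = []   # currently open tags, outermost first
    paths = []   # one "a/b/c" style entry per start tag

    def start(name, attrs):
        # The API reports only the tag name; depth and ancestry are ours to keep.
        stack.append(name)
        paths.append("/".join(stack))

    def end(name):
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return paths

print(trace_paths("<a><b><c/></b><b/></a>"))
```

The events themselves carry no context, so the stack is the only way
for a handler to answer "how deep am I, and inside which tags?".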
> In Python we have so far chosen to make tree building separate utilities.
And reasonably so.
> If you want a document tree, look at 4DOM or qp_xml.
Actually, I want something between the two APIs that appear to be present
(lexing and generating an AST). For example, in the reduce phase
of a shift-reduce parser like yacc (which corresponds to a close-tag
event from an "event driven API"), one is given the ability to
'condense' all of the subtrees of this particular node, requiring
neither a full AST nor manual tracking of the stack of nested tags
one is currently inside. This would be extremely handy
for (e.g.) converting XML to nested data structures.
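A rough sketch of what such an in-between layer might look like,
assuming xml.parsers.expat as the event source. The names reduce_xml,
reducers and to_nested are hypothetical, not part of any existing
package. The helper keeps the value stack itself (much as yacc does),
and each reducer sees only the already-condensed children of its node.

```python
import xml.parsers.expat

def reduce_xml(xml_text, reducers, default):
    """Call a reducer at each close tag with the already-reduced children."""
    # One frame per open element: [attrs, reduced children, text chunks].
    # The first frame is a sentinel that receives the root's value.
    frames = [[{}, [], []]]

    def start(name, attrs):
        frames.append([attrs, [], []])

    def chars(data):
        frames[-1][2].append(data)

    def end(name):
        # The "reduce" step: condense this node and hand the result upward.
        attrs, children, text = frames.pop()
        reducer = reducers.get(name, default)
        value = reducer(name, attrs, children, "".join(text).strip())
        frames[-1][1].append(value)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return frames[0][1][0]

def to_nested(name, attrs, children, text):
    """Default reducer: leaves keep their text, interior nodes keep children."""
    return (name, children) if children else (name, text)

doc = "<list><item>1</item><list><item>2</item></list></list>"
reducers = {
    "item": lambda name, attrs, children, text: int(text),
    "list": lambda name, attrs, children, text: children,
}
print(reduce_xml(doc, reducers, to_nested))
```

With these reducers the recursively nested document condenses to the
Python list [1, [2]], and no reducer ever looks at the tag stack.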
> | b) none of the examples are of sufficient/substantial complexity
> | (e.g. recursive nesting, deep/complex hierarchy)
> |
> | If anyone has suggestions on what kind of parser to use as a back
> | end (yapps? kjParsing? etc.) I'd be interested to hear it.
>
> I don't understand this question.
Meaning, how does one use the existing "real" parsers to quickly and
robustly do the work which seems to be required by the "event-driven API",
namely keeping track of which tags one is in, and correlating those to
actions to take? This is a solved problem, and has been so for decades.
All of the examples I've seen have a fixed, shallow tag hierarchy and so
are toy problems which don't encounter these complexities.
> The diffs seem to be for the pyexpat driver. This has nothing to do
> with sgmlop or xmllib.
Perhaps you should look a little more carefully before sending back such
a pointed response.
> What is the problem with the description?
For one thing, it appears that the character accumulation callback has
a different signature than in the other parsers, passing only one argument
instead of three (charstr, start, len). If so, sgmlop is hardly an
invisible drop-in replacement for the other parsers.
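If the signatures really do differ that way, a small shim would be
needed before existing handlers could be reused. A sketch, assuming
only what is described above (one string argument versus charstr,
start, len); adapt_char_handler and handle_data are hypothetical names
and are not taken from sgmlop or any other driver.

```python
def adapt_char_handler(three_arg_handler):
    """Wrap a (charstr, start, length) callback so it accepts one string."""
    def one_arg_handler(data):
        # The whole string is the payload, so the slice covers all of it.
        three_arg_handler(data, 0, len(data))
    return one_arg_handler

chunks = []

def handle_data(charstr, start, length):
    # Existing-style handler: only the indicated slice is meaningful.
    chunks.append(charstr[start:start + length])

adapted = adapt_char_handler(handle_data)
adapted("hello")
print(chunks)
```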
--
Those who will not reason, are bigots, those who cannot,
are fools, and those who dare not, are slaves.
- George Gordon Noel Byron (1788-1824)