xml processing : too slow...

Alex Martelli aleax at aleax.it
Thu Jul 25 10:06:14 EDT 2002


Shagshag13 wrote:

> i think i still miss something as i get :
> 
>>>> line = 'this <tag1><tag2>is</tag2></tag1> an example of the kind of
>>>> <tag3>line</tag3> that <tag3 attribute="this is a value">i
> could have</tag3> and that i must check (could also contain 0-9, $ and
> punctuations)...'

This line, by itself, is not a well formed XML document.

> this time line is really representative of the kind of suff i had
> (sentences with tag) (i had written a wrapper to handle this with
> find('<'), find('>'), and i check well formedness by using a stack...)
> 
> but i would really understand why it doesn't work !

I would suggest you study some XML.  A well formed XML document
has, among its requirements, a single top-level element that
envelops all other elements and textual content.

The above example line does not meet this specification.

That's part of the reason I keep asking you for *SPECS* rather
than the *EXAMPLES* you keep giving.

Anyway, from the above line you can obtain a well-formed
XML document by wrapping it all in start and end tags for
a fictitious XML toplevel element, e.g.:

p.parse('<fict>%s</fict>' % line, 1)

should be satisfactory for checking this kind of "sort of
well-formedness", unless there are yet more specs as yet
unexpressed.

Not sure how you intend to check well-formedness by just
finding open and closed triangular brackets and a stack.

How would that help you diagnosed e.g.
        <bah thisis=notvalid>of course not</bah>
as not being well formed?  This is not well formed because
it lacks quotes around an attribute's value.  Or:
        <bah thisis="notvalid">&either</bah>
now THIS is not well formed because reference '&either'
is not terminated with a semicolon.  Etc, etc.

expat, of course, has no trouble diagnosing any of these.


If there are yet more kinds of non-well-formedness that you
need to tolerate, besides the lack of a single top-level
element, then of course you should not use an XML parser,
since your needs become very far from XML's specifications.

But _we_ can't know, unless you DO tell us the specs.


Alex




More information about the Python-list mailing list