xml processing : too slow...
Alex Martelli
aleax at aleax.it
Thu Jul 25 10:06:14 EDT 2002
Shagshag13 wrote:
> i think i still miss something as i get :
>
> >>> line = 'this <tag1><tag2>is</tag2></tag1> an example of the kind of
> <tag3>line</tag3> that <tag3 attribute="this is a value">i
> could have</tag3> and that i must check (could also contain 0-9, $ and
> punctuations)...'
This line, by itself, is not a well formed XML document.
> this time line is really representative of the kind of stuff i had
> (sentences with tag) (i had written a wrapper to handle this with
> find('<'), find('>'), and i check well formedness by using a stack...)
>
> but i would really like to understand why it doesn't work !
I would suggest you study some XML. A well formed XML document
has, among its requirements, a single top-level element that
envelops all other elements and textual content.
The above example line does not meet this specification.
That's part of the reason I keep asking you for *SPECS* rather
than the *EXAMPLES* you keep giving.
Anyway, from the above line you can obtain a well-formed
XML document by wrapping it all in start and end tags for
a fictitious XML toplevel element, e.g.:
p.Parse('<fict>%s</fict>' % line, 1)
should be satisfactory for checking this kind of "sort of
well-formedness", unless there are yet more specs as yet
unexpressed.
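
A minimal, self-contained sketch of that wrapping check follows. It
assumes p is a parser from xml.parsers.expat.ParserCreate() and uses
current Python 3 syntax; the helper names parses and
is_well_formed_fragment are made up for this illustration, and the
sample line is the one quoted above, joined with single spaces.

```python
import xml.parsers.expat


def parses(document):
    """Return True if expat accepts `document` as well-formed XML."""
    # A fresh parser is needed for each document: an expat parser
    # cannot be reused once it has been fed its final chunk.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True: this is the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


def is_well_formed_fragment(fragment, wrapper="fict"):
    """Check a fragment by wrapping it in a fictitious top-level element.

    `wrapper` is an arbitrary element name chosen for this sketch.
    """
    return parses("<%s>%s</%s>" % (wrapper, fragment, wrapper))


line = ('this <tag1><tag2>is</tag2></tag1> an example of the kind of '
        '<tag3>line</tag3> that <tag3 attribute="this is a value">i '
        'could have</tag3> and that i must check (could also contain '
        '0-9, $ and punctuations)...')

print("bare line parses:   ", parses(line))
print("wrapped line parses:", is_well_formed_fragment(line))
```

The bare line is rejected because its text and elements are not
enclosed in a single top-level element; the wrapped one is accepted.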
Not sure how you intend to check well-formedness by just
finding open and closed triangular brackets and a stack.
How would that help you diagnose e.g.
<bah thisis=notvalid>of course not</bah>
as not being well formed? This is not well formed because
it lacks quotes around an attribute's value. Or:
<bah thisis="notvalid">&either</bah>
now THIS is not well formed because reference '&either'
is not terminated with a semicolon. Etc, etc.
expat, of course, has no trouble diagnosing any of these.
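
To make the contrast concrete, here is a rough sketch. The function
naive_stack_check is only a guess at what a find('<') / find('>')
plus stack approach might look like (the original wrapper was never
posted, so it is purely hypothetical), while diagnose simply hands the
text to expat and returns its error message, if any.

```python
import xml.parsers.expat


def naive_stack_check(text):
    """Hypothetical bracket-and-stack checker: matches tag names only."""
    stack = []
    pos = 0
    while True:
        start = text.find('<', pos)
        if start == -1:
            break
        end = text.find('>', start)
        if end == -1:
            return False            # '<' with no closing '>'
        tag = text[start + 1:end].strip()
        pos = end + 1
        if tag.startswith('/'):     # end tag: must match the last start tag
            if not stack or stack.pop() != tag[1:].strip():
                return False
        else:                       # start tag: remember its name only
            parts = tag.split()
            if not parts:
                return False
            stack.append(parts[0])
    return not stack                # every start tag must have been closed


def diagnose(document):
    """Return expat's error message for `document`, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


samples = ['<bah thisis=notvalid>of course not</bah>',
           '<bah thisis="notvalid">&either</bah>']

for doc in samples:
    print(doc)
    print('  naive check accepts:', naive_stack_check(doc))
    print('  expat says:', diagnose(doc))
```

The naive check accepts both samples because it never looks inside a
tag or at the text between tags; expat reports an error for each.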
If there are yet more kinds of non-well-formedness that you
need to tolerate, besides the lack of a single top-level
element, then of course you should not use an XML parser,
since your needs become very far from XML's specifications.
But _we_ can't know, unless you DO tell us the specs.
Alex