[Tutor] program that processes tokenized words in xml
pan@uchicago.edu
Tue May 6 19:45:01 2003
Quoting Alan Gauld <alan.gauld@blueyonder.co.uk>:
> I haven't checked but how does it handle recursive definitions?
> Like this, say:
>
> <person>
> <name>Jon</name>
> <son><person>
> <name>Fred</name>
> <son>None</son>
> </person>
> </son>
> </person>
>
> That's usually where regex based parsing of XML falls flat.
>
> Alan g.
Alan,
I think I misunderstood your question just now. You asked "how" it
handles this, not "whether" it can.
[1] First of all, the XML doc is split into a list:
<person>
<name>
Jon
</name>
<son>
<person>
<name>
Fred
</name>
<son>
None
</son>
</person>
</son>
</person>
[2] Then the list is read item by item. A 'state-holder', pairs, is
used to track how many <tag></tag> pairs are currently open.
[3] When a head tag is found (like <person>, but not </person> or
Jon), 'pairs' increases by 1;
[4] when a tail tag is found (</person>, but not <person> or Jon), 'pairs'
decreases by 1:
if item == currentHead: pairs += 1
elif item == currentTail: pairs -= 1
[5] When pairs reaches 0, that </person> is the closing tail tag for
the current head tag. This simple trick makes sure that it won't mistake
an internal </person> for the closing </person>.
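To make steps [1] to [5] concrete, here is a minimal sketch of the idea, not the actual program. The function names split_tags and find_closing are made up for this example, and it assumes tags carry no attributes, so a head tag can be matched by plain string equality:

```python
import re

def split_tags(xml_text):
    """Step [1]: split an XML string into a flat list of tags and text."""
    # The capturing group makes re.split keep the tags in the result.
    parts = re.split(r"(<[^>]+>)", xml_text)
    # Drop the empty strings and whitespace between adjacent tags.
    return [p.strip() for p in parts if p.strip()]

def find_closing(items, start):
    """Steps [2]-[5]: index of the tail tag closing the head tag at items[start]."""
    current_head = items[start]              # e.g. "<person>"
    current_tail = "</" + current_head[1:]   # e.g. "</person>"
    pairs = 0                                # the 'state-holder'
    for i in range(start, len(items)):
        item = items[i]
        if item == current_head:
            pairs += 1
        elif item == current_tail:
            pairs -= 1
            if pairs == 0:
                return i                     # this tail closes items[start]
    return None                              # no matching tail tag found

doc = ("<person><name>Jon</name><son><person><name>Fred</name>"
       "<son>None</son></person></son></person>")
items = split_tags(doc)
print(find_closing(items, 0))   # outer <person>: prints 14
print(find_closing(items, 5))   # inner <person>: prints 12
```

Starting from the outer <person>, the inner </person> at index 12 only brings pairs back down to 1, so it is skipped and the search continues to index 14.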
pan