[Tutor] program that processes tokenized words in xml

pan@uchicago.edu pan@uchicago.edu
Tue May 6 19:45:01 2003


Quoting Alan Gauld <alan.gauld@blueyonder.co.uk>:

> I haven't checked but how does it handle recursive definitions?
> Like this, say:
>
> <person>
>    <name>Jon</name>
>    <son><person>
>                 <name>Fred</name>
>                 <son>None</son>
>               </person>
>    </son>
>  </person>
>
> That's usually where regex based parsing of XML falls flat.
>
> Alan g.


Alan,

I think I misunderstood your question just now. You asked "how" it
handles them, not "if it can."

[1] First of all, the xml doc is split into a list:

<person>
<name>
Jon
</name>
<son>
<person>
<name>
Fred
</name>
<son>
None
</son>
</person>
</son>
</person>
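
A minimal sketch of how the split in [1] might be done. This is not
the actual code from my program; the name split_tags and the regular
expression are only assumptions for illustration, and it assumes
plain tags with no '>' inside attribute values:

```python
import re

def split_tags(xml_text):
    # Hypothetical helper: split on anything that looks like <...>,
    # keeping the tags themselves (capturing group), then drop the
    # empty and whitespace-only pieces.
    pieces = re.split(r'(<[^>]+>)', xml_text)
    return [p.strip() for p in pieces if p.strip()]

doc = ("<person><name>Jon</name><son><person><name>Fred</name>"
       "<son>None</son></person></son></person>")
items = split_tags(doc)
for item in items:
    print(item)
```

Printing the items one per line gives the list shown above.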

[2] Then the list is read one item at a time. A 'state-holder', pairs,
keeps track of how many <tag></tag> pairs are currently open.

[3] When a header tag is found (like <person>, but not </person> or
Jon), 'pairs' increases by 1;

[4] when a trailing tag is found (</person>, but not <person> or Jon),
'pairs' decreases by 1:

            if   item == currentHead: pairs += 1
            elif item == currentTail: pairs -= 1

[5] When pairs reaches 0, that </person> is the closing tail tag of the
current element. This simple trick makes sure that it won't mistake
an internal </person> for the closing </person>.
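
Steps [2] to [5] could be sketched roughly like this. It is only an
illustration under some assumptions: the function name find_closing
and the snake_case names are made up here, and the tags are assumed
to carry no attributes, so the tail tag can be built from the head
tag directly:

```python
def find_closing(items, start):
    """Return the index of the tag that closes items[start], or None."""
    current_head = items[start]              # e.g. '<person>'
    current_tail = '</' + current_head[1:]   # e.g. '</person>'
    pairs = 0
    for i in range(start, len(items)):
        item = items[i]
        if item == current_head:
            pairs += 1                       # one more open pair
        elif item == current_tail:
            pairs -= 1                       # one pair closed
            if pairs == 0:                   # back to 0: the real closing tag
                return i
    return None                              # unbalanced input

items = ['<person>', '<name>', 'Jon', '</name>', '<son>',
         '<person>', '<name>', 'Fred', '</name>', '<son>', 'None',
         '</son>', '</person>', '</son>', '</person>']

outer = find_closing(items, 0)   # closing tag of the outer <person>
inner = find_closing(items, 5)   # closing tag of the nested <person>
```

For the outer <person> the counter goes 1, 2, 1, 0, so the internal
</person> (where pairs drops back to 1) is skipped and the last item
in the list is returned.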

pan