[Tutor] program that processes tokenized words in xml

Magnus Lyckå magnus@thinkware.se
Tue May 6 10:39:00 2003


At 06:30 2003-05-06 -0700, Abdirizak abdi wrote:
>Can anyone suggest how I can incorporate a  regular expression for
>eliminating these tags?

Don't!

Regular expressions are not the right tool for the task if we are talking
about XML parsing. There are finely crafted tools particularly for XML in
Python. Use them instead.

Imagine you have XML data in a string like this and want to extract the
names of the persons (but not the animals):

data='''<stuff>
  <person>
   <name>John Cleese</name><function>Funny</function>
   <name>Basil Fawlty</name>
  </person>

  <animal><name>Wanda</name><function>Fish</function></animal>

  <person>
   <name>Eric Idle</name>
   <function>Funny</function>
  </person>
</stuff>'''

Then we can do something like this...

import xml.parsers.expat

isPerson = False
isName = False

def start_element(name, attrs):
     global isPerson, isName
     if name == 'person':
         isPerson = True
     elif name == 'name':
         isName = True

def end_element(name):
     global isPerson, isName
     if name == 'person':
         isPerson = False
     elif name == 'name':
         isName = False

def char_data(data):
     if isPerson and isName:
         print data

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

parser.Parse(data)

...and get:

John Cleese
Basil Fawlty
Eric Idle

This won't break if someone starts adding attributes to the
name tags, or if anyone decides to format the file differently,
so that the items you imagined were located on the same row,
is suddenly divided over three rows. If two files represent the
same content from an XML perspective, this program should also
extract the same data. I don't think you can ever fix that with
regular expressions. (At least it will be very hard work.)

A simple regular expression might well seem to solve the problem
for you, with less code than expat etc, but it will probably be
much more brittle than using a real XML parser.

I have made the assumption here that persons aren't nested inside
persons, and that names aren't nested inside names, but as long as
that's true, I think this should work as intended.


--
Magnus Lycka (It's really Lyck&aring;), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program