[Tutor] program that processes tokenized words in xml
Magnus Lyckå
magnus@thinkware.se
Tue May 6 10:39:00 2003
At 06:30 2003-05-06 -0700, Abdirizak abdi wrote:
>Can anyone suggest how I can incorporate a regular expression for
>eliminating these tags?
Don't!
Regular expressions are not the right tool for the task if we are talking
about XML parsing. Python comes with well-crafted tools made specifically
for XML. Use them instead.
Imagine you have XML data in a string like this and want to extract the
names of the persons (but not the animals):
data='''<stuff>
<person>
<name>John Cleese</name><function>Funny</function>
<name>Basil Fawlty</name>
</person>
<animal><name>Wanda</name><function>Fish</function></animal>
<person>
<name>Eric Idle</name>
<function>Funny</function>
</person>
</stuff>'''
Then we can do something like this...
import xml.parsers.expat

# Flags that track whether we are currently inside <person> / <name>
isPerson = False
isName = False

def start_element(name, attrs):
    global isPerson, isName
    if name == 'person':
        isPerson = True
    elif name == 'name':
        isName = True

def end_element(name):
    global isPerson, isName
    if name == 'person':
        isPerson = False
    elif name == 'name':
        isName = False

def char_data(data):
    # Only print text found inside a <name> that is inside a <person>
    if isPerson and isName:
        print(data)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data
parser.Parse(data)
...and get:
John Cleese
Basil Fawlty
Eric Idle
This won't break if someone starts adding attributes to the
name tags, or if anyone decides to format the file differently,
so that items you imagined were located on the same row are
suddenly divided over three rows. If two files represent the
same content from an XML perspective, this program should
extract the same data from both. I don't think you can ever
achieve that with regular expressions. (At the very least, it
would be very hard work.)
A simple regular expression might well seem to solve the problem
for you, with less code than expat, but it will probably be
much more brittle than a real XML parser.
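To make the brittleness concrete, here is a small sketch that compares a
naive pattern with expat on two documents that mean the same thing in XML
terms. The pattern, the helper name expat_person_names, the lang attribute
and the two sample strings are made up for the illustration. Unlike the
program above, the helper collects the names in a list and joins the text
chunks, since expat may deliver the text of one element in several pieces.

```python
import re
import xml.parsers.expat

def expat_person_names(xml_text):
    """Return the text of every <name> found inside a <person> (hypothetical helper)."""
    names = []
    chunks = []                      # text pieces of the <name> being read
    state = {"person": False, "name": False}

    def start_element(name, attrs):
        if name == "person":
            state["person"] = True
        elif name == "name":
            state["name"] = True
            del chunks[:]            # start collecting a fresh name

    def end_element(name):
        if name == "person":
            state["person"] = False
        elif name == "name":
            if state["person"]:
                names.append("".join(chunks))
            state["name"] = False

    def char_data(data):
        if state["person"] and state["name"]:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return names

# Two documents with the same content; the second adds an attribute
# and breaks the tags over several lines (still well-formed XML).
compact = "<stuff><person><name>John Cleese</name></person></stuff>"
reformatted = """<stuff><person>
<name
    lang="en">John Cleese</name
>
</person></stuff>"""

# A naive pattern, chosen only for illustration.
naive = re.compile(r"<name>(.*?)</name>")

print(naive.findall(compact))
print(naive.findall(reformatted))
print(expat_person_names(compact))
print(expat_person_names(reformatted))
```

The pattern finds the name in the compact string but nothing in the
reformatted one, while the expat version returns the same list for both.
You could of course keep patching the pattern, but each patch only covers
one more variation.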
I have made the assumption here that persons aren't nested inside
persons, and that names aren't nested inside names, but as long as
that's true, I think this should work as intended.
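If persons can be nested inside persons, the two flags are not enough,
because the first closing person tag switches isPerson off too early. One
possible way around that, sketched here under my own assumptions, is to keep
a stack of the currently open element names instead of flags. The function
name names_in_persons and the sample string are invented for the example,
and it still assumes that names are not nested inside names.

```python
import xml.parsers.expat

def names_in_persons(xml_text):
    """Return the text of every <name> that has a <person> ancestor (hypothetical helper)."""
    stack = []     # names of the currently open elements, outermost first
    chunks = []    # text pieces of the <name> being read
    names = []

    def start_element(name, attrs):
        stack.append(name)
        if name == "name":
            del chunks[:]

    def end_element(name):
        # The stack still contains the element that is being closed here.
        if name == "name" and "person" in stack:
            names.append("".join(chunks))
        stack.pop()

    def char_data(data):
        if stack and stack[-1] == "name" and "person" in stack:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return names

# Invented sample: a person nested inside a person, plus an animal.
nested = ("<stuff><person><name>Ann</name>"
          "<person><name>Bob</name></person>"
          "<name>Cy</name></person>"
          "<animal><name>Wanda</name></animal></stuff>")

print(names_in_persons(nested))
```

With the flag version, the name that follows the inner person would be
missed. With the stack, the outer person is still open at that point, so
the name is collected.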
--
Magnus Lycka (It's really Lyckå), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program