[Tutor] program that processes tokenized words in xml
Paul Tremblay
phthenry@earthlink.net
Tue May 6 16:50:01 2003
I like your code a lot! I have been using SAX to parse a lot of XML,
and was unaware you could use expat directly. I also didn't know that
you could use expat to parse just strings--rather than files.
I need to check that certain strings are valid in a parser I have
written. I had used regular expressions (knowing it was kind of a
kludge), but your code inspired this:
import xml.parsers.expat
parser = xml.parsers.expat.ParserCreate()
import sys
# simple function to validate the well-formedness of a string
def validate(data):
try:
parser.Parse(data)
return 0
except :
sys.stderr.write('tagging text will result in invalid XML\n')
return 1
data = '<tag>text</tag>'
valid = validate(data)
Is there a better way?
Also, how can I make my except statement more precise?
I tried:
except ExpatError:
with no success
Thanks
Paul
On Tue, May 06, 2003 at 04:39:45PM +0200, Magnus Lyckå wrote:
>
> At 06:30 2003-05-06 -0700, Abdirizak abdi wrote:
> >Can anyone suggest how I can incorporate a regular expression for
> >eliminating these tags?
>
> Don't!
>
> Regular expressions are not the right tool for the task if we are talking
> about XML parsing. There are finely crafted tools particularly for XML in
> Python. Use them instead.
>
> Imagine you have XML data in a string like this and want to extract the
> names of the persons (but not the animals):
>
> data='''<stuff>
> <person>
> <name>John Cleese</name><function>Funny</function>
> <name>Basil Fawlty</name>
> </person>
>
> <animal><name>Wanda</name><function>Fish</function></animal>
>
> <person>
> <name>Eric Idle</name>
> <function>Funny</function>
> </person>
> </stuff>'''
>
> Then we can do something like this...
>
> import xml.parsers.expat
>
> isPerson = False
> isName = False
>
> def start_element(name, attrs):
> global isPerson, isName
> if name == 'person':
> isPerson = True
> elif name == 'name':
> isName = True
>
> def end_element(name):
> global isPerson, isName
> if name == 'person':
> isPerson = False
> elif name == 'name':
> isName = False
>
> def char_data(data):
> if isPerson and isName:
> print data
>
> parser = xml.parsers.expat.ParserCreate()
> parser.StartElementHandler = start_element
> parser.EndElementHandler = end_element
> parser.CharacterDataHandler = char_data
>
> parser.Parse(data)
>
> ...and get:
>
> John Cleese
> Basil Fawlty
> Eric Idle
>
> This won't break if someone starts adding attributes to the
> name tags, or if anyone decides to format the file differently,
> so that the items you imagined were located on the same row,
> is suddenly divided over three rows. If two files represent the
> same content from an XML perspective, this program should also
> extract the same data. I don't think you can ever fix that with
> regular expressions. (At least it will be very hard work.)
>
> A simple regular expression might well seem to solve the problem
> for you, with less code than expat etc, but it will probably be
> much more brittle than using a real XML parser.
>
> I have made the assumption here that persons aren't nested inside
> persons, and that names aren't nested inside names, but as long as
> that's true, I think this should work as intended.
>
>
> --
> Magnus Lycka (It's really Lyckå), magnus@thinkware.se
> Thinkware AB, Sweden, www.thinkware.se
> I code Python ~ The shortest path from thought to working program
>
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************