[Tutor] program that processes tokenized words in xml

Paul Tremblay phthenry@earthlink.net
Tue May 6 16:50:01 2003


I like your code a lot! I have been using SAX to parse a lot of XML,
and was unaware you could use expat directly. I also didn't know that
you could use expat to parse just strings--rather than files.
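
For reference, a small sketch of the two entry points: Parse() takes a string, while ParseFile() takes any object with a read() method. The helper element_names and the sample document are made up for illustration, and io.BytesIO stands in for a real file here.

```python
import io
import xml.parsers.expat

def element_names(parse):
    """Run parse(parser) on a fresh parser; return the element names seen."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parse(parser)
    return names

xml_text = '<doc><item/><item/></doc>'

# Parse a string directly; True marks it as the final chunk of input.
from_string = element_names(lambda p: p.Parse(xml_text, True))

# Parse from a file-like object opened in binary mode.
from_file = element_names(
    lambda p: p.ParseFile(io.BytesIO(xml_text.encode('utf-8'))))
```

Both calls should report the same elements, since only the way the data reaches expat differs.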

I need to check that certain strings are valid in a parser I have
written. I had used regular expressions (knowing it was kind of a
kludge), but your code inspired this:

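import sys
import xml.parsers.expat

# Simple function to check whether a string is well-formed XML.
# Returns 0 if the string parses cleanly, 1 otherwise.
def validate(data):
    # An expat parser can parse only one document, so create one per call.
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk, so that
        # unclosed tags are reported as errors too.
        parser.Parse(data, True)
        return 0
    except:
        sys.stderr.write('tagging text will result in invalid XML\n')
        return 1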

data = '<tag>text</tag>'
valid = validate(data)

Is there a better way?

Also, how can I make my except statement more precise?

I tried:

except ExpatError:

with no success
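
A likely cause is that ExpatError is defined inside the xml.parsers.expat module, so with a plain "import xml.parsers.expat" the bare name ExpatError is not in scope and has to be qualified. A minimal sketch of that, assuming a fresh parser per call (an expat parser object can parse only one document); the function name is_well_formed is made up for illustration:

```python
import xml.parsers.expat

def is_well_formed(data):
    """Return True if data parses as a complete, well-formed XML document."""
    # Create a new parser each time; a finished parser cannot be reused.
    parser = xml.parsers.expat.ParserCreate()
    try:
        # True marks this as the final chunk, so unclosed tags are errors.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        # Only parse errors are caught; other exceptions still propagate.
        return False
    return True
```

Catching only ExpatError means a typo or a wrong argument type elsewhere in the function is not silently reported as "invalid XML".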

Thanks

Paul


On Tue, May 06, 2003 at 04:39:45PM +0200, Magnus Lyckå wrote:
> 
> At 06:30 2003-05-06 -0700, Abdirizak abdi wrote:
> >Can anyone suggest how I can incorporate a  regular expression for
> >eliminating these tags?
> 
> Don't!
> 
> Regular expressions are not the right tool for the task if we are talking
> about XML parsing. There are finely crafted tools particularly for XML in
> Python. Use them instead.
> 
> Imagine you have XML data in a string like this and want to extract the
> names of the persons (but not the animals):
> 
> data='''<stuff>
>  <person>
>   <name>John Cleese</name><function>Funny</function>
>   <name>Basil Fawlty</name>
>  </person>
> 
>  <animal><name>Wanda</name><function>Fish</function></animal>
> 
>  <person>
>   <name>Eric Idle</name>
>   <function>Funny</function>
>  </person>
> </stuff>'''
> 
> Then we can do something like this...
> 
> import xml.parsers.expat
> 
> isPerson = False
> isName = False
> 
> def start_element(name, attrs):
>     global isPerson, isName
>     if name == 'person':
>         isPerson = True
>     elif name == 'name':
>         isName = True
> 
> def end_element(name):
>     global isPerson, isName
>     if name == 'person':
>         isPerson = False
>     elif name == 'name':
>         isName = False
> 
> def char_data(data):
>     if isPerson and isName:
>         print(data)
> 
> parser = xml.parsers.expat.ParserCreate()
> parser.StartElementHandler = start_element
> parser.EndElementHandler = end_element
> parser.CharacterDataHandler = char_data
> 
> parser.Parse(data, True)
> 
> ...and get:
> 
> John Cleese
> Basil Fawlty
> Eric Idle
> 
> This won't break if someone starts adding attributes to the
> name tags, or if anyone decides to format the file differently,
> so that items you assumed were on one line are suddenly
> spread over three lines. If two files represent the
> same content from an XML perspective, this program will
> extract the same data from both. I don't think you can ever
> achieve that with regular expressions. (At the very least, it
> would be very hard work.)
> 
> A simple regular expression might well seem to solve the problem
> for you, with less code than expat etc, but it will probably be
> much more brittle than using a real XML parser.
> 
> I have made the assumption here that persons aren't nested inside
> persons, and that names aren't nested inside names, but as long as
> that's true, I think this should work as intended.
> 
> 
> --
> Magnus Lycka (It's really Lyckå), magnus@thinkware.se
> Thinkware AB, Sweden, www.thinkware.se
> I code Python ~ The shortest path from thought to working program 
> 
> 
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
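
On the nesting assumption mentioned in the quoted message: if it ever stops holding (say, a person element that contains a pet with its own name), one option is to track the list of open elements instead of two flags. This is only a sketch of that idea, not the code from the quoted message; the function name person_names and the rule "a name counts only as a direct child of person" are assumptions made for illustration.

```python
import xml.parsers.expat

def person_names(data):
    """Return the text of every <name> that is a direct child of <person>."""
    path = []    # names of the currently open elements, outermost first
    found = []   # completed names, in document order
    chunks = []  # text pieces of the <name> currently being read

    def start_element(name, attrs):
        path.append(name)
        if path[-2:] == ['person', 'name']:
            del chunks[:]  # start collecting a new name

    def end_element(name):
        if path[-2:] == ['person', 'name']:
            # expat may deliver one text node in several pieces; join them.
            found.append(''.join(chunks))
        path.pop()

    def char_data(text):
        if path[-2:] == ['person', 'name']:
            chunks.append(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(data, True)
    return found
```

Returning a list rather than printing also makes the result easier to check or reuse.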

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************