[Tutor] program that processes tokenized words in xml

Paul Tremblay phthenry@earthlink.net
Wed May 7 00:09:01 2003


Your code breaks if I use Unicode.

data = u'text \u201c'

parser.Parse(data)

Traceback (most recent call last):
  File "/home/paul/lib/python/paul/xml/expat.py", line 50, in ?
    parser.Parse(data)
UnicodeError: ASCII encoding error: ordinal not in range(128)

Interestingly, it is not expat itself that seems to be raising the
error. Any idea what is going on?
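
For what it's worth, here is a minimal sketch of the workaround I would
try next. It rests on a guess I have not confirmed: that Parse() converts
the unicode object to a byte string with the default ASCII codec before
expat ever sees it. If that is right, encoding to UTF-8 by hand should
sidestep the error. The <doc> wrapper and the chunks list are made up for
this example; I collect the text rather than print it, since printing
non-ASCII text to an ASCII terminal could raise a similar error.

```python
import xml.parsers.expat

chunks = []  # hypothetical collector; avoids printing to an ASCII terminal

def char_data(text):
    # expat may deliver one text node in several pieces, so just collect them
    chunks.append(text)

# The string from above, wrapped in a made-up <doc> root so it is well-formed
data = u'<doc>text \u201c</doc>'

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = char_data

# Encode explicitly; without an XML declaration the input is read as UTF-8
parser.Parse(data.encode('utf-8'), True)

text = u''.join(chunks)
```

If the guess is wrong and the error comes from somewhere else (a handler,
say), this at least narrows down where to look.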

Thanks

Paul

On Tue, May 06, 2003 at 04:39:45PM +0200, Magnus Lyckå wrote:
> 
> At 06:30 2003-05-06 -0700, Abdirizak abdi wrote:
> >Can anyone suggest how I can incorporate a  regular expression for
> >eliminating these tags?
> 
> Don't!
> 
> Regular expressions are not the right tool for the task if we are talking
> about XML parsing. There are finely crafted tools particularly for XML in
> Python. Use them instead.
> 
> Imagine you have XML data in a string like this and want to extract the
> names of the persons (but not the animals):
> 
> data='''<stuff>
>  <person>
>   <name>John Cleese</name><function>Funny</function>
>   <name>Basil Fawlty</name>
>  </person>
> 
>  <animal><name>Wanda</name><function>Fish</function></animal>
> 
>  <person>
>   <name>Eric Idle</name>
>   <function>Funny</function>
>  </person>
> </stuff>'''
> 
> Then we can do something like this...
> 
> import xml.parsers.expat
> 
> isPerson = False
> isName = False
> 
> def start_element(name, attrs):
>     global isPerson, isName
>     if name == 'person':
>         isPerson = True
>     elif name == 'name':
>         isName = True
> 
> def end_element(name):
>     global isPerson, isName
>     if name == 'person':
>         isPerson = False
>     elif name == 'name':
>         isName = False
> 
> def char_data(data):
>     if isPerson and isName:
>         print(data)
> 
> parser = xml.parsers.expat.ParserCreate()
> parser.StartElementHandler = start_element
> parser.EndElementHandler = end_element
> parser.CharacterDataHandler = char_data
> 
> parser.Parse(data, True)
> 
> ...and get:
> 
> John Cleese
> Basil Fawlty
> Eric Idle
> 
> This won't break if someone starts adding attributes to the
> name tags, or if anyone decides to format the file differently,
> so that items you imagined were located on the same row are
> suddenly spread over three rows. If two files represent the
> same content from an XML perspective, this program should
> extract the same data from both. I don't think you can ever
> achieve that with regular expressions. (At least it will be very
> hard work.)
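
To illustrate the point above, here is a small sketch that runs the same
handlers over two layouts of the same data and compares the results. The
names extract_names, compact and spread are invented for this example, and
the flags are moved into a function so each call gets a fresh parser.
Stripping the surrounding whitespace is a choice made here, not something
expat does on its own.

```python
import xml.parsers.expat

def extract_names(xml_text):
    """Return the text of every <name> that sits inside a <person>."""
    state = {'person': False, 'name': False}
    names = []

    def start_element(name, attrs):
        if name == 'person':
            state['person'] = True
        elif name == 'name':
            state['name'] = True
            if state['person']:
                names.append('')  # start collecting a new name

    def end_element(name):
        if name in state:
            state[name] = False

    def char_data(data):
        # Text may arrive in several pieces, so append to the current name
        if state['person'] and state['name']:
            names[-1] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return [n.strip() for n in names]

# Same content, two layouts; the second also adds an attribute
compact = ('<stuff><person><name>John Cleese</name></person>'
           '<animal><name>Wanda</name></animal></stuff>')
spread = '''<stuff>
 <person>
  <name id="1">
   John Cleese
  </name>
 </person>
 <animal><name>Wanda</name></animal>
</stuff>'''

same = extract_names(compact) == extract_names(spread)
```

Like the original, this assumes persons are not nested inside persons
and names are not nested inside names.
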
> 
> A simple regular expression might well seem to solve the problem
> for you, with less code than expat etc, but it will probably be
> much more brittle than using a real XML parser.
> 
> I have made the assumption here that persons aren't nested inside
> persons, and that names aren't nested inside names, but as long as
> that's true, I think this should work as intended.
> 
> 
> --
> Magnus Lycka (It's really Lyckå), magnus@thinkware.se
> Thinkware AB, Sweden, www.thinkware.se
> I code Python ~ The shortest path from thought to working program 

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************