Daily WTF with XML, or error handling in SAX
mrkafk at gmail.com
Sat May 3 16:50:10 EDT 2008
So I set out to learn how to handle three-letter-acronym files in Python,
and SAX worked nicely until I ran into badly formed XML, e.g. files with
bad characters in them (Unicode is supposed to handle it all, but
apparently it doesn't). I'm using http://dchublist.com/hublist.xml.bz2 as
example data, with the goal of extracting the Users and Address attributes
of hubs where the number of Users is greater than a given number.
So I extended my First XML Example with an error handler:
# ========= snip ===========
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
from xml.sax.handler import ErrorHandler

class HubHandler(ContentHandler):
    def __init__(self, hublist):
        ContentHandler.__init__(self)
        self.Address = ''
        self.Users = ''
        # Keep the result list on the instance instead of relying on the global hl.
        self.hl = hublist
    def startElement(self, name, attrs):
        self.Address = attrs.get('Address',"")
        self.Users = attrs.get('Users', "")
    def endElement(self, name):
        if name == "Hub" and int(self.Users) > 2000:
            #print self.Address, self.Users
            self.hl.append({self.Address: int(self.Users)})

class HubErrorHandler(ErrorHandler):
    def __init__(self):
        pass
    def error(self, exception):
        import sys
        print "Error, exception: %s\n" % exception
    def fatalError(self, exception):
        print "Fatal Error, exception: %s\n" % exception

hl = []
parser = make_parser()
hHandler = HubHandler(hl)
errHandler = HubErrorHandler()
parser.setContentHandler(hHandler)
parser.setErrorHandler(errHandler)
fh = file('hublist.xml')
parser.parse(fh)

# Each entry in hl is a one-item dict {Address: Users}; order by the Users value.
def compare(x,y):
    if x.values()[0] > y.values()[0]:
        return 1
    elif x.values()[0] < y.values()[0]:
        return -1
    return 0

hl.sort(cmp=compare, reverse=True)
for h in hl:
    print h.keys()[0], " ", h.values()[0]
# ========= snip ===========
And then BAM, Pythonwin hit me with this:
>>> execfile('ph.py')
Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid token)
Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid token)
Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid token)
Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid token)
Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid token)
>>> ================================ RESTART ================================
Just before the "RESTART" line, Windows announced that it had killed the
pythonw.exe process (I suppose it was a child process).
WTF is happening here? Wasn't the fatalError method in HubErrorHandler
supposed to handle the invalid tokens? And why is the message repeated
so many times? My method apparently does get called, but something in SAX
goes awry and the interpreter crashes.
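
One possible explanation, offered as a sketch rather than a definitive diagnosis: the xml.sax documentation says parsing is expected to terminate when fatalError returns, and the default ErrorHandler.fatalError raises the exception it is given. A fatalError that only prints lets the parser carry on, which is probably why the same message shows up again and again as more of the file is fed in. The example below re-raises from fatalError and catches the exception around parse(). It is written in Python 3 syntax, and the names HubCollector, RaisingErrorHandler, parse_hubs and SAMPLE are made up for illustration; the sample data is invented and only stands in for the real hublist.xml.

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler


class HubCollector(ContentHandler):
    """Collect (address, users) pairs for hubs above a user threshold."""

    def __init__(self, min_users):
        super().__init__()
        self.min_users = min_users
        self.hubs = []

    def startElement(self, name, attrs):
        if name == "Hub":
            users = int(attrs.get("Users", "0"))
            if users > self.min_users:
                self.hubs.append((attrs.get("Address", ""), users))


class RaisingErrorHandler(ErrorHandler):
    """Hypothetical handler: log the fatal error, then re-raise to stop parsing."""

    def fatalError(self, exception):
        print("Fatal Error, exception: %s" % exception)
        raise exception


def parse_hubs(source, min_users):
    """Return (hubs seen before any fatal error, the error or None)."""
    collector = HubCollector(min_users)
    parser = make_parser()
    parser.setContentHandler(collector)
    parser.setErrorHandler(RaisingErrorHandler())
    error = None
    try:
        parser.parse(source)
    except SAXParseException as exc:
        error = exc
    return collector.hubs, error


# Invented sample data: the 0x01 control character on line 3 is not legal in XML 1.0.
SAMPLE = (b'<Hublist>\n'
          b'<Hub Address="a.example" Users="3000"/>\n'
          b'<Hub Address="b\x01.example" Users="5000"/>\n'
          b'</Hublist>\n')

hubs, error = parse_hubs(io.BytesIO(SAMPLE), 2000)
```

With this arrangement the message is printed once, hubs holds whatever was read before the bad character, and error.getLineNumber() and error.getColumnNumber() say where the parser gave up. Getting past the bad character would still mean cleaning the input before parsing, since a well-formedness error is not something the parser can recover from.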