Newbie XML SAX Parsing: How do I ignore an invalid token?
scott at crybabymaternity.com
Fri Jan 5 22:30:26 EST 2007
My original posting has a funky line-break character (it appears as an
ASCII square) that blows up my program, but it may or may not show up
when you view my message.
I was afraid to use ElementTree, since my XML files can be very long,
and I was concerned about using in-memory structures to hold all the data.
Is my understanding correct that SAX reads the file line by line?
Is there a way to account for the invalid token in the error handler? I
don't mind parsing out the bad characters on a case-by-case basis. The
weather data I am ingesting only seems to have this line break
character that the parser doesn't like.
Thanks!
Scott
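
One way to strip the bad characters without holding the whole file in
memory is to clean each chunk just before handing it to the parser's
incremental feed() method. The sketch below is only an illustration, not
the method anyone in this thread used. It is written for Python 3 and
assumes the offending bytes are control characters that XML 1.0 forbids
(the exact character in the feed is not known from the thread), and that
the file uses an ASCII-compatible encoding such as UTF-8. The names
TextCollector and parse_stripping_controls are hypothetical.

```python
import io
import re
import xml.sax
from xml.sax.handler import ContentHandler

# ASSUMPTION: the bad bytes are C0 control characters. XML 1.0 forbids all
# of them except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class TextCollector(ContentHandler):
    """Hypothetical demo handler that just collects character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_stripping_controls(stream, handler, chunk_size=64 * 1024):
    """Feed a binary stream to a SAX parser chunk by chunk.

    Forbidden control bytes are removed from each chunk first, so only
    one chunk is held in memory at a time.
    """
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Single-byte substitution is safe for ASCII-compatible encodings:
        # in UTF-8, bytes of multi-byte sequences are all >= 0x80.
        parser.feed(_ILLEGAL_BYTES.sub(b"", chunk))
    parser.close()


if __name__ == "__main__":
    sample = (b"<cities><city><name>Tampa</name>"
              b"<description>A great city\x0b and place to live"
              b"</description></city></cities>")
    collector = TextCollector()
    parse_stripping_controls(io.BytesIO(sample), collector, chunk_size=16)
    print("".join(collector.chunks))
```

For a real file, open it in binary mode (open(path, "rb")) and pass the
file object in place of the BytesIO. Removing the bytes outright is a
design choice; replacing them with a space would work the same way if
the words on either side should stay separated.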
Chris Lambacher wrote:
> What exactly is invalid about the XML fragment you provided?
> It seems to parse correctly with ElementTree:
> >>> from xml.etree import ElementTree as ET
> >>> e = ET.fromstring("""
> ... <cities>
> ... <city>
> ... <name>Tampa</name>
> ... <description>A great city ^^ and place to live</description>
> ... </city>
> ... <city>
> ... <name>Clearwater</name>
> ... <description>Beautiful beaches</description>
> ... </city>
> ... </cities>
> ... """)
> >>> print ET.tostring(e)
> <cities>
> <city>
> <name>Tampa</name>
> <description>A great city ^^ and place to live</description>
> </city>
> <city>
> <name>Clearwater</name>
> <description>Beautiful beaches</description>
> </city>
> </cities>
> >>>
>
> Do you have invalid characters? unclosed tags? The solution to each of these
> problems is different. More info will solicit better solutions.
>
> -Chris
>
> On Fri, Jan 05, 2007 at 01:50:18PM -0800, scott at crybabymaternity.com wrote:
> > I've got an XML feed from a vendor that is not well-formed, and having
> > them change it is not an option. I'm trying to figure out how to
> > create an error-handler that will ignore the invalid token and continue
> > on.
> >
> > The file is large, so I'd prefer not to put it all in memory or save it
> > off and strip out the bad characters before I parse it.
> >
> > I've included one of the problematic characters in a small XML snippet
> > below.
> >
> > I'm new to Python, and I don't know how to accomplish this. Any help is
> > greatly appreciated!
> >
> > -----------------------------------------------------------------
> >
> > Here is my code:
> >
> > from xml.sax import make_parser
> > from xml.sax.handler import ContentHandler
> > import StringIO
> >
> > class ErrorHandler:
> >     def __init__(self, parser):
> >         self.parser = parser
> >     def warning(self, msg):
> >         print '*** (ErrorHandler.warning) msg:', msg
> >     def error(self, msg):
> >         print '*** (ErrorHandler.error) msg:', msg
> >     def fatalError(self, msg):
> >         print msg
> >
> > class ContentHandler(ContentHandler):
> >     def __init__ (self):
> >         pass
> >     def startElement(self, name, attrs):
> >         pass
> >     def characters (self, ch):
> >         pass
> >     def endElement(self, name):
> >         pass
> >
> > xmlstr = """
> > <cities>
> > <city>
> > <name>Tampa</name>
> > <description>A great city and place to live</description>
> > </city>
> > <city>
> > <name>Clearwater</name>
> > <description>Beautiful beaches</description>
> > </city>
> > </cities>
> > """
> > parser = make_parser()
> > curHandler = ContentHandler()
> > errorHandler = ErrorHandler(parser)
> > parser.setContentHandler(curHandler)
> > parser.setErrorHandler(errorHandler)
> > parser.parse(StringIO.StringIO(xmlstr))
> >
> > --
> > http://mail.python.org/mailman/listinfo/python-list
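
On the error-handler approach in the quoted code: an invalid token is
reported as a fatal error, and as far as I know the expat-based parser
cannot resume normal parsing after one, so an error handler can report
the problem but not skip past it. What the exception does provide is the
position of the bad token, which helps in working out exactly which
character needs to be filtered. The sketch below is a Python 3
illustration under that assumption; find_first_error is a hypothetical
helper name, and the sample data uses a vertical-tab byte as a stand-in
for the unknown character in the feed.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def find_first_error(data):
    """Return (line, column, message) for the first fatal parse error.

    Returns None if the data parses cleanly. The default error handler
    raises SAXParseException on a fatal error, so catching it is enough.
    """
    try:
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


if __name__ == "__main__":
    # Stand-in sample: \x0b (vertical tab) is not allowed in XML 1.0.
    bad = b"<a>\n<b>bad\x0bchar</b>\n</a>"
    print(find_first_error(bad))
    print(find_first_error(b"<a><b>fine</b></a>"))
```

The reported line and column can then be used to inspect the raw bytes
at that position in the file.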