Newbie XML SAX Parsing: How do I ignore an invalid token?
Chris Lambacher
chris at kateandchris.net
Fri Jan 5 17:45:45 EST 2007
What exactly is invalid about the XML fragment you provided?
It seems to parse correctly with ElementTree:
>>> from xml.etree import ElementTree as ET
>>> e = ET.fromstring("""
... <cities>
... <city>
... <name>Tampa</name>
... <description>A great city ^^ and place to live</description>
... </city>
... <city>
... <name>Clearwater</name>
... <description>Beautiful beaches</description>
... </city>
... </cities>
... """)
>>> print ET.tostring(e)
<cities>
<city>
<name>Tampa</name>
<description>A great city ^^ and place to live</description>
</city>
<city>
<name>Clearwater</name>
<description>Beautiful beaches</description>
</city>
</cities>
>>>
Do you have invalid characters? Unclosed tags? The solution to each of these
problems is different. More information will elicit better solutions.
-Chris
On Fri, Jan 05, 2007 at 01:50:18PM -0800, scott at crybabymaternity.com wrote:
> I've got an XML feed from a vendor that is not well-formed, and having
> them change it is not an option. I'm trying to figure out how to
> create an error-handler that will ignore the invalid token and continue
> on.
>
> The file is large, so I'd prefer not to put it all in memory or save it
> off and strip out the bad characters before I parse it.
>
> I've included one of the problematic characters in a small XML snippet
> below.
>
> I'm new to Python, and I don't know how to accomplish this. Any help is
> greatly appreciated!
>
> -----------------------------------------------------------------
>
> Here is my code:
>
> from xml.sax import make_parser
> from xml.sax.handler import ContentHandler
> import StringIO
>
> class ErrorHandler:
>     def __init__(self, parser):
>         self.parser = parser
>     def warning(self, msg):
>         print '*** (ErrorHandler.warning) msg:', msg
>     def error(self, msg):
>         print '*** (ErrorHandler.error) msg:', msg
>     def fatalError(self, msg):
>         print msg
>
> class ContentHandler(ContentHandler):
>     def __init__ (self):
>         pass
>     def startElement(self, name, attrs):
>         pass
>     def characters (self, ch):
>         pass
>     def endElement(self, name):
>         pass
>
> xmlstr = """
> <cities>
> <city>
> <name>Tampa</name>
> <description>A great city and place to live</description>
> </city>
> <city>
> <name>Clearwater</name>
> <description>Beautiful beaches</description>
> </city>
> </cities>
> """
> parser = make_parser()
> curHandler = ContentHandler()
> errorHandler = ErrorHandler(parser)
> parser.setContentHandler(curHandler)
> parser.setErrorHandler(errorHandler)
> parser.parse(StringIO.StringIO(xmlstr))
>
> --
> http://mail.python.org/mailman/listinfo/python-list