Hello,

I'm writing some code to check whether our daily backups worked. Backup Exec stores its results in XML files. Sometimes bad characters (or maybe binary data) end up in these XML files, and then lxml chokes:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95

The offending line looks like this (not sure if the bad characters will make it through the email):

</error><error>Directory not found. Can not backup directory \Data\\l Strategy - Progress Rep.doc\\\??ā?\\VIC-ve\TT\miscellaneous and its subdirectories.
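For context, char value 11 is the vertical tab (0x0B), one of the control characters that XML 1.0 does not allow in character data, which is why the parser rejects the line. A minimal sketch of one possible workaround follows, assuming the files fit in memory and use UTF-8 or another ASCII-compatible encoding (not UTF-16): remove the disallowed control bytes before parsing. The helper name strip_invalid_xml_chars is hypothetical and not part of lxml.

```python
import re

# Bytes that XML 1.0 forbids in character data: the C0 control codes
# other than tab (0x09), newline (0x0A) and carriage return (0x0D).
# Assumption: the file is UTF-8 or another ASCII-compatible encoding.
_INVALID_XML_BYTES = re.compile(br"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(data):
    """Return `data` (bytes) with the disallowed control bytes removed."""
    return _INVALID_XML_BYTES.sub(b"", data)


# Made-up sample containing a vertical tab (char value 11).
sample = b"<error>Directory not found.\x0b</error>"
cleaned = strip_invalid_xml_chars(sample)
```

The cleaned bytes could then be handed to etree.fromstring() in place of calling etree.parse() on the file name. Note that this drops the bad bytes silently, so the path shown in the error text will no longer match what Backup Exec wrote.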
Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):

##################################
Xml = etree.parse(XmlFileName)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
##############################

The code works fine unless there are invalid characters in the file. I am happy for any suggestion: the bit I'm interested in is always near the end of the XML file, so I hope there is a way to get it reliably regardless of the gunk elsewhere in the file.

Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get this:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root

The code I tried for the 'recover' parser option:

XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml = etree.parse(StringIO(XmlFileName), parser)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")

I guess I'm just specifying the option wrong, but can't see how I should be doing it. Any suggestion, including how to circumvent or work around the problem, is most welcome.
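A likely explanation for the "missing root" error, sketched here with the standard library only: StringIO(XmlFileName) wraps the path string itself as an in-memory file, so the parser is given the text of the file name rather than the contents of the file. With recover=True the parser tolerates that non-XML input and returns a tree without a root element. The variable names below are illustrative, and the commented lxml lines are a suggestion that this sketch does not run or verify against the real backup file.

```python
from io import StringIO

xml_file_name = u'c:/BEX03194.xml'

# StringIO turns the *string* into a file-like object. Reading from it
# yields the path text, not the contents of the file on disk.
seen_by_parser = StringIO(xml_file_name).read()

# Passing the path directly lets lxml open the file itself:
#     parser = etree.XMLParser(recover=True)
#     Xml = etree.parse(xml_file_name, parser)
```

Whether recover=True then copes with the invalid character on line 132 would still need to be checked against the actual file.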