[lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option)
Hello

I'm writing some code to check whether our daily backups worked. Backup Exec stores its results in XML files. Sometimes bad characters (or maybe binary data) end up in these XML files, and then lxml chokes:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95

The offending line looks like this (not sure if the bad characters will make it through the email):

</error><error>Directory not found. Can not backup directory \Data\\l Strategy - Progress Rep.doc\\\??ā?\\VIC-ve\TT\miscellaneous and its subdirectories.
Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):

##################################
Xml = etree.parse(XmlFileName)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
##############################

The code works fine unless there are invalid characters in the file. I am happy for any suggestion, because the bit I'm interested in is always near the end of the XML file, and there should be a way to get it reliably regardless of the gunk elsewhere in the file (or that's what I hope).

Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get this:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root

The code I tried for the 'recover' parser option:

XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml = etree.parse(StringIO(XmlFileName), parser)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")

I guess I'm just specifying the option wrong, but I can't see how I should be doing it. Any suggestion, including how to circumvent or work around the problem, is most welcome.
Hi,

Ben wrote:

Xml = etree.parse(XmlFileName)
##############################
XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml = etree.parse(StringIO(XmlFileName), parser)
Not sure if this is just a "find-a-short-example" error, but you parse the filename, not the file here. This should read

Xml = etree.parse(XmlFileName, parser)
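To make the difference concrete: wrapping the path in StringIO turns the path text itself into the document, while passing the path lets the parser open the file. The sketch below illustrates this with the standard-library xml.etree.ElementTree, whose parse() also accepts either a file name or a file-like object; it is an illustration of the mistake, not lxml's own code, and the temporary file and its content are made up.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a tiny, made-up document to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<joblog><end_time>made-up timestamp</end_time></joblog>")
    path = handle.name

try:
    # Passing the path lets the parser open and read the file.
    tree = ET.parse(path)
    end_time = tree.findtext(".//end_time")

    # Wrapping the path in StringIO makes the path text itself the
    # document, and a file name is not well-formed XML.
    try:
        ET.parse(io.StringIO(path))
        path_text_parsed = True
    except ET.ParseError:
        path_text_parsed = False
finally:
    os.remove(path)

print(end_time)
print(path_text_parsed)
```

A strict parser rejects the path text outright, which is why the mistake only surfaced as a confusing "missing root" error once recover=True was switched on.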
Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get this:
C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root
I guess that happens when the parser "recover"s from not finding any XML at all. Maybe we should still raise an exception in this case instead of returning an empty ElementTree. This is really an extreme case of broken data...

Stefan
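Until the parser raises in that case, a caller can check for the missing root itself before querying the tree. This is a minimal sketch, not lxml behaviour: the helper name require_root is hypothetical, and an empty standard-library ElementTree stands in for the tree a recovering parser might hand back. The check only relies on getroot() returning None when there is no root, so the same helper should also accept a tree returned by lxml's etree.parse().

```python
import xml.etree.ElementTree as ET


def require_root(tree):
    """Return the tree's root element, or raise ValueError if there is none."""
    root = tree.getroot()
    if root is None:
        raise ValueError("parser returned a tree without a root element")
    return root


# An empty tree stands in for the result of "recovering" from input
# that contained no XML at all.
empty_tree = ET.ElementTree()
try:
    require_root(empty_tree)
    outcome = "root found"
except ValueError as exc:
    outcome = "rejected: %s" % exc

print(outcome)
```

Calling require_root(Xml) right after parsing would replace the AssertionError from findtext() with an error raised at the point where the tree was produced.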
Stefan wrote:
Not sure if this is just a "find-a-short-example" error, but you parse the filename, not the file here. This should read
Xml = etree.parse(XmlFileName, parser)
(LOL) This is indeed a "find-a-short-example" error - which is what you use when you are a sysadmin. Now it works and gets me past the invalid characters too. Thanks for lxml
participants (2): Ben, Stefan Behnel