SAXParseException: not well-formed (invalid token)

Thu Aug 30 08:37:10 EDT 2007

Pablo Rey wrote:
>     I am getting the following error with an XML page:
> 
>>   File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69,
>> in getItems
>>     d = minidom.parseString(xml.read())
>>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
>> line 967, in parseString
>>     return _doparse(pulldom.parseString, args, kwargs)
>>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
>> line 954, in _doparse
>>     toktype, rootNode = events.getEvent()
>>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py",
>> line 265, in getEvent
>>     self.parser.feed(buf)
>>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py",
>> line 208, in feed
>>     self._err_handler.fatalError(exc)
>>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py",
>> line 38, in fatalError
>>     raise exception
>> xml.sax._exceptions.SAXParseException: <unknown>:553:48: not
>> well-formed (invalid token)
> 
> 
>> def getItems(page):
>>     opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
>>     try:
>>        xml = opener.open(page)
>>     except:
>>        return []
>>
>>     d = minidom.parseString(xml.read())
>>     items = d.getElementsByTagName('item')
>>     data = []
>>     for i in items:
>>        data.append(getText(i.childNodes))
>>
>>     return data
> 
>     The page is
> https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
> and the line with the invalid character is (the invalid character is the
> final é of Université):
> 
> <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
> Louvain/CN=Roberfroid</item>
> 
> 
>     I have tried several options but I have not been able to avoid this
> problem. Any ideas?

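Looks like the page is not well-formed XML (i.e. not XML at all). If it
doesn't declare an encoding (<?xml version="1.0" encoding="..."?>), the
parser has to assume UTF-8, and a Latin-1 "é" (the single byte 0xE9) followed
by a plain ASCII character is not valid UTF-8. You can try recoding the input,
most likely decoding it from Latin-1 and re-encoding it as UTF-8, before
passing it to the SAX parser.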
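
A minimal sketch of that recoding step, written for Python 3 rather than the
Python 2.2 in the traceback. It assumes the payload really is Latin-1 and
carries no encoding declaration; the helper name parse_latin1_xml and the
sample bytes are made up for illustration and are not taken from the VOMS
service.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_latin1_xml(raw_bytes):
    """Parse XML bytes, retrying as Latin-1 if the first attempt fails.

    Assumption: the document has no encoding declaration, so the parser
    treats the bytes as UTF-8 and rejects a lone Latin-1 byte such as 0xE9.
    """
    try:
        return minidom.parseString(raw_bytes)
    except ExpatError:
        # Every byte sequence is valid Latin-1, so this decode cannot fail.
        text = raw_bytes.decode("latin-1")
        # Re-encode as UTF-8, which is what the parser assumes by default.
        return minidom.parseString(text.encode("utf-8"))


# Hypothetical sample: a Latin-1 encoded "é" (0xE9) and no XML declaration.
sample = b"<items><item>/C=BE/OU=Universit\xe9 Catholique de Louvain</item></items>"

doc = parse_latin1_xml(sample)
item_text = doc.getElementsByTagName("item")[0].firstChild.data
```

Note that the fallback only helps when the guess about the encoding is right.
If the bytes are in some other encoding, the second parse may succeed and
still give you the wrong characters.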

Alternatively, tell the page authors to fix their page.

Stefan