[Pythonmac-SIG] parsing system_profiler xml output

Bob Ippolito bob at redivi.com
Fri Nov 12 04:01:52 CET 2004


On Nov 11, 2004, at 7:45 PM, brad.allen at omsdal.com wrote:

> eichin at metacarta.com wrote:
>>> (as for the other part, Python's XML story is weird - when PyXML is
>>> installed it drops some new things into the xml namespace, so having
>>> "import xml" work doesn't actually tell you anything useful.)
>
> Ah, thanks for the clarification.
>
> Bob Ippolito <bob at redivi.com> wrote:
>> import _xmlplus is the correct way to detect the presence of PyXML.
>
> I tried this, which didn't work on my system, so I guess PyXML is not part
> of the Panther Python distribution. I don't really want to fuss with
> distributing this to all our Macs if I can avoid it.

PyXML is not part of any standard distribution; it's a third-party
"extension" to Python's XML support.  The way it integrates with the
standard library is stupid.
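
A minimal sketch of that detection, written for a current Python 3
interpreter rather than the Python 2.3 discussed in this thread. The helper
name module_available is hypothetical; the only fact taken from the thread
is that PyXML installs itself as the top-level package _xmlplus, so looking
for that name is more informative than checking that "import xml" works.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


# "import xml" succeeds with or without PyXML, so test for _xmlplus instead.
HAS_PYXML = module_available("_xmlplus")
```

On an interpreter of that era the equivalent check would be a plain
"import _xmlplus" wrapped in try/except ImportError.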

> Bob Ippolito <bob at redivi.com> wrote:
>> If this doesn't work with Python 2.4's plistlib, then the output of
>> system profiler is not correct.  I'm at least 90% sure that Python
>> 2.4's plistlib correctly reads and writes all valid plists.
>
> Well, you may be right. I don't have an XML validator handy at the moment,
> but I tried your updated, bug-fixed plistlib on the full output of
> system_profiler -xml. It failed, but I didn't get the same traceback as
> eichin. Also, plistlib worked beautifully on partial output of
> system_profiler, such as system_profiler -SPNetworkDataType -xml.
>
> Here's what I did:
>
> At the bash prompt, ran:
>       system_profiler -xml > /Users/ballen/system_profiler_output.plist
>
> In Python:
>       import pyDesktopConfig.plistlibTest as plistlib
>       plistDict = plistlib.readPlist('/Users/ballen/system_profiler_output.plist')
>
>
> ....after which I received the following error:
>
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 13031, column 15

That error comes from a lower level than DTD validation: Python's XML
parser doesn't think the XML is well-formed *at all*.  Looking at the XML,
it seems that Apple emits raw low-ASCII control characters, like the \x10
in this line:

'Nov 11 19:09:50 crack-wlan kernel: \x10ADB present:8c'

expat, the low level parser behind Python's default XML handling  
capabilities, does not like this at all.  I believe this is probably a  
bug in expat.

You can see this if you do:

import xml.dom.minidom
xml.dom.minidom.parseString('<a>\x10</a>')
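
The same experiment can be wrapped so that it reports where expat gave up
instead of raising. This is only a sketch for a current Python 3
interpreter; the function name parse_error is hypothetical, while lineno
and offset are attributes that ExpatError carries.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_error(xml_text):
    """Return (line, column, message) if expat rejects xml_text, else None."""
    try:
        xml.dom.minidom.parseString(xml_text)
    except ExpatError as exc:
        # lineno is 1-based; offset is the column within that line.
        return exc.lineno, exc.offset, str(exc)
    return None


print(parse_error('<a>\x10</a>'))  # rejected: the \x10 is an invalid token
print(parse_error('<a>ok</a>'))    # accepted, so this prints None
```

Run against the full system_profiler output, the reported line and column
should point at the first offending control character.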

I'm not 100% sure whether this is Apple's or expat's fault, but you have a
couple of options:
(a) Use Apple's plist parser via PyObjC instead (painfully easy to do
and WAY faster than plistlib)
(b) Throw away the characters expat isn't going to understand; a correct
implementation is probably close to this:

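import re
# Control characters to keep.  XML 1.0 allows only tab, CR and LF below
# U+0020, so \f and \v must be stripped along with the rest.
SKIP = map(ord, u'\t\r\n')
REFORMAT = re.compile(
     u'(%s)' %
     (u'|'.join([unichr(x) for x in range(1, ord(u' ')) if x not in SKIP]),)
)

def reformat(s):
     # Remove every control character that is not in SKIP.
     return REFORMAT.sub(u'', s)

plistString = reformat(
     file('system_profiler.xml').read().decode('utf-8')
).encode('utf-8')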
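
For readers on a current Python 3, option (b) might be sketched as below.
It assumes the file is UTF-8 (so the filtering can be done on raw bytes)
and uses plistlib.loads from today's standard library rather than the 2.4
plistlib discussed here. The names INVALID_XML_CONTROL,
strip_invalid_control_bytes and load_dirty_plist are hypothetical. The
character set is everything below U+0020 that XML 1.0 does not allow, that
is, all of it except tab, LF and CR.

```python
import plistlib
import re

# Bytes below 0x20 that XML 1.0 forbids: all except tab (09), LF (0a), CR (0d).
INVALID_XML_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_control_bytes(data):
    """Drop control bytes that would make expat reject the document."""
    return INVALID_XML_CONTROL.sub(b"", data)


def load_dirty_plist(data):
    """Parse plist bytes after removing the forbidden control bytes."""
    return plistlib.loads(strip_invalid_control_bytes(data))


# A tiny stand-in for the system_profiler output, with a raw \x10 in it.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<plist version="1.0"><dict><key>log</key>'
    b'<string>kernel: \x10ADB present:8c</string></dict></plist>'
)
print(load_dirty_plist(sample))
```

For the real file, the bytes would come from something like
open('system_profiler.xml', 'rb').read(). Filtering bytes is safe for UTF-8
because values below 0x20 never occur inside a multi-byte sequence; a
UTF-16 file would have to be decoded first.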

-bob


