Python web client anyone?

Bill Bell bill-bell at bill-bell.hamilton.on.ca
Mon Oct 15 07:44:34 EDT 2001


Paul Rubin <phr-n2001d at nightsong.com> wrote, in part:
> ... I was looking for something that actually parses the HTML on
> the retrieved page like LWP does.  I wonder if there's some way to
> do that with the XML libraries (though HTML is generally not
> well-formed XML ... 

Paul,

If your platform is MS Windows then you might consider using MSHTML. 
It's the HTML parser (and more) that's embedded in IE, and it can be 
driven as a COM object. A product like IE necessarily copes well 
with broken HTML documents, and MSHTML is, I believe, freely 
distributable.

The snag in using MSHTML with Python is that Python is as yet 
unable to call vtable-based interfaces (which are really needed 
to use MSHTML); see Mark Hammond's remarks of several weeks 
ago. One way around this problem is to model code on the 'walkall' 
example provided on MSDN and wrap it in some way that makes what 
you want accessible from Python.

I have not investigated what's available for parsing HTML on other 
platforms. However, the same general strategy (i.e., that of 
exercising one of the best available web clients on the platform) 
might work in those cases too.
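
As a rough illustration of the kind of lenient parsing Paul is asking 
about, done in Python itself rather than through a browser component, 
here is a minimal sketch. It assumes a modern Python with the standard 
library's html.parser module, which is not MSHTML and which I have not 
compared against it. The LinkCollector class, the extract_links helper 
and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href of every anchor found in html_text, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately broken HTML: unclosed <p>, <b> and <a> elements,
# an unquoted attribute value, and mixed-case tag names.
sample = (
    "<html><body><p>See <a href=one.html>one<p><b>and "
    "<A HREF='two.html'>two</body>"
)
print(extract_links(sample))
```

This only reports tags as they are encountered and makes no attempt to 
repair the document tree the way a browser does, so it is a much weaker 
tool than a real web client's parser.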

Best of luck,

Bill
"It is the time that you have wasted for your rose that makes your rose so important."--St-Exupery



