Python web client anyone?
bill-bell at bill-bell.hamilton.on.ca
Mon Oct 15 13:44:34 CEST 2001
Paul Rubin <phr-n2001d at nightsong.com> wrote, in part:
> ... I was looking for something that actually parses the HTML on
> the retrieved page like LWP does. I wonder if there's some way to
> do that with the XML libraries (though HTML is generally not
> well-formed XML ...
If your platform is MS Windows, then you might consider using MSHTML.
It's the HTML parser (and more) that's embedded in IE, and it can be
exercised as a COM object. Clearly a product like IE does an
excellent job of parsing broken HTML docs, and MSHTML is, I
believe, freely distributable.
The snag in using MSHTML with Python is that Python is as yet
unable to call vtable-based interfaces (which you really need
in order to use MSHTML); see Mark Hammond's remarks of several weeks
ago. One way around this problem is to model code on the 'walkall'
example provided on MSDN and wrap it in some way that makes what
you want accessible from Python.
I have not investigated what's available for parsing HTML on other
platforms. However, the same general strategy (i.e., that of
exercising one of the best available web clients on the platform)
might work in those cases too.
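
For comparison, here is a rough, platform-independent sketch of the kind of
thing Paul asked for: pulling the links out of a sloppy HTML page. Note that
this is NOT the MSHTML approach described above; it swaps in the html.parser
module from the standard library of a current Python 3, on the assumption
that its tolerant parsing is good enough for the pages you care about. The
names LinkCollector and extract_links and the sample markup are made up for
illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately broken markup: unclosed tags, an unquoted attribute value,
# and upper-case tag and attribute names.
sample = '<html><body><p>See <A HREF=one.html>one<p><a href="two.html">two</body>'
links = extract_links(sample)
print(links)
```

This only reports start tags as it meets them; it does not build a document
tree the way MSHTML does, so anything that depends on the repaired structure
of the page would still need a real browser engine or a fuller parser.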
Best of luck,
"It is the time that you have wasted for your rose that makes your rose so important."--St-Exupery