HTML DOM parser?
John J. Lee
jjl at pobox.com
Thu Jul 31 21:45:56 EDT 2003
Paul Rubin <http://phr.cx@NOSPAM.invalid> writes:
> Is there an HTML DOM parser available for Python? Preferably one that
> does a reasonable job with the crappy HTML out there on real web
> pages, that doesn't get upset about unterminated tables and stuff like
> that. Many extra points if it understands Javascript. Application is
> a screen scraping web robot. Thanks.
glork. I just started working on this myself.
Email me if you'd like the code, such as it is. I've wrapped the
Mozilla JS interpreter but am currently stuck on a segfault, so I
could certainly do with a collaborator.
I'm using utidylib and 4DOM (the latter from PyXML).
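For a sense of what a tolerant HTML-to-DOM step involves, here is a minimal sketch. It is not the code mentioned above, and it does not use utidylib or 4DOM: it assumes a current Python 3 and swaps in the standard library's html.parser and xml.dom.minidom. The names LenientDOMBuilder, parse_html and VOID_TAGS, and the synthetic "root" element, are made up for illustration. It only handles unterminated and stray tags, nothing like the full cleanup that Tidy does.

```python
from html.parser import HTMLParser
from xml.dom.minidom import getDOMImplementation

# Elements that never take a closing tag in HTML (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LenientDOMBuilder(HTMLParser):
    """Build a minidom tree from tag soup, tolerating missing end tags."""

    def __init__(self):
        super().__init__()
        # Hypothetical synthetic root so that fragments without <html> still parse.
        self.document = getDOMImplementation().createDocument(None, "root", None)
        self.current = self.document.documentElement

    def handle_starttag(self, tag, attrs):
        node = self.document.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. <input checked>) arrive as None.
            node.setAttribute(name, value or "")
        self.current.appendChild(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly closing
        # anything left open inside it. Stray end tags are ignored.
        node = self.current
        while node is not self.document.documentElement:
            if node.tagName == tag:
                self.current = node.parentNode
                return
            node = node.parentNode

    def handle_data(self, data):
        if data.strip():
            self.current.appendChild(self.document.createTextNode(data))


def parse_html(text):
    """Return a minidom Document for possibly malformed HTML text."""
    builder = LenientDOMBuilder()
    builder.feed(text)
    builder.close()  # flush any text still buffered at end of input
    return builder.document
```

Calling parse_html("<table><tr><td>cell</table><p>after") returns a document that can be queried with the usual DOM calls such as getElementsByTagName, even though the row, cell and paragraph are never closed.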
Mind you, if you actually want to get a job done <wink>, for a
quick-but-bulky (and somewhat closed) solution, try PyKDE (KHTML /
KJS) or IE automation (MSHTML / JScript). Mozilla + XPCOM is another
option, but I think it requires rebuilding Mozilla to get PyXPCOM
support. There's also httpunit (in Java, usable from Jython).
John