HTML Parser chokes on WordHTML...
JanC
usenet_spam at janc.invalid
Fri May 2 20:38:55 EDT 2003
Harald Massa <cpl.19.ghum at spamgourmet.com> wrote:
> So... is there any replacement for the HTMLParser from the python.lib
> which even can eat Microsoft Word HTML ?
Maybe try to process the Word pseudo-HTML with "HTML Tidy" before you feed
it to HTMLParser?
<http://tidy.sourceforge.net/>
<http://tidy.sourceforge.net/docs/quickref.html#word-2000>
You could wrap tidylib for use inside Python too:
<http://tidy.sourceforge.net/libintro.html>
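
A rough sketch of the "Tidy first, then HTMLParser" idea, in case it helps. It assumes a modern Python 3 (where the parser lives in html.parser) and a "tidy" executable on the PATH; it shells out to the command-line tool rather than wrapping tidylib. The names tidy_word_html and TextExtractor are made up for illustration, and the option set is only a guess at a reasonable starting point:

```python
import shutil
import subprocess
from html.parser import HTMLParser

# Assumed option set: --word-2000 strips Word-specific markup,
# --force-output asks Tidy to emit a document even when it reports errors.
TIDY_ARGS = ["-quiet", "-utf8", "-asxhtml",
             "--word-2000", "yes",
             "--show-warnings", "no",
             "--force-output", "yes"]


def tidy_word_html(raw, tidy_exe="tidy"):
    """Clean Word HTML with the tidy CLI; return it unchanged if tidy is missing."""
    exe = shutil.which(tidy_exe)
    if exe is None:
        return raw  # no tidy available, hand the input through untouched
    proc = subprocess.run([exe, *TIDY_ARGS], input=raw,
                          capture_output=True, text=True, encoding="utf-8")
    # Fall back to the raw input if Tidy produced nothing at all.
    return proc.stdout or raw


class TextExtractor(HTMLParser):
    """Hypothetical HTMLParser subclass that collects non-blank text runs."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


word_html = "<html><body><p class=MsoNormal>Hello <o:p></o:p>world</p></body></html>"
parser = TextExtractor()
parser.feed(tidy_word_html(word_html))
parser.close()
text = " ".join(parser.chunks)
```

If Tidy is not installed the input goes to the parser as-is, so the same script still runs, just without the cleanup step.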
--
JanC
"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9