HTML Parser chokes on WordHTML...
Andrew Clover
and-google at doxdesk.com
Fri May 2 18:28:29 EDT 2003
Harald Massa <cpl.19.ghum at spamgourmet.com> wrote:
> first, content of an <-- Tag is taken as data:
I'm not an expert on Python's HTMLParser, but this behaviour is correct. The
<style> element is defined by DTD HTML 3.2 and later as containing CDATA -
that is, text without any characters having special markup meaning. (Up until
the next occurrence of '</', anyway.)
The use of pseudo-comments inside <style> and <script> blocks is a hack for
pre-HTML 3.2 browsers. It won't work in XHTML, where elements' contents
are not implicitly CDATA - in a real XHTML implementation that stylesheet will
actually be a comment, and will have no effect.
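
To see the CDATA behaviour for yourself, here is a minimal sketch. It assumes Python 3's html.parser module (the successor to the HTMLParser module discussed in this thread); the StyleProbe class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class StyleProbe(HTMLParser):
    """Record which callback each piece of input reaches."""

    def __init__(self):
        super().__init__()
        self.comments = []  # text passed to handle_comment
        self.data = []      # text passed to handle_data

    def handle_comment(self, text):
        self.comments.append(text)

    def handle_data(self, text):
        self.data.append(text)


probe = StyleProbe()
# One ordinary comment inside <p>, one pseudo-comment inside <style>.
probe.feed("<p><!-- note --></p><style><!-- p { color: red } --></style>")
probe.close()

# Join the pieces, since data may be delivered in more than one call.
style_text = "".join(probe.data)
print(probe.comments)
print(style_text)
```

The comment inside <p> should be reported through handle_comment, while the pseudo-comment inside <style> should arrive as plain data, markers and all.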
> To my understanding no good idea to put the stylesheet inside of the
> HTML-File, but rather legal HTML.
Indeed, it may be better authoring practice in general to use linked
stylesheets, but embedded stylesheets are perfectly legit.
> again, <![if !supportLists]> does not look great, but should be legal
> HTML - shouldn't it?
Nope, it's complete MS-nonsense I'm afraid. It looks a bit like an SGML
marked section (used for conditional inclusion), but isn't one. I doubt
HTMLParser would cope with marked sections anyway; they're not a feature
supported by any mainstream web browser.
> So... is there any replacement for the HTMLParser from the python.lib
> which even can eat Microsoft Word HTML ?
Not as far as I'm aware. But you could subclass HTMLParser and implement a
handler for it, something like:
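  from HTMLParser import HTMLParser  # Python 2 module name

  class MSHTMLParser(HTMLParser):
      def parse_declaration(self, i):
          # Skip Word's pseudo-declarations such as <![if ...]>
          if self.rawdata[i:i+3] == '<![':
              j = self.rawdata.find(']>', i+3)
              if j == -1:
                  return -1  # terminator not in the buffer yet
              return j+2
          return HTMLParser.parse_declaration(self, i)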
This is untested but might work. It would just throw all <![quatsch]>
declarations away. It won't parse real X[HT]ML <![CDATA[...]]> sections
properly, but I don't think Word produces any.
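
A different way to get the same effect, without depending on the parser's internal methods, is to strip the pseudo-declarations from the text before parsing. This is only a sketch under assumptions: it uses Python 3's html.parser, the names WORD_DECL, strip_word_declarations and TextCollector are hypothetical, and the regular expression is applied blindly, so it would also remove a matching sequence inside a <script> or <style> block.

```python
import re
from html.parser import HTMLParser

# Matches Word's <![if ...]> / <![endif]> pseudo-declarations,
# but leaves <![CDATA[ ... ]]> sections alone.
WORD_DECL = re.compile(r"<!\[(?!CDATA\[).*?\]>", re.DOTALL)


def strip_word_declarations(html):
    """Return html with every <![...]> pseudo-declaration removed."""
    return WORD_DECL.sub("", html)


class TextCollector(HTMLParser):
    """Hypothetical helper: gather all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, text):
        self.chunks.append(text)


sample = "<p><![if !supportLists]>1. <![endif]>First item</p>"
cleaned = strip_word_declarations(sample)

collector = TextCollector()
collector.feed(cleaned)
collector.close()
text = "".join(collector.chunks)
print(cleaned)
print(text)
```

As with the subclass above, the content between the two markers is kept; only the markers themselves are thrown away.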
--
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/