HTML Parser chokes on WordHTML...

Fri May 2 18:28:29 EDT 2003

Harald Massa <cpl.19.ghum at spamgourmet.com> wrote:

> first, content of an <-- Tag is taken as data:

I'm not an expert on Python's HTMLParser, but this behaviour is correct. The
<style> element is defined by the HTML 3.2 DTD and later as containing CDATA,
that is, text in which no characters have special markup meaning. (Up until
the next occurrence of '</', anyway.)
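
A minimal sketch of that behaviour, assuming Python 3 (where the module is
html.parser rather than HTMLParser). The DataCollector class and the sample
markup are made up for illustration. The pseudo-comment inside <style> should
arrive through handle_data, not handle_comment:

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records which callbacks see the <style> body."""

    def __init__(self):
        super().__init__()
        self.data = []
        self.comments = []

    def handle_data(self, data):
        # Called for character data, including CDATA element content.
        self.data.append(data)

    def handle_comment(self, data):
        # Called only for real comments outside CDATA elements.
        self.comments.append(data)


collector = DataCollector()
collector.feed("<style><!-- p { color: red } --></style>")
collector.close()

# Join the chunks, since the parser may deliver data in several pieces.
style_text = "".join(collector.data)
```

Here style_text holds the whole pseudo-comment, markers included, and
collector.comments stays empty.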

The use of pseudo-comments inside <style> and <script> blocks is a hack for
pre-HTML 3.2 browsers. It won't work in XHTML, where elements' contents
are not implicitly CDATA - in a real XHTML implementation that stylesheet will
actually be a comment, and will have no effect.
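
As a rough illustration, here is the same markup run through an XML parser.
This sketch uses the standard library's xml.etree.ElementTree as a stand-in
for "a real XHTML implementation", which is an assumption on my part; the
sample markup is made up:

```python
import xml.etree.ElementTree as ET

# The same pseudo-commented stylesheet, parsed as XML rather than HTML.
xhtml_style = "<style><!-- p { color: red } --></style>"
style = ET.fromstring(xhtml_style)

# The default tree builder discards comments, so <style> has no text left.
style_text = style.text
```

Because the XML parser treats the content as a genuine comment, no
stylesheet text remains in the element.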

> To my understanding no good idea to put the stylesheet inside of the
> HTML-File, but rather legal HTML.

Indeed, it may be better authoring practice in general to use linked
stylesheets, but embedded stylesheets are perfectly legit.

> again, <![if !suportLists]> does not look great, but should be legal
> HTMl - should'nt it? 

Nope, it's complete MS-nonsense I'm afraid. It looks a bit like an SGML
marked section (used for conditional inclusion), but isn't one. I doubt
HTMLParser would cope with marked sections anyway; they are not a feature
supported by any mainstream web browser.

> So... is there any replacement for the HTMLParser from the python.lib
> which even can eat Microsoft Word HTML ? 

Not as far as I'm aware. But you could subclass HTMLParser and implement a
handler for it, something like:

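  from HTMLParser import HTMLParser  # Python 2 module name

  class MSHTMLParser(HTMLParser):
      def parse_declaration(self, i):
          # Skip Word's '<![...]>' pseudo-declarations entirely.
          if self.rawdata[i:i+3] == '<![':
              j = self.rawdata.find(']>', i+3)
              if j == -1:
                  return -1  # terminator not buffered yet; wait for more data
              return j + 2   # resume parsing just past ']>'
          # Everything else goes through the normal declaration handling.
          return HTMLParser.parse_declaration(self, i)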

This is untested but might work. It would just throw all <![quatsch]>
declarations away. It won't parse real X[HT]ML <![CDATA[...]]> sections
properly, but I don't think Word produces any.
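
A different way to get the same effect, without relying on parser internals,
is to strip the declarations before feeding the document to the parser. This
is only a sketch of that pre-filtering idea, not the subclass approach above;
the names MS_DECLARATION and strip_ms_declarations are hypothetical. Unlike
the subclass, it leaves <![CDATA[...]]> sections alone:

```python
import re

# Matches '<![ ... ]>' but not '<![CDATA[ ... ]]>' (hypothetical helper name).
MS_DECLARATION = re.compile(r"<!\[(?!CDATA\[).*?\]>", re.DOTALL)


def strip_ms_declarations(html):
    """Remove Word-style '<![...]>' declarations, keeping the text around them."""
    return MS_DECLARATION.sub("", html)


word_html = "<p><![if !supportLists]>1.<![endif]>Item</p>"
cleaned = strip_ms_declarations(word_html)
```

The cleaned string can then be passed to feed() as usual. Note that this
works on the raw text, so it would also strip anything that looks like such a
declaration inside a <script> or <style> block.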

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/