HTML Parser chokes on WordHTML...

Fri May 2 14:14:53 EDT 2003

HTMLParser failing

I am trying to parse an HTML file that was generated by Microsoft Word.
Two serious errors occur:

First, the content of a <!-- comment is passed on as data:
(
handle_data(self, text) gets the following as text:
<!--
 /* Font Definitions */
@font-face
[... and so on]
)

The input data is:

<style>
<!--
 /* Font Definitions */
@font-face
	{font-family:Wingdings;
	panose-1:5 0 0 0 0 0 0 0 0 0;
	mso-font-charset:2;
	mso-generic-font-family:auto;
	mso-font-pitch:variable;
	mso-font-signature:0 268435456 0 0 -2147483648 0;}
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin-top:0cm;
	margin-right:0cm;
	margin-bottom:6.0pt;
	margin-left:0cm;
	mso-pagination:widow-orphan;
	font-size:10.0pt;
	ifont-family:Arial;
	mso-fareast-font-family:"Times New Roman";
	mso-bidi-font-family:"Times New Roman";}
h1
	{mso-style-next:Standard;
	margin-top:12.0pt;

[... going on for around 1400 Lines ..]

As far as I understand, putting the stylesheet inside the HTML file is
not a good idea, but it is legal HTML. A closing --> and a </style> tag
are also present.

What is going wrong inside HTMLParser?
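
A note on the first symptom: in HTML 4 the body of a <style> element is
raw character data, so a <!-- inside it is not parsed as a comment and
reaches handle_data as plain text. A minimal sketch of one way to ignore
that text, assuming the current Python 3 html.parser module rather than
the Python 2.2 HTMLParser.py shown in this post; the class and function
names (TextOnlyParser, visible_text) are made up for illustration:

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text, skipping the bodies of <style> and <script>.

    Hypothetical helper for illustration; not part of the standard library.
    """

    SKIP = ("style", "script")

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text pieces seen outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # The stylesheet text arrives here; drop it while inside <style>.
        if not self.skip_depth:
            self.chunks.append(data)


def visible_text(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


sample = (
    "<style>\n<!--\n /* Font Definitions */\n@font-face\n-->\n</style>"
    "<p>Hello</p>"
)
print(visible_text(sample))
```

The stylesheet is still delivered to handle_data; the subclass simply
discards it while the skip counter is non-zero.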

The second error: HTMLParser raises an exception:

Traceback (most recent call last):
  File "<input>", line 2, in ?
  File "C:\Python22\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python22\lib\HTMLParser.py", line 158, in goahead
    k = self.parse_declaration(i)
  File "C:\Python22\lib\markupbase.py", line 66, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "C:\Python22\lib\markupbase.py", line 313, in _scan_name
    self.error("expected name token")
  File "C:\Python22\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: expected name token, at line 1494, column 29

Line 1494, the one named in the error, is:

<![if !supportLists]>-          &nbsp
;      <![endif]>definiert die
Grundzüge der Risikopolitik ....

Again, <![if !supportLists]> does not look great, but it should be legal
HTML, shouldn't it?
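
For reference, <![if ...]> and <![endif]> are Microsoft-specific
conditional markers, not standard HTML declarations, which is presumably
why the declaration parser complains that it expected a name token. A
rough sketch of one possible workaround: strip those markers with a
regular expression before feeding the text to the parser. The function
name strip_word_conditionals and the pattern are assumptions made up for
this example, and the pattern only covers the two marker forms quoted
above:

```python
import re

# Matches Word markers such as <![if !supportLists]> and <![endif]>.
# Assumption: the condition text never contains a "]" character.
_WORD_CONDITIONAL = re.compile(r"<!\[(?:if\b[^\]]*|endif)\]>", re.IGNORECASE)


def strip_word_conditionals(html):
    """Return html with Word's conditional markers removed.

    The content between the markers is kept; only the markers go away.
    """
    return _WORD_CONDITIONAL.sub("", html)


line = "<![if !supportLists]>-          &nbsp;      <![endif]>definiert die"
cleaned = strip_word_conditionals(line)
print(cleaned)
```

The cleaned string can then be passed to feed() as usual. Ordinary
comments and tags do not match the pattern and are left untouched.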

So, is there any replacement for the HTMLParser from the Python standard
library that can digest even Microsoft Word HTML?

Thanks for your consideration,

Harald