HTML Parser chokes on WordHTML...

Harald Massa cpl.19.ghum at spamgourmet.com
Fri May 2 14:14:53 EDT 2003


HTMLParser failing


I am trying to parse an HTML file that was generated by Microsoft Word. Two
serious errors occur.

First, the content of a <!-- comment inside the <style> tag is passed on as data:
(
handle_data(self, text) gets the following as text:
<!--
 /* Font Definitions */
@font-face
[... and so on]
)

The input data is:

<style>
<!--
 /* Font Definitions */
@font-face
	{font-family:Wingdings;
	panose-1:5 0 0 0 0 0 0 0 0 0;
	mso-font-charset:2;
	mso-generic-font-family:auto;
	mso-font-pitch:variable;
	mso-font-signature:0 268435456 0 0 -2147483648 0;}
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin-top:0cm;
	margin-right:0cm;
	margin-bottom:6.0pt;
	margin-left:0cm;
	mso-pagination:widow-orphan;
	font-size:10.0pt;
	ifont-family:Arial;
	mso-fareast-font-family:"Times New Roman";
	mso-bidi-font-family:"Times New Roman";}
h1
	{mso-style-next:Standard;
	margin-top:12.0pt;

[... going on for around 1400 lines ...]

To my understanding it is not a good idea to put the stylesheet inside the
HTML file, but it is nevertheless legal HTML. A closing --> and a </style> tag
are also present.

What is going wrong inside HTMLParser?
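
For reference, a minimal sketch of one possible workaround: track whether the
parser is currently inside <style> and ignore any data delivered there. It
assumes a current Python 3 with the html.parser module (not the Python 2.2
HTMLParser.py from the traceback), and the class name TextOnlyParser and the
shortened sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: collects text but skips data inside <style>."""

    def __init__(self):
        super().__init__()
        self.in_style = False   # True while between <style> and </style>
        self.chunks = []        # text seen outside of <style>

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # The stylesheet (including its "<!--" wrapper) arrives here as
        # plain data, so it is dropped while in_style is set.
        if not self.in_style:
            self.chunks.append(data)


# Shortened stand-in for the Word output quoted above.
sample = (
    "<style>\n<!--\n /* Font Definitions */\n@font-face\n-->\n</style>"
    "<p>Hello</p>"
)

parser = TextOnlyParser()
parser.feed(sample)
parser.close()
text = "".join(parser.chunks)
```

With this approach the stylesheet never reaches the collected text, whether
or not the parser treats the <!-- ... --> wrapper as a comment.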

The second error: HTMLParser raises an exception:

Traceback (most recent call last):
  File "<input>", line 2, in ?
  File "C:\Python22\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python22\lib\HTMLParser.py", line 158, in goahead
    k = self.parse_declaration(i)
  File "C:\Python22\lib\markupbase.py", line 66, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "C:\Python22\lib\markupbase.py", line 313, in _scan_name
    self.error("expected name token")
  File "C:\Python22\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: expected name token, at line 1494, column 29


Line 1494, the line named in the error, is:

<p class=Aufzhlung-Strich><![if !supportLists]><span
style='font-size:8.0pt'>-<span style='font:7.0pt "Times New
Roman"'>          &nbsp
;      </span></span><![endif]>definiert die
Grundzüge der Risikopolitik .... </p> 

Again, <![if !supportLists]> does not look great, but it should be legal
HTML, shouldn't it?
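
A possible stopgap is to remove these markers before the text reaches the
parser. The sketch below assumes that <![if ...]> and <![endif]> are
Word-specific conditional markers that can be dropped without losing content;
the function name strip_word_conditionals and the shortened sample line are
hypothetical.

```python
import re

# Matches "<![if ...]>" and "<![endif]>" but not ordinary "<!-- ... -->"
# comments, which start with "<!--" rather than "<![".
WORD_CONDITIONAL = re.compile(r"<!\[(?:if\b[^\]]*|endif)\]>", re.IGNORECASE)


def strip_word_conditionals(html):
    """Return html with the Word conditional markers removed."""
    return WORD_CONDITIONAL.sub("", html)


# Shortened stand-in for line 1494 quoted above.
sample = (
    "<p class=X><![if !supportLists]><span>-</span>"
    "<![endif]>definiert die Grundzüge</p>"
)
cleaned = strip_word_conditionals(sample)
```

The cleaned string can then be passed to feed() in place of the raw file
contents; the markup between the two markers is kept as it is.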
    

So... is there any replacement for the HTMLParser from the Python standard
library that can digest even Microsoft Word HTML?
    
Thanks for your consideration,
    
Harald



