HTML Parser chokes on WordHTML...
Harald Massa
cpl.19.ghum at spamgourmet.com
Fri May 2 14:14:53 EDT 2003
HTMLParser failing
I am trying to parse an HTML file that was generated by Microsoft Word. Two
serious errors occur.
First, the content of a <!-- comment is passed on as data:
(
handle_data(self, text) gets the following as text:
<!--
/* Font Definitions */
@font-face
[... and so on]
)
The input data is:
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;
mso-font-charset:2;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:0 268435456 0 0 -2147483648 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin-top:0cm;
margin-right:0cm;
margin-bottom:6.0pt;
margin-left:0cm;
mso-pagination:widow-orphan;
font-size:10.0pt;
ifont-family:Arial;
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:"Times New Roman";}
h1
{mso-style-next:Standard;
margin-top:12.0pt;
[... going on for around 1400 Lines ..]
To my understanding it is not a good idea to put the stylesheet inside the
HTML file, but it is legal HTML nonetheless. A closing --> and a </style> tag
are also present.
What is going wrong inside HTMLParser?
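One possible explanation, offered as an assumption rather than a confirmed diagnosis of the Python 2.2 module: the body of a <style> element is treated as character data, so the parser reports it through handle_data, and a <!-- inside it is not seen as a comment. A minimal sketch of a workaround is to track whether the parser is inside <style> and ignore data there. The sketch uses the Python 3 module html.parser instead of the Python 2.2 HTMLParser module from the traceback; the class name SkipStyleParser and the sample input are hypothetical.

```python
from html.parser import HTMLParser


class SkipStyleParser(HTMLParser):
    """Collects text data but ignores everything inside <style>...</style>."""

    def __init__(self):
        super().__init__()
        self.in_style = False   # True while between <style> and </style>
        self.text = []          # data chunks seen outside of <style>

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # The stylesheet arrives here as data; drop it while in_style is set.
        if not self.in_style:
            self.text.append(data)


# Hypothetical, shortened stand-in for the Word-generated file.
style_doc = (
    "<style>\n<!--\n/* Font Definitions */\n"
    "@font-face {font-family:Wingdings;}\n-->\n</style>"
    "<p>Hello</p>"
)

style_parser = SkipStyleParser()
style_parser.feed(style_doc)
style_parser.close()
visible_text = "".join(style_parser.text)
```

With this approach the stylesheet is still delivered to handle_data, but the handler discards it, so only the visible text is collected.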
The second error: HTMLParser raises an exception:
Traceback (most recent call last):
  File "<input>", line 2, in ?
  File "C:\Python22\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python22\lib\HTMLParser.py", line 158, in goahead
    k = self.parse_declaration(i)
  File "C:\Python22\lib\markupbase.py", line 66, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "C:\Python22\lib\markupbase.py", line 313, in _scan_name
    self.error("expected name token")
  File "C:\Python22\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: expected name token, at line 1494, column 29
Line 1494, the line named in the error, is:
<p class=Aufzhlung-Strich><![if !supportLists]><span
style='font-size:8.0pt'>-<span style='font:7.0pt "Times New
Roman"'>  
; </span></span><![endif]>definiert die
Grundzüge der Risikopolitik .... </p>
Again, <![if !supportLists]> does not look great, but it should be legal
HTML, shouldn't it?
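Whether or not these markers are legal, one workaround is to remove them before the text reaches the parser. The following is a minimal sketch under stated assumptions: it uses Python 3's html.parser instead of the Python 2.2 module shown in the traceback, and the names strip_word_conditionals and TextCollector, the regular expression, and the shortened sample line are all hypothetical. It only removes the <![if ...]> and <![endif]> markers themselves and keeps the content between them.

```python
import re
from html.parser import HTMLParser

# Matches Word's conditional markers such as <![if !supportLists]>
# and <![endif]>. The pattern is an assumption about what Word emits.
_WORD_CONDITIONAL = re.compile(r"<!\[(?:if\b[^\]]*|endif)\]>", re.IGNORECASE)


def strip_word_conditionals(html_text):
    """Return html_text with the conditional markers removed."""
    return _WORD_CONDITIONAL.sub("", html_text)


class TextCollector(HTMLParser):
    """Collects all text data passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Shortened, hypothetical version of the failing line.
word_line = (
    "<p class=MsoNormal><![if !supportLists]><span>-</span>"
    "<![endif]>definiert die Grundzüge</p>"
)

cleaned_line = strip_word_conditionals(word_line)

collector = TextCollector()
collector.feed(cleaned_line)
collector.close()
collected_text = "".join(collector.chunks)
```

Pre-filtering like this keeps the parser itself untouched, at the cost of one extra pass over the input.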
So... is there any replacement for the HTMLParser from the Python standard
library that can digest even Microsoft Word HTML?
Thanks for your consideration,
Harald