jparlar at home.com
Mon Aug 20 01:49:36 CEST 2001
For the application that my colleague and I are working on, it is necessary that we be able to take the raw HTML of
some document and pull out just the text, with all tags removed.
So far, we've been using the standard HTMLParser, and it's been doing an *ok* job. In an ideal world, I think we
could keep using it, but the fact is that there's so much garbage HTML out there, it causes some problems. Besides,
HTMLParser is only up to HTML 2.0 standards, and while my knowledge of HTML is very limited, I know we've
moved past that.
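For reference, the standard-parser approach we've been taking looks roughly like the sketch below. The class and function names here are just illustrative (in recent Python the module is html.parser; in 2.x it was HTMLParser):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the character data, ignoring all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)

def strip_tags(raw_html):
    """Return just the text of raw_html, with all tags removed."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.get_text()
```

This works fine on clean markup (e.g. strip_tags('<p>Hello <b>world</b></p>') gives 'Hello world'), but as noted below, it's the garbage HTML out there that gives it trouble.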
So, we've started playing around with MSHTML.dll, hoping to use that to get the text from a page. Our basic method
so far has been along the lines of the following:
doc = win32com.client.Dispatch('htmlfile')   # needs: import win32com.client
doc.write(rawHtml)                           # feed in the raw HTML first
plainText = doc.body.innerText
Now, most of the time this works, but sometimes, I don't know why, it goes a little crazy on us. On my personal
machine, it crashes on occasion (I'm not asking for a solution to the crashing, I'd need to give a lot more info to you
for that). However, what really infuriates us are the automated things that IE tends to do, such as checking to see if a
plugin is installed, or trying to create a connection to the internet. It seems that simply feeding in the HTML
with doc.write() triggers the automated things IE often does. Our code has to process hundreds of HTML
documents, and should do so automatically, but it will often stop, with a download window popping up asking if we'd like
to download some plugin.
My question: Is there any method to suppress all the other IE stuff so I can essentially use MSHTML as a pure
replacement for HTMLParser?
Software Engineering III
Hamilton, Ontario, Canada
"Though there are many paths
At the foot of the mountain
All those who reach the top
See the same moon."