Jay Parlar jparlar at
Mon Aug 20 01:49:36 CEST 2001

For the application that my colleague and I are working on, it is necessary that we be able to take the raw HTML of 
some document and pull out just the text, with all tags removed. 

So far, we've been using the standard HTMLParser, and it's been doing an *ok* job. In an ideal world, I think we
could keep using it, but the fact is that there's so much garbage HTML out there that it causes problems. Besides,
HTMLParser only supports the HTML 2.0 standard, and while my knowledge of HTML is very limited, I know we've
moved well past that.
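For reference, our current approach is roughly the following sketch (names are illustrative, not our real code; in a current Python the parser class lives in the html.parser module):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags go to other handlers.
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

parser = TextExtractor()
parser.feed('<html><body><p>Hello <b>world</b></p></body></html>')
print(parser.text())  # Hello world
```

This works fine on clean markup; it's the malformed pages in the wild that trip it up.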

So, we've started playing around with MSHTML.dll, hoping to use that to get the text from a page. Our basic method 
so far has been along the lines of the following:

import win32com.client
doc = win32com.client.Dispatch('htmlfile')
doc.write(rawHTML)   # rawHTML holds the page source we've already fetched
doc.close()          # finish the stream so the document is fully parsed
plainText = doc.body.innerText

Now, most of the time this works, but sometimes, for reasons I don't understand, it goes a little crazy on us. On my personal
machine it crashes on occasion (I'm not asking for a solution to the crashing; I'd need to give you a lot more information
for that). What really infuriates us, though, are the automated things IE tends to do, such as checking whether a
plugin is installed or trying to open a connection to the internet. Simply feeding in the HTML with doc.write() seems
to trigger this automated behaviour. Our code has to process hundreds of HTML documents unattended, but it will
often stop with a download window popping up, asking if we'd like to download some plugin.

My question: Is there any way to suppress all the other IE behaviour so I can essentially use MSHTML as a pure
replacement for HTMLParser?

Jay Parlar
Software Engineering III
McMaster University
Hamilton, Ontario, Canada

"Though there are many paths
At the foot of the mountain
All those who reach the top
See the same moon."
