MSIE6 Python Question

Michael Geary Mike at DeleteThis.Geary.com
Mon May 24 13:17:15 EDT 2004


Ralph A. Gable wrote:
> The data I want is being stripped out when I access the URL
> via urllib. I CAN see the data when I go into IE and do view
> source but when I use urllib the site intentionally blanks out
> the information I want. For that reason, I would like to get it
> using IE6 if I can. If there are other ways to fake out the site,
> I would be interested in that also.

You may be able to get urllib or urllib2 to work using some of the other
tips in this thread, such as the user agent string. Or it may have to do
with cookies, in which case the ClientCookie module may be useful:

http://wwwsearch.sourceforge.net/ClientCookie/
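
As a rough sketch of the user agent and cookie ideas above, here is one way it might look with the Python 3 standard library (urllib.request and http.cookiejar, which cover similar ground to urllib2 and ClientCookie). The helper name, the user agent string, and the example URL are assumptions for illustration; the site may check other things as well.

```python
import http.cookiejar
import urllib.request

# Assumed user agent string in the style IE6 sends; adjust to taste.
IE6_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def build_browser_like_opener(user_agent):
    """Return (opener, cookie_jar): an opener that sends the given
    User-Agent header and keeps cookies between requests."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


opener, cookie_jar = build_browser_like_opener(IE6_USER_AGENT)

# Hypothetical usage; the URL is a placeholder, not the site in question:
# html = opener.open("http://example.com/").read()
```

If the page still comes back blank with both of these in place, the site is probably keying on something else (a Referer header or script-generated content, say), and driving IE directly is the simpler route.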

If you do want to use IE, it's really easy. Let's assume you have an ie
object that you've gotten with:

import win32com.client
ie = win32com.client.Dispatch( 'InternetExplorer.Application' )

and you've navigated to your URL using ie.Navigate( url ), and you've waited
for Navigate to finish. Then, you can get the document with:

doc = ie.Document

From there, you can get to anything. If you want the entire HTML source,
it's:

doc.documentElement.outerHTML

Or better yet, you can use the IE object model to let IE do the work of
parsing the HTML for you. For example, suppose the document contains a form
named 'loginForm' with 'username' and 'password' fields, and you want to
fill in those two fields and submit the form. You could do it with:

form = doc.forms.loginForm
form.username.value = 'myname'
form.password.value = 'mypassword'
form.submit()

Basically, you can use about the same code you'd use in JavaScript or Visual
Basic inside the web page.

Here's the MSDN reference for the InternetExplorer object:

http://msdn.microsoft.com/workshop/browser/webbrowser/reference/objects/internetexplorer.asp

And here's the reference for the document object:

http://msdn.microsoft.com/workshop/author/dhtml/reference/objects/obj_document.asp

(Sorry about the long URLs; you know what to do.)

One other note: You probably already know about this, but after you call
Navigate, you need to wait until IE has loaded the page. You can either
use the NavigateComplete2 event, or it may be easier to cheat a bit and use
a loop with time.sleep() and test the ie.Busy property. I like to wait until
ie.Busy is false and remains false for a couple of seconds, to avoid being
tripped up by redirects where Busy may go false momentarily and then become
true again during the redirect.
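
Here is a minimal sketch of that polling loop. It only assumes an object with a Busy property, such as the ie object above; the function name, the default timings, and the timeout are my own choices for illustration, not anything from the IE object model.

```python
import time


def wait_until_idle(ie, settle_seconds=2.0, poll_seconds=0.1,
                    timeout_seconds=60.0):
    """Poll ie.Busy until it has stayed false for settle_seconds.

    Returns True once the browser has been continuously idle for that
    long, or False if timeout_seconds passes first.
    """
    start = time.monotonic()
    idle_since = None  # when Busy most recently became false
    while True:
        now = time.monotonic()
        if ie.Busy:
            # A redirect (or the initial load) is in progress: start over.
            idle_since = None
        else:
            if idle_since is None:
                idle_since = now
            if now - idle_since >= settle_seconds:
                return True
        if now - start >= timeout_seconds:
            return False
        time.sleep(poll_seconds)
```

You would call wait_until_idle( ie ) right after ie.Navigate( url ) and only touch ie.Document if it returns True. The timeout is there so a page that never finishes loading doesn't hang the script forever.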

-Mike




