kushal.kumaran at gmail.com
Mon Apr 13 14:41:54 CEST 2009
On Mon, Apr 13, 2009 at 11:13 AM, larryzhang <zhangle2002 at gmail.com> wrote:
> Being a newbie for Python, I am trying to write a code that can act as
> a downloading robot.
This might be useful: http://wwwsearch.sourceforge.net/mechanize/.
I've only skimmed the page and haven't actually used it. If you
prefer, you can also use urllib2 from the standard library to do
all the work yourself. Notes for that approach are below.
> The website provides information for companies. Manually, I can search
> by company name and then click the “download” button to get the data
> in excel or word format, before saving the file in a local directory.
> The program is to do this automatically.
> I have met several problems when writing the codes:
> 1. The website needs user ID and password, is there a way that I can
> pass my ID and password to the server in my python code?
See the examples in the urllib2 documentation for how to send a
username and password for Basic authentication. If the authentication
is done using forms, you'll need to send that data with your request;
you need to be prepared to handle that.
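
Here is a rough sketch of both cases, not a tested recipe for your site.
It is written for Python 3, where the urllib2 classes live in
urllib.request (and cookielib is http.cookiejar). The URL, the
credentials, and the form field names "username" and "password" are
placeholders I made up; the real names have to come from the site's
login form. The cookie handling is an assumption too: many form logins
hand back a session cookie, but yours may not.

```python
# Python 3 sketch; in Python 2 the same classes live in urllib2 and cookielib.
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: replace with the real site, ID, and password.
BASE_URL = "http://www.example.com/"
USERNAME = "my_id"
PASSWORD = "my_password"


def make_basic_auth_opener(base_url, username, password):
    """Return an opener that answers HTTP Basic authentication challenges."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm means "use these credentials for any realm".
    password_mgr.add_password(None, base_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def make_form_login_request(login_url, username, password):
    """Build the POST a login form would send (field names are assumed)."""
    fields = {"username": username, "password": password}
    data = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=data)


def make_cookie_opener():
    """Return an opener that stores cookies and sends them back later."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Nothing is sent until you call opener.open(...) on a URL or request.
basic_opener = make_basic_auth_opener(BASE_URL, USERNAME, PASSWORD)
login_request = make_form_login_request(BASE_URL + "login", USERNAME, PASSWORD)
cookie_opener, cookie_jar = make_cookie_opener()
```

For a form login you would call cookie_opener.open(login_request) once
and then reuse cookie_opener for every later request, so that any
cookie set at login travels with them.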
> 2. Can Python hit the “download” button automatically and choose the
> type of file format as I can do manually?
The download button probably just sends an appropriate GET or POST
request. You'll need to be familiar with HTML forms to be able to do
this.
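
To illustrate what "pressing the button" could look like from Python,
here is a minimal sketch (Python 3; urllib2 offers the same Request
class in Python 2). The action URL and the field names "company" and
"format" are invented for the example. The real ones are whatever the
site's form uses, and the file format choice is most likely just one
more field in that form.

```python
import shutil
import urllib.parse
import urllib.request


def build_download_request(action_url, fields, method="POST"):
    """Build the request a browser would send when the form is submitted."""
    encoded = urllib.parse.urlencode(fields)
    if method.upper() == "GET":
        # GET forms carry their fields in the query string.
        return urllib.request.Request(action_url + "?" + encoded)
    # POST forms carry their fields in the request body.
    return urllib.request.Request(action_url, data=encoded.encode("ascii"))


def save_response(opener, request, path):
    """Send the request and copy the reply body into a local file."""
    with opener.open(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# Hypothetical form: the URL and field names must match the real page.
request = build_download_request(
    "http://www.example.com/download",
    {"company": "Acme Ltd", "format": "excel"},
)
```

With an opener that already carries your login, something like
save_response(opener, request, "acme.xls") would then write the file
to the local directory.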
> 3. The url of each downloading webpage is not unique (webpages point
> to different data files may share the same url), which prevent me from
> working directly with the url as the address to find a certain file.
> Is there any solution for this? Does this mean I have to work directly
> with the database stored in the server rather than with the webpage
This simply means that the identifiers for the file to download are
being passed in using means other than the URL, most likely as POST
data. Look at the HTML for the page to see how.
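
If reading the HTML by eye gets tedious, a small parser can list the
fields each form would submit, including hidden ones. This is only a
sketch using Python 3's html.parser; the sample HTML and the field
names "file_id" and "format" are made up to show the idea, and it only
looks at <input> tags (not <select> or <textarea>).

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect each form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").upper(),
                "fields": {},
            })
        elif tag == "input" and self.forms and attrs.get("name"):
            # Attach the field to the most recently opened form.
            self.forms[-1]["fields"][attrs["name"]] = attrs.get("value") or ""


# Made-up page fragment; a real page would come from opener.open(url).read().
SAMPLE_HTML = """
<form action="/download" method="post">
  <input type="hidden" name="file_id" value="12345">
  <input type="hidden" name="format" value="xls">
  <input type="submit" value="Download">
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_HTML)
form = parser.forms[0]
print(form["method"], form["action"], form["fields"])
```

The hidden fields found this way are the identifiers that distinguish
one file from another even though the URL stays the same.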
> Thank you very much for any comments and suggestions.
You'll find tools that let you observe the communication between your
browser and the web server useful. If you use Mozilla Firefox, the
httpfox extension might help.
More information about the Python-list