>> I wrote a Python program (103 lines, below) to download developer data
>> from SourceForge for research about social networks.
>> Please critique the code and let me know how to improve it.
>> An example use of the program:
>> prompt> python 1 240000
>> The above command downloads data for the projects with IDs between 1
>> and 240000, inclusive. As it runs, it prints status messages, with a
>> plus sign meaning that the project ID exists. Else, it prints a minus
>> sign.
>> Questions:
>> --- Are my setup and use of threads, the queue, and "while True" loop
>> correct or conventional?
>> --- Should the program sleep sometimes, to be nice to the SourceForge
>> servers, and so they don't think this is a denial-of-service attack?
>> --- Someone told me that popen is not thread-safe, and to use
>> mechanize. I installed it and followed an example on the web site.
>> There wasn't a good description of it on the web site, or I didn't
>> find it. Could someone explain what mechanize does?
>> --- How do I choose the number of threads? I am using a MacBook Pro
>> 2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
>> 10.5.3.
>> Thank you.
>> Winston
> String methods are quicker than regular expressions, so don't use
> regular expressions if string methods are perfectly adequate. For
> example, you can replace:


Erm, surely the bottleneck will be bandwidth, not processor/memory?* If 
it isn't, then yes, you run the risk of actually DoSing their servers!

Your Mac will run thousands of threads comfortably, but your router may 
not handle the thousands of TCP/IP connections you throw at it very 
well, especially if it is a domestic model, and sure as hell SourceForge 
aren't going to want more than a handful of concurrent connections from 
you.
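
To make "a handful of connections" concrete, here is a rough sketch of a 
small fixed pool of worker threads that each pause after every request. 
It is not the original program (which isn't shown here); the names crawl, 
fetch, num_workers and delay are made up for illustration, and fetch is 
whatever function you already use to download one project page.

```python
import queue
import threading
import time


def crawl(project_ids, fetch, num_workers=4, delay=1.0):
    """Fetch every ID using a small, fixed number of threads.

    fetch is a caller-supplied function that takes one project ID and
    returns the page (hypothetical; plug in your own downloader).
    Each worker sleeps `delay` seconds after every request.
    """
    tasks = queue.Queue()
    for pid in project_ids:
        tasks.put(pid)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                pid = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is finished
            try:
                page = fetch(pid)
            except Exception:
                page = None  # record the failure and carry on
            with lock:
                results[pid] = page
            time.sleep(delay)  # be polite between requests

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With num_workers=4 there are never more than four connections open at 
once, however many IDs are queued, and delay sets the pace of each one.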

Typical SourceForge page ~ 30K
Project pages to read    = 240000

30K x 240000             = ~6.8 Gigabytes

Maybe send their sysadmin a box of chocolates if you want to grab all 
that in any less than a week and not get your IP blocked! :)
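
For anyone who wants to play with the numbers, the back-of-envelope sum 
above can be redone in a few lines. The 30K page size is only my rough 
guess, and days_needed is a made-up helper that assumes requests go out 
one after another at a fixed interval.

```python
PAGE_SIZE_KB = 30        # rough guess at a typical project page
PROJECT_COUNT = 240000   # IDs 1 to 240000, inclusive

total_kb = PAGE_SIZE_KB * PROJECT_COUNT
total_gb = total_kb / (1024.0 * 1024.0)  # taking 1K as 1024 bytes
print("Total download: about %.2f gigabytes" % total_gb)


def days_needed(seconds_per_request, total_requests=PROJECT_COUNT):
    """Days to issue every request, one at a time, at a fixed interval."""
    return total_requests * seconds_per_request / 86400.0


for interval in (1.0, 2.5, 5.0):
    print("One request every %.1f s: %.1f days"
          % (interval, days_needed(interval)))
```

One request every 2.5 seconds works out to roughly a week for the whole 
range.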

Roger Heathcote

* Of course, stylistically, MRAB is perfectly right about not wasting 
CPU on regexes where string methods will do, unless you are planning on 
making your searches more elaborate in the future.
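
To illustrate the general point (this is not MRAB's example, and the page 
text and marker below are invented), a fixed-substring test and a simple 
extraction can both be done without the re module:

```python
import re

# Made-up page text, purely for illustration.
html = "<html><title>SourceForge.net: Example</title></html>"
marker = "<title>"

# Regular-expression version of a fixed-substring test.
found_re = re.search(re.escape(marker), html) is not None

# String-method version of the same test.
found_str = marker in html

# Pulling out the text between two fixed markers with str.find.
start = html.find("<title>") + len("<title>")
end = html.find("</title>", start)
title = html[start:end]
print(found_re, found_str, title)
```

If the pattern later needs wildcards or alternatives, that is the point 
at which a regex starts to earn its keep.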

