please critique my thread code
winston at cs.wisc.edu
Sun Jun 15 15:29:42 CEST 2008
I wrote a Python program (103 lines, below) to download developer data
from SourceForge for research about social networks.
Please critique the code and let me know how to improve it.
An example use of the program:
prompt> python download.py 1 240000
The above command downloads data for the projects with IDs between 1
and 240000, inclusive. As it runs, it prints status messages, with a
plus sign meaning that the project ID exists. Otherwise, it prints a
minus sign.
--- Are my setup and use of threads, the queue, and "while True" loop
correct or conventional?
--- Should the program sleep sometimes, to be nice to the SourceForge
servers, and so they don't think this is a denial-of-service attack?
--- Someone told me that popen is not thread-safe, and to use
mechanize. I installed it and followed an example on the web site.
There wasn't a good description of it on the web site, or I didn't
find it. Could someone explain what mechanize does?
--- How do I choose the number of threads? I am using a MacBook Pro
2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
# Winston C. Yang
# Created 2008-06-14
from __future__ import with_statement
import Queue
import re
import sys
import threading
import time
import mechanize
lock = threading.RLock()
# Make the dot match even a newline.
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
def now():
    return time.strftime("%Y-%m-%d %H:%M:%S")
id = queue.get()
request = mechanize.Request("http://sourceforge.net/project/"\
response = mechanize.urlopen(request)
text = response.read()
valid_id = not error_pattern.match(text)
f = open("%d.csv" % id, "w+")
print "\t".join((str(id), now(), "+" if valid_id else "-"))
print "usage: python application start_id end_id"
print "Get the usernames associated with each SourceForge project"
print "ID between start_id and end_id, inclusive."
print "start_id and end_id must be positive integers and satisfy"
print "start_id <= end_id."
if __name__ == "__main__":
if len(sys.argv) == 3:
start_id = int(sys.argv[1])
if start_id <= 0:
end_id = int(sys.argv[2])
if end_id < start_id:
# Print the start time.
start_time = now()
# Create a directory whose name contains the start time.
dir = start_time.replace(" ", "_").replace(":", "_")
queue = Queue.Queue(0)
for i in xrange(32):
t = threading.Thread(target=worker, name="worker %d" % (i +
for id in xrange(start_id, end_id + 1):
# When the queue has size zero, exit in three seconds.
if queue.qsize() == 0:
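
For comparison, here is a minimal sketch of one conventional alternative to polling queue.qsize() and exiting after a fixed delay: each worker calls task_done() for every item it takes, the main thread blocks on join(), and a sentinel value tells each worker to leave its "while True" loop. The sketch is written for Python 3 (where the module is named queue rather than Queue) and is not the program above. The names fetch, run, NUM_WORKERS, DELAY_SECONDS and STOP are hypothetical, and fetch is a stand-in that does no network access.

```python
import queue
import threading
import time

NUM_WORKERS = 4        # hypothetical value; the right number is found by experiment
DELAY_SECONDS = 0.0    # hypothetical pause after each request, to go easy on the server
STOP = object()        # sentinel that tells a worker to exit its loop


def fetch(project_id):
    """Stand-in for the real download; pretends that even IDs exist."""
    return project_id % 2 == 0


def worker(tasks, results, lock):
    while True:
        project_id = tasks.get()
        try:
            if project_id is STOP:
                return
            exists = fetch(project_id)
            with lock:  # guard the shared list
                results.append((project_id, exists))
            time.sleep(DELAY_SECONDS)
        finally:
            # Runs for every item, including the sentinel, so join() can return.
            tasks.task_done()


def run(start_id, end_id):
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for project_id in range(start_id, end_id + 1):
        tasks.put(project_id)
    for _ in threads:
        tasks.put(STOP)  # one sentinel per worker
    tasks.join()         # blocks until every item has been marked done
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    for project_id, exists in run(1, 6):
        print(project_id, "+" if exists else "-")
```

With this structure the main thread never has to guess when the work is finished, and setting DELAY_SECONDS above zero throttles each worker without changing the rest of the code.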