please critique my thread code
pfreixes at gmail.com
Sun Jun 15 16:22:22 CEST 2008
The main while loop in the main thread uses all the CPU time. It would be better
to put a short sleep between iterations or to use some synchronization method.
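One way to do the synchronization, as a minimal sketch: it assumes Python 3 (where the module is named `queue` rather than `Queue`), and `run_workers` and `process` are names made up for this illustration, not part of the posted script. The main thread blocks in `join()` instead of polling `qsize()`, and each worker calls `task_done()` once it has finished an item.

```python
import queue
import threading


def run_workers(items, process, num_threads=4):
    """Run process(item) for every item on worker threads and wait for all."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()              # blocks without using CPU
            if item is None:               # sentinel: no more work
                work.task_done()
                break
            try:
                value = process(item)
            except Exception as exc:       # keep the worker alive on errors
                value = exc
            with results_lock:
                results.append(value)
            work.task_done()               # mark this item as finished

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in items:
        work.put(item)
    for _ in threads:
        work.put(None)                     # one sentinel per worker

    work.join()                            # sleeps until every item is done
    for t in threads:
        t.join()
    return results
```

In the posted script, something like this would take the place of the final `while True` loop that checks `queue.qsize()`.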
And about your questions, IMO:
> --- Are my setup and use of threads, the queue, and "while True" loop
> correct or conventional?
Maybe there is another possibility, but this one is fine. Another question is
whether iterating over the 240000 numbers is the most efficient way to
retrieve all the projects.
> --- Should the program sleep sometimes, to be nice to the SourceForge
> servers, and so they don't think this is a denial-of-service attack?
You are already limiting the number of connections with your concurrent threads.
I don't believe SourceForge lacks the capacity to handle your requests.
> --- Someone told me that popen is not thread-safe, and to use
> mechanize. I installed it and followed an example on the web site.
> There wasn't a good description of it on the web site, or I didn't
> find it. Could someone explain what mechanize does?
I don't know, but if you aren't sure you can use urllib2.
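A rough sketch of the same request with the standard library only. It assumes Python 3, where `urllib2` became `urllib.request`; the URL is the one from the posted script, while the function names and the 30-second timeout are assumptions made for this illustration.

```python
import urllib.request

# URL pattern copied from the posted script.
BASE_URL = "http://sourceforge.net/project/memberlist.php?group_id=%d"


def member_list_url(group_id):
    """Build the member-list URL for one project ID."""
    return BASE_URL % group_id


def fetch_member_list(group_id, timeout=30):
    """Download the page for one project ID and return the raw bytes."""
    url = member_list_url(group_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Each worker thread would call `fetch_member_list(id)` where the script currently builds a `mechanize.Request`.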
> --- How do I choose the number of threads? I am using a MacBook Pro
> 2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
By default, pthreads on Linux reserves 8 MB for each thread's stack; I don't
know about your MacBook. I think between 64 and 128 threads is fine.
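With an 8 MB default, 64 threads reserve about 512 MB of address space for stacks and 128 threads about 1 GB. If that is a concern, the stack size can be lowered with `threading.stack_size()` before the threads are started. A minimal sketch, assuming Python 3; the 256 KiB value and the helper name `start_small_stack_threads` are assumptions for illustration, and some platforms may reject a given size.

```python
import threading

# Assumed value for illustration: 256 KiB per thread.
STACK_SIZE = 256 * 1024


def start_small_stack_threads(target, count, stack_size=STACK_SIZE):
    """Start `count` threads that each use `stack_size` bytes of stack."""
    previous = threading.stack_size(stack_size)   # returns the old setting
    try:
        threads = [threading.Thread(target=target) for _ in range(count)]
        for t in threads:
            t.start()                             # size applies at start time
    finally:
        threading.stack_size(previous)            # restore the old setting
    return threads
```

Workers that only download pages and write files need little stack, so a smaller value is usually safe, but deeply recursive code would need more.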
> Thank you.
> #!/usr/bin/env python
> # Winston C. Yang
> # Created 2008-06-14
> from __future__ import with_statement
> import mechanize
> import os
> import Queue
> import re
> import sys
> import threading
> import time
> lock = threading.RLock()
> # Make the dot match even a newline.
> error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
> def now():
>     return time.strftime("%Y-%m-%d %H:%M:%S")
> def worker():
>     while True:
>         try:
>             id = queue.get()
>         except Queue.Empty:
>             continue
>         request = mechanize.Request("http://sourceforge.net/project/"\
>                                     "memberlist.php?group_id=%d" % id)
>         response = mechanize.urlopen(request)
>         text = response.read()
>         valid_id = not error_pattern.match(text)
>         if valid_id:
>             f = open("%d.csv" % id, "w+")
>             f.write(text)
>             f.close()
>         with lock:
>             print "\t".join((str(id), now(), "+" if valid_id else "-"))
> def fatal_error():
>     print "usage: python application start_id end_id"
>     print "Get the usernames associated with each SourceForge project"
>     print "ID between start_id and end_id, inclusive."
>     print "start_id and end_id must be positive integers and satisfy"
>     print "start_id <= end_id."
> if __name__ == "__main__":
>     if len(sys.argv) == 3:
>         start_id = int(sys.argv[1])
>         if start_id <= 0:
>             raise Exception
>         end_id = int(sys.argv[2])
>         if end_id < start_id:
>             raise Exception
>     # Print the start time.
>     start_time = now()
>     print start_time
>     # Create a directory whose name contains the start time.
>     dir = start_time.replace(" ", "_").replace(":", "_")
>     os.mkdir(dir)
>     os.chdir(dir)
>     queue = Queue.Queue(0)
>     for i in xrange(32):
>         t = threading.Thread(target=worker, name="worker %d" % (i + 1))
>         t.setDaemon(True)
>         t.start()
>     for id in xrange(start_id, end_id + 1):
>         queue.put(id)
>     # When the queue has size zero, exit in three seconds.
>     while True:
>         if queue.qsize() == 0:
>             time.sleep(3)
>             print now()
>             break