practical limits of urlopen()

Steve Holden steve at holdenweb.com
Sat Jan 24 12:50:28 EST 2009


webcomm wrote:
> Hi,
> 
> Am I going to have problems if I use urlopen() in a loop to get data
> from 3000+ URLs?  There will be about 2KB of data on average at each
> URL.  I will probably run the script about twice per day.  Data from
> each URL will be saved to my database.
> 
> I'm asking because I've never opened that many URLs before in a loop.
> I'm just wondering if it will be particularly taxing for my server.
> Is it very uncommon to get data from so many URLs in a script?  I
> guess search spiders do it, so I should be able to as well?
> 
You shouldn't expect problems - though you might want to think about
using a more advanced technique such as threading to get your results
more quickly.

This is Python, though. It shouldn't take long to write a test program
to verify that you can indeed spider 3,000 pages this way.
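Something along these lines would do as a first cut (a minimal sketch;
the urls list and save_to_db() here are placeholders for your own data
and database code, and the timeout= argument needs Python 2.6 or later):

import urllib2

# Hypothetical URL list - replace with the 3000+ URLs you actually have.
urls = ["http://www.example.com/item/%d" % i for i in range(3000)]

def save_to_db(url, data):
    # Placeholder: swap in your real database insert here.
    print "%s: %d bytes" % (url, len(data))

for url in urls:
    try:
        # HTTPError is a subclass of URLError, so this catches both
        # network failures and bad HTTP status codes.
        data = urllib2.urlopen(url, timeout=10).read()
    except urllib2.URLError, e:
        print "failed %s: %s" % (url, e)
        continue
    save_to_db(url, data)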

With about 2KB per page, 3,000 pages is only around 6MB, so you could
probably build up a memory structure containing the whole content of
every page without memory usage becoming excessive on a modern system.
If you are writing data out to a database as you go and not retaining
page content, there should be no problems whatsoever.

Then look at a parallelized solution of some sort if you need it to work
more quickly.
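For instance, a minimal threaded sketch using the standard Queue and
threading modules (again, urls and save_to_db() are hypothetical
stand-ins, and the worker count is illustrative):

import threading
import urllib2
from Queue import Queue, Empty

NUM_WORKERS = 10  # Illustrative; tune for your bandwidth and server.

urls = ["http://www.example.com/item/%d" % i for i in range(3000)]

def save_to_db(url, data):
    # Placeholder as before. If your database driver is not
    # thread-safe, collect results and write from the main thread.
    print "%s: %d bytes" % (url, len(data))

def worker(q):
    while True:
        try:
            url = q.get_nowait()
        except Empty:
            return  # queue drained, this worker is done
        try:
            data = urllib2.urlopen(url, timeout=10).read()
        except urllib2.URLError, e:
            print "failed %s: %s" % (url, e)
            continue
        save_to_db(url, data)

q = Queue()
for url in urls:
    q.put(url)

threads = [threading.Thread(target=worker, args=(q,))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Ten workers is an arbitrary choice - beyond a certain point, adding
threads just hammers the remote server rather than speeding you up.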

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/
