practical limits of urlopen()
steve at holdenweb.com
Sat Jan 24 18:50:28 CET 2009
> Am I going to have problems if I use urlopen() in a loop to get data
> from 3000+ URLs? There will be about 2KB of data on average at each
> URL. I will probably run the script about twice per day. Data from
> each URL will be saved to my database.
> I'm asking because I've never opened that many URLs before in a loop.
> I'm just wondering if it will be particularly taxing for my server.
> Is it very uncommon to get data from so many URLs in a script? I
> guess search spiders do it, so I should be able to as well?
You shouldn't expect problems - though you might want to think about
using a more advanced technique like threading to fetch your results
concurrently.
This is Python, though. It shouldn't take long to write a test program
to verify that you can indeed spider 3,000 pages this way.
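A minimal sequential sketch of such a test (written against Python 3's urllib.request; in 2009 the equivalent would have been urllib2 - the fetch helper and its names here are illustrative, not from the original post):

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Fetch one URL; return its body as bytes, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        return None


def fetch_all(urls, fetch=fetch):
    """Sequentially fetch every URL, yielding (url, body) pairs.

    The fetch function is a parameter so a fake can be swapped in
    for testing without hitting the network.
    """
    for url in urls:
        yield url, fetch(url)
```

Looping this over 3,000 URLs is nothing unusual - the per-request timeout is the main thing to set, so one dead server doesn't stall the whole run.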
With about 2KB per page, you could probably build up a memory structure
containing the whole content of every page without memory usage becoming
too excessive for modern systems. If you are writing stuff out to a
database as you go and not retaining page content then there should be
no problems whatsoever.
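A sketch of that write-as-you-go approach, using the standard-library sqlite3 module as a stand-in for whatever database is actually in use (table and function names are made up for illustration):

```python
import sqlite3


def save_pages(pages, db_path="pages.db"):
    """Write (url, body) pairs to SQLite as they arrive, so no
    page content accumulates in memory. Returns the row count."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)"
    )
    count = 0
    for url, body in pages:
        con.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body)
        )
        con.commit()  # commit per page: each result is durable immediately
        count += 1
    con.close()
    return count
```

Because pages is consumed lazily (a generator works fine here), peak memory stays at roughly one page regardless of how many URLs are crawled.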
Then look at a parallelized solution of some sort if you need it to work
faster.
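One way that parallelization might look today is concurrent.futures (which didn't exist in 2009 - then you'd hand-roll it with threading and Queue); the fetch function is assumed to be whatever single-URL fetcher you already have:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_parallel(urls, fetch, max_workers=20):
    """Fetch URLs concurrently with a thread pool; returns {url: body}.

    Threads suit this workload because each request spends most of
    its time blocked on network I/O, not the GIL.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

With a couple of dozen workers, 3,000 small pages should take minutes rather than hours - the limit becomes the remote servers' politeness expectations, not your own machine.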
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/