[Baypiggies] web scraping best practice question
DennisR at dair.com
Mon Nov 2 22:24:49 CET 2009
At 11:22 AM 11/2/2009, Isaac wrote:
>Does anyone have recommendations for
>best practices regarding the rate of sending a set of queries?
I assume this is not your site. You want to make sure that you only
need to do this ONCE and do not have to repeat the process because of
avoidable mistakes:
1) Save the pages you access so that you have a local copy if you
need to re-parse, or if you hit an error and need to recover, without
reacquiring anything from the server.
2) try this out with, say, 25 in the list to make sure there are no
obvious errors. Test, test, test.
3) Show the status of what is going on so that you can effectively
monitor that the operation is normal. If you are not going to monitor,
consider running this in the early hours to lessen impact.
4) As long as you are inserting delays, time them from the successful
completion of the previous request rather than the initiation. You
should avoid creating remote zombies.
5) Simulate some error conditions before going live so that you know
your logging allows you to go back and get those specific pages manually.
6) Evaluate how much data you are transferring (some web pages are
very heavy). You could be cutting heavily into the site's budgeted
transfer allowance and making them incur extra bandwidth charges. Not
a win-friends-and-influence-people move.
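A minimal sketch pulling several of these tips together: every page is
saved locally so re-parsing never re-hits the server, progress and
failures are logged so the run can be monitored and failed URLs fetched
manually later, and the delay is timed from the completion of each
request rather than its initiation. The URL list, cache directory, and
delay value are assumptions for illustration.

```python
import time
import logging
from pathlib import Path
from urllib.request import urlopen

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("scraper")

def fetch_page(url, timeout=30):
    """Fetch one URL as raw bytes; caller handles errors and logging."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def scrape(urls, cache_dir="pages", delay=2.0, fetch=fetch_page):
    """Fetch each URL once, saving the raw bytes to disk.

    Returns the list of URLs that failed, so the log can be used to
    go back and get those specific pages manually.
    """
    out = Path(cache_dir)
    out.mkdir(exist_ok=True)
    failed = []
    for i, url in enumerate(urls):
        dest = out / f"page_{i:05d}.html"
        if dest.exists():  # already fetched on a previous run; skip
            log.info("skip %s (cached)", url)
            continue
        try:
            data = fetch(url)
            dest.write_bytes(data)
            log.info("ok   %s (%d bytes)", url, len(data))
        except Exception as exc:
            log.error("FAIL %s: %s", url, exc)
            failed.append(url)
        # Delay timed from the successful (or failed) completion of
        # the previous request, not from its initiation.
        time.sleep(delay)
    return failed
```

Passing a small `fetch` stub in place of `fetch_page` is also a handy
way to "test, test, test" with a short list before going live.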