[Baypiggies] web scraping best practice question

Dennis Reinhardt DennisR at dair.com
Mon Nov 2 22:24:49 CET 2009


At 11:22 AM 11/2/2009, Isaac wrote:
>Does anyone have recommendations for
>best practices regarding rate of sending a set of queries?


I assume this is not your site.  You want to make sure that you only 
need to do this ONCE and do not have to repeat the process because of 
some error:

1) Save the pages you access so that if you need to re-parse, or you 
hit an error partway through, you have a local copy and do not need 
to reacquire them (a sketch follows this list).

2) Try this out with, say, 25 URLs in the list first to make sure 
there are no obvious errors.  Test, test, test.

3) Show the status of what is going on so that you can effectively 
monitor that the operation is proceeding normally.  If you are not 
going to monitor it, consider running in the early hours to lessen 
the impact on the site.

4) As long as you are inserting delays, time them from the successful 
completion of the previous request rather than from its initiation, 
as in the sketch below.  You should avoid creating remote zombies: 
overlapping requests left piling up on a slow server.

5) Simulate some error conditions before going live so that you know 
your logging lets you go back and fetch those specific pages manually.

6) Evaluate how much data you are transferring (some web pages are 
very heavy).  You could be cutting heavily into the site's budgeted 
transfer allowance and making the owners incur extra bandwidth 
charges.  Not a "win friends and influence people" move.  The second 
sketch below shows one rough way to estimate this up front.
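Putting several of these together (items 1, 3, 4, and 5), here is a 
minimal sketch in Python.  The URL list, file names, and the 
5-second delay are placeholder assumptions, not recommendations; 
pick a delay the site owner would consider reasonable.

import logging
import time
import urllib.error
import urllib.request
from pathlib import Path

URLS = ["http://www.example.com/page1",    # hypothetical URL list
        "http://www.example.com/page2"]
DELAY_SECONDS = 5                          # assumed courtesy delay
SAVE_DIR = Path("pages")

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
SAVE_DIR.mkdir(exist_ok=True)

total_bytes = 0
for i, url in enumerate(URLS, 1):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read()
    except (urllib.error.URLError, OSError) as exc:
        # Item 5: log enough detail to fetch this page manually later.
        logging.error("FAILED %s: %s", url, exc)
    else:
        # Item 1: keep a raw local copy so re-parsing never re-fetches.
        (SAVE_DIR / ("page_%04d.html" % i)).write_bytes(body)
        total_bytes += len(body)
        # Item 3: visible status so you can monitor the run.
        logging.info("ok %s (%d bytes, %d total)",
                     url, len(body), total_bytes)
    # Item 4: the delay starts only after the request has fully
    # completed (or failed), so a slow server never ends up handling
    # two of our requests at once.
    time.sleep(DELAY_SECONDS)

logging.info("done, %d bytes transferred", total_bytes)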
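And for item 6, a rough pre-flight estimate of the total transfer, 
assuming the server answers HEAD requests with a Content-Length 
header (many do not, so treat the number as an estimate at best):

import urllib.request

def estimated_total_bytes(urls, sample=5):
    # HEAD the first few pages and extrapolate the average size to
    # the whole list; returns None if no Content-Length came back.
    sizes = []
    for url in urls[:sample]:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=30) as resp:
            length = resp.headers.get("Content-Length")
            if length is not None:
                sizes.append(int(length))
    if not sizes:
        return None
    return (sum(sizes) // len(sizes)) * len(urls)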

Dennis 


