urllib (54, 'Connection reset by peer') error

John Nagle nagle at animats.com
Sat Jun 21 12:51:31 EDT 2008


Tim Golden wrote:
> chrispoliquin at gmail.com wrote:
>> Thanks for the help.  The error handling worked to a certain extent
>> but after a while the server does seem to stop responding to my
>> requests.
>>
>> I have a list of about 7,000 links to pages I want to parse the HTML
>> of (it's basically a web crawler) but after a certain number of
>> urlretrieve() or urlopen() calls the server just stops responding.
>> Anyone know of a way to get around this?  I don't own the server so I
>> can't make any modifications on that side.
> 
> I think someone's already mentioned this, but it's almost
> certainly an explicit or implicit throttling on the remote server.
> If you're pulling 7,000 pages from a single server you need to
> be sure that you're within the Terms of Use of that service, or
> at the least you need to contact the maintainers in courtesy to
> confirm that this is acceptable.
> 
> If you don't you may well cause your IP block to be banned on
> their network, which could affect others as well as yourself.

    Interestingly, "lp.findlaw.com" doesn't have any visible terms of service.
The information being downloaded is case law, which is public domain, so
there's no copyright issue.  Some throttling and retrying is needed to slow
the process down, but that should be fixable.

    Try this: put in the retry code someone else suggested.  Use a variable
retry delay, and wait one retry delay between downloading files.  Whenever
a download fails, double the retry delay and try
again; don't let it get bigger than, say, 256 seconds.  When a download
succeeds, halve the retry delay, but don't let it get smaller than 1 second.
That will make your downloader self-tune to the throttling imposed by
the server.
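
    Here is a minimal sketch of that scheme, assuming Python 3's
urllib.request and a hypothetical fetch_all() helper; the 1- and
256-second bounds are the ones suggested above, everything else
(the 30-second timeout, the error handling details) is illustrative:

import time
import urllib.request
from urllib.error import URLError

MIN_DELAY = 1.0     # floor suggested above, in seconds
MAX_DELAY = 256.0   # ceiling suggested above, in seconds

def fetch_all(urls):
    # Generator: yields (url, page bytes), adapting the inter-request
    # delay to whatever throttling the server imposes.
    delay = MIN_DELAY
    for url in urls:
        while True:
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    data = resp.read()
            except (URLError, OSError):
                # Failure: double the delay (capped) and retry the same URL.
                delay = min(delay * 2, MAX_DELAY)
                time.sleep(delay)
                continue
            # Success: halve the delay, but never below the floor.
            delay = max(delay / 2, MIN_DELAY)
            yield url, data
            # Wait one retry delay before the next download.
            time.sleep(delay)
            break

    The crawler then just iterates over fetch_all(links); a server that
starts refusing connections will see the request rate back off until it
recovers.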

				John Nagle


