web crawler help?
Steve Holden
sholden at holdenweb.com
Mon Sep 9 16:07:34 EDT 2002
"Carl" <kingprad at mail.com> wrote in message
news:668b76ce.0209090652.4b074a1b at posting.google.com...
> I just wanted to say also that a lot of sites have a robots.txt file
> in the root directory with a list of pages the crawler shouldn't troll
> through. It's polite to honor it if you're grabbing tons of pages from
> a server. Probably fine to ignore if you're not using a lot of server
> time and only doing a few simple tasks.
Good point. I didn't mention it because I know that webchecker (with which I
am much more familiar) does honour the robots.txt file, while for some
reason (probably completeness?) websucker doesn't.
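
For anyone writing their own crawler, here is a minimal sketch of how the robots.txt check could look using the standard library's robot parser (urllib.robotparser in current Python). The robots.txt content, the user-agent name "example-crawler", the helper make_checker and the example.com URLs are all made up for illustration; this is not how webchecker itself does it, just one way to approach it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the sketch runs
# without touching the network. A real crawler would fetch
# http://<host>/robots.txt instead (see set_url() and read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text, user_agent="example-crawler"):
    """Return a function that says whether a URL may be fetched.

    Both this helper and the user-agent name are illustrative
    assumptions, not part of any existing tool.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


allowed = make_checker(ROBOTS_TXT)
print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/private/data.html"))
```

A crawler would call allowed(url) before each request and simply skip any URL for which it returns False.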
regards
-----------------------------------------------------------------------
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/pwp/
Previous .sig file retired to www.homeforoldsigs.com
-----------------------------------------------------------------------