web crawler help?

Steve Holden sholden at holdenweb.com
Mon Sep 9 22:07:34 CEST 2002


"Carl" <kingprad at mail.com> wrote in message
news:668b76ce.0209090652.4b074a1b at posting.google.com...
> I also wanted to say that a lot of sites have a robots.txt file
> in the root directory with a list of pages the crawler shouldn't trawl
> through. It's polite to honor it if you're grabbing tons of pages from
> a server. Probably fine to ignore if you're not using a lot of server
> time and only doing a few simple tasks.

Good point. I didn't mention it because I know that webchecker (with which I
am much more familiar) does honour the robots.txt file, while for some
reason (probably completeness?) websucker doesn't.
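
For anyone writing their own crawler, here is a rough sketch of how a
robots.txt check might look using the standard library's robot parser
(the module is urllib.robotparser in Python 3; older releases shipped it
as plain robotparser). The robots.txt text, the "ExampleCrawler" agent
name, the example.com URLs and the two helper functions are all made up
for illustration; a real crawler would fetch the file from the site root
with set_url() and read() rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, agent, urls):
    """Return only the URLs that the given user agent may fetch."""
    return [url for url in urls if parser.can_fetch(agent, url)]


parser = make_parser(ROBOTS_TXT)
urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/data.html",
]
# Only the first URL survives; the second falls under "Disallow: /private/".
print(allowed_urls(parser, "ExampleCrawler", urls))
```

Filtering the URL list before any request is made keeps the politeness
check in one place, so the fetching code never sees a disallowed page.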

regards
-----------------------------------------------------------------------
Steve Holden                                  http://www.holdenweb.com/
Python Web Programming                 http://pydish.holdenweb.com/pwp/
Previous .sig file retired to                    www.homeforoldsigs.com
-----------------------------------------------------------------------





