[Web-SIG] SelectORacle

Bill Janssen janssen at parc.com
Tue Nov 25 22:25:48 EST 2003


John,

I'm aware of at least 4 spiders written in the last year by various
research groups at PARC alone!  Usually, it's part of something
called "focussed crawling", which means examining sites for some
particular purpose.

It's wrong to ask whether there are spiders written in Python, I think.  A
more interesting question is, how many times was Python rejected as a
language in which to write crawlers because some other language had a
better library?  I know of at least one such case here in the last year.

Finally, spiders are not the only reason for CSS parsing (in fact, I'd
guess they aren't the main reason).  The issue is understanding an HTML/XML
page, regardless of where it comes from.  It may be a book in OEBPS
format, for instance, which uses CSS heavily.  CSS parsing is
important for understanding these formats now, and will become
increasingly important as HTML fades out in favor of XHTML and other
XML formats.

> *Are* there any internet search engine spiders written in Python, other
> than Google's?  Independent of the answer to that, though, how many people
> write internet search engine spiders?  Not enough to justify a CSS parser
> in the standard library!

Bill


