Myth or Urban Legend? Python => Google [ was: Why learn Python ??]
EP
EP at zomething.com
Tue Jan 13 21:38:45 EST 2004
Is it true that the original Google spider was written in Python?
I came across a paper on the web some time back that I saved and read just
last night:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305
A neat read, but I'm not sure of the authenticity of the paper: I could be
gullible. It would appear to be a paper written some years back on the
genesis of the Google search engine.
[excerpt]
Running a web crawler is a challenging task. There are tricky performance
and reliability issues and even more importantly, there are social issues.
Crawling is the most fragile application since it involves interacting with
hundreds of thousands of web servers and various name servers which are all
beyond the control of the system.
In order to scale to hundreds of millions of web pages, Google has a fast
distributed crawling system. A single URLserver serves lists of URLs to a
number of crawlers (we typically ran about 3). Both the URLserver and the
crawlers are implemented in Python. Each crawler keeps roughly 300
connections open at once. This is necessary to retrieve web pages at a fast
enough pace. At peak speeds, the system can crawl over 100 web pages per
second using four crawlers. This amounts to roughly 600K per second of
data. A major performance stress is DNS lookup. Each crawler maintains
its own DNS cache so it does not need to do a DNS lookup before crawling
each document. Each of the hundreds of connections can be in a number of
different states: looking up DNS, connecting to host, sending request, and
receiving response. These factors make the crawler a complex component of
the system. It uses asynchronous IO to manage events, and a number of
queues to move page fetches from state to state.
[/excerpt]
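
For anyone curious what that design might look like, here is a rough sketch
in present-day Python. It is not the code from the paper: it uses asyncio
coroutines where the excerpt describes explicit queues between states, and
every name in it (FAKE_DNS, DnsCache, fetch, crawler) is made up for
illustration. The network is simulated so it runs offline. The only ideas
taken from the excerpt are the URL list handed to a crawler, the cap of
roughly 300 open connections, the per-crawler DNS cache, and the four
connection states.

```python
import asyncio
from collections import Counter

# Hypothetical stand-in for real DNS so the sketch runs offline.
FAKE_DNS = {"a.example": "10.0.0.1", "b.example": "10.0.0.2"}

# The four connection states named in the excerpt.
STATES = ("looking up DNS", "connecting to host",
          "sending request", "receiving response")


class DnsCache:
    """Per-crawler cache: each host is looked up at most once."""

    def __init__(self):
        self.cache = {}    # host -> task that resolves to an address
        self.lookups = 0   # how many (simulated) lookups were started

    async def _lookup(self, host):
        await asyncio.sleep(0)  # stands in for a network round trip
        return FAKE_DNS[host]

    async def resolve(self, host):
        if host not in self.cache:
            self.lookups += 1
            # Cache the task itself so concurrent fetches share one lookup.
            self.cache[host] = asyncio.create_task(self._lookup(host))
        return await self.cache[host]


async def fetch(url, dns, states):
    """Walk one page fetch through the four states (I/O is simulated)."""
    host = url.split("/")[2]
    states[STATES[0]] += 1
    address = await dns.resolve(host)
    for state in STATES[1:]:
        states[state] += 1
        await asyncio.sleep(0)  # simulated wait on the socket
    return url, address


async def crawler(url_queue, max_connections=300):
    """Drain the URL queue, keeping at most max_connections fetches open."""
    dns, states, results = DnsCache(), Counter(), []
    limit = asyncio.Semaphore(max_connections)

    async def worker(url):
        async with limit:
            results.append(await fetch(url, dns, states))

    tasks = []
    while not url_queue.empty():
        tasks.append(asyncio.create_task(worker(url_queue.get_nowait())))
    await asyncio.gather(*tasks)
    return results, dns.lookups, states


async def main():
    url_queue = asyncio.Queue()  # plays the role of the URLserver
    for url in ("http://a.example/1", "http://a.example/2",
                "http://b.example/1"):
        url_queue.put_nowait(url)
    return await crawler(url_queue)


results, lookups, states = asyncio.run(main())
print(len(results), "pages,", lookups, "DNS lookups")
```

Three URLs on two hosts cost only two lookups here, which is the point of
the per-crawler cache. A real crawler would of course need actual sockets,
timeouts, and politeness rules, none of which are shown.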
If true, it would seem like the poster child for using Python, in some
respects.
Eric, Intrigued
"but at least I didn't top post"