George Sakkis george.sakkis at gmail.com
Mon Jun 9 22:37:07 CEST 2008

> Hi all,
> I am currently planning to write my own web crawler. I know Python but
> not Perl, and I am interested in knowing which of the two is the
> better choice given the following scenario:
> 1) I/O issues: my biggest resource constraint will be the bandwidth
> bottleneck.
> 2) Efficiency issues: The crawlers have to be fast, robust and as
> "memory efficient" as possible. I am running all of my crawlers on
> cheap PCs with about 500 MB of RAM and P3 to P4 processors.
> 3) Compatibility issues: Most of these crawlers will run on Unix
> (FreeBSD), so there should exist a pretty good compiler that can
> optimize my code under these environments.
> What are your opinions?

You mentioned *what* you want but not *why*. If it's for a real-world
production project, why reinvent a square wheel instead of using (or
at least extending) an existing open source crawler with years of
development behind it? If it's a learning exercise, why worry about
performance so early?

In any case, since you said you know Python but not Perl, the choice
is almost a no-brainer, unless you're looking for an excuse to learn
Perl. In terms of performance they are comparable, and with either one
you can probably manage crawls on the order of 10-100K pages at best.
For million-page or larger crawls, though, you'll have to resort to
C/C++ sooner or later.
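
To make the "start in Python" suggestion concrete, here is a minimal
sketch of the core loop of a small crawler: a queue of URLs to visit
plus a set of URLs already seen. It is an illustration under stated
assumptions, not a production design. The names crawl, extract_links
and LinkParser are made up for this example, and fetch is a
hypothetical caller-supplied function that returns a page's HTML text,
or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for every link in the page."""
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links and drop #fragments so duplicates collapse.
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at seed.

    fetch(url) is assumed to return HTML text, or None on failure.
    Returns the list of successfully fetched URLs, in visit order.
    """
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # every URL ever queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In real use, fetch could wrap urllib.request.urlopen with a timeout. A
serious crawler would also need robots.txt handling, per-host
politeness delays and concurrent fetching, all of which this sketch
leaves out.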


