developing web spider

Nikita the Spider NikitaTheSpider at
Fri Apr 4 00:23:16 CEST 2008

In article <47f3ab52$0$36346$742ec2ed at>,
 John Nagle <nagle at> wrote:

> abeen wrote:
> > Hello,
> > 
> > I would like to know which would be the best programming language
> > for developing a web spider.
> > The more information about the spider, the better.
>     As someone who actually runs a Python based web spider in production, I
> should comment.
>     You need a very robust parser to parse real world HTML.
> Even the stock version of BeautifulSoup isn't good enough.  We have a
> modified version of BeautifulSoup, plus other library patches, just to
> keep the parser from blowing up or swallowing the entire page into
> a malformed comment or tag.  Browsers are incredibly forgiving in this
> regard.
>     "urllib" needs extra robustness, too.  The stock timeout mechanism
> isn't good enough.  Some sites do weird things, like open TCP connections
> for HTTP but not send anything.
>     Python is on the slow side for this.  Python is about 60x
> slower than C, and for this application, you definitely see that.
> A Python based spider will go compute bound for seconds per page
> on big pages.  The C-based parsers for XML/HTML aren't robust enough for
> this application.  And then there's the Global Interpreter Lock; a multicore
> CPU won't help a multithreaded compute-bound process.
>     I'd recommend using Java or C# for new work in this area
> if you're doing this in volume.  Otherwise, you'll need to buy
> many, many extra racks of servers.  In practice, the big spiders
> are in C or C++.
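
To illustrate the point above about sites that open a TCP connection
and then never send anything: a per-read timeout alone can be defeated
by a server that trickles bytes, so one option is an overall deadline
for the whole response. This is only a minimal sketch, not what either
spider in this thread actually does; the names read_with_deadline and
DeadlineExceeded are made up for the example.

```python
import socket
import time


class DeadlineExceeded(Exception):
    """Hypothetical error: the peer did not finish within the time budget."""


def read_with_deadline(sock, total_seconds, chunk_size=4096):
    """Read from sock until EOF, giving up after total_seconds overall."""
    deadline = time.monotonic() + total_seconds
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise DeadlineExceeded("no complete response in time")
        # Each recv() may block only for what is left of the budget,
        # so a silent or trickling peer cannot stall the spider forever.
        sock.settimeout(remaining)
        try:
            data = sock.recv(chunk_size)
        except socket.timeout:
            raise DeadlineExceeded("no complete response in time")
        if not data:  # peer closed the connection: response is complete
            return b"".join(chunks)
        chunks.append(data)
```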

I'll throw in an opinion from a different viewpoint. I'm really happy I
used Python to develop my spider. I like the language, and it has a good
standard library, good community support and good third-party modules.

John, I don't know what your spider does, but you face some hurdles that 
I don't. For instance, since I'm focused on validation, if bizarre 
(invalid) HTML makes a page look like garbage, I just report the problem 
to the author. Performance isn't a big problem for me, either, since 
this is not a crawl-as-fast-as-you-can application. 
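
As a rough illustration of the "report the problem rather than cope
with it" approach, here is a crude sketch built on the standard
library's html.parser. It is not my spider's code and is nowhere near a
real validator (it ignores optional end tags, for instance); the class
and function names are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical checker that records tag-nesting problems."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line number)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> need no bookkeeping

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
        else:
            line = self.getpos()[0]
            self.problems.append("line %d: unexpected </%s>" % (line, tag))


def check(html):
    """Return a list of human-readable problems to send to the author."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for tag, line in checker.open_tags:
        checker.problems.append("line %d: <%s> never closed" % (line, tag))
    return checker.problems
```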

What you said sounds entirely correct to me for your application. The
OP, who asked for as much information as possible, didn't give us a
whole lot to start with.

Whole-site HTML validation, link checking and more
