developing web spider
Nikita the Spider
NikitaTheSpider at gmail.com
Fri Apr 4 00:23:16 CEST 2008
In article <47f3ab52$0$36346$742ec2ed at news.sonic.net>,
John Nagle <nagle at animats.com> wrote:
> abeen wrote:
> > Hello,
> > I would want to know which could be the best programming language for
> > developing web spider.
> > The more information about the spider, the better.
> As someone who actually runs a Python based web spider in production, I
> should comment.
> You need a very robust parser to parse real world HTML.
> Even the stock version of BeautifulSoup isn't good enough. We have a
> modified version of BeautifulSoup, plus other library patches, just to
> keep the parser from blowing up or swallowing the entire page into
> a malformed comment or tag. Browsers are incredibly forgiving in this
> regard.
> "urllib" needs extra robustness, too. The stock timeout mechanism
> isn't good enough. Some sites do weird things, like open TCP connections
> for HTTP but not send anything.
> Python is on the slow side for this. Python is about 60x
> slower than C, and for this application, you definitely see that.
> A Python based spider will go compute bound for seconds per page
> on big pages. The C-based parsers for XML/HTML aren't robust enough for
> this application. And then there's the Global Interpreter Lock; a multicore
> CPU won't help a multithreaded compute-bound process.
> I'd recommend using Java or C# for new work in this area
> if you're doing this in volume. Otherwise, you'll need to buy
> many, many extra racks of servers. In practice, the big spiders
> are in C or C++.
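On the point about sites that open a TCP connection for HTTP and then send nothing: here is a minimal sketch of one way to guard against that, not what John's spider or mine actually does. It assumes Python 3's urllib.request (the thread predates it), and the name fetch_with_deadline and its parameters are made up for illustration. A socket timeout only bounds each individual blocking operation, so a server that trickles a few bytes at a time can still hold a connection open for a long while; the sketch therefore also checks an overall deadline between reads.

```python
import time
import urllib.request


def fetch_with_deadline(url, read_timeout=5.0, deadline=15.0, chunk_size=8192):
    """Fetch url, giving up on silent servers and on transfers that drag on.

    Hypothetical helper for illustration only.
    read_timeout: seconds allowed for the connect and for each socket read.
    deadline: rough cap in seconds on the whole transfer (checked between reads,
              so the real limit is about deadline + read_timeout).
    """
    start = time.monotonic()
    chunks = []
    # The timeout passed here applies to the connect and to every later read.
    with urllib.request.urlopen(url, timeout=read_timeout) as response:
        while True:
            if time.monotonic() - start > deadline:
                raise TimeoutError("overall deadline exceeded for %s" % url)
            chunk = response.read(chunk_size)
            if not chunk:
                break  # end of body
            chunks.append(chunk)
    return b"".join(chunks)
```

A caller would typically wrap this in "except OSError" (which covers both urllib.error.URLError and socket timeouts), log the URL, and move on to the next page.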
I'll throw in an opinion from a different viewpoint. I'm really happy I
used Python to develop my spider. I like the language, and it has a good
library, good community support, and good third-party modules.
John, I don't know what your spider does, but you face some hurdles that
I don't. For instance, since I'm focused on validation, if bizarre
(invalid) HTML makes a page look like garbage, I just report the problem
to the author. Performance isn't a big problem for me, either, since
this is not a crawl-as-fast-as-you-can application.
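To make the "report it, don't repair it" idea concrete, here is a rough sketch. It is not the code my spider runs; PageChecker and check_page are names invented for this example, and it assumes Python 3's standard html.parser module. It collects links and records tag-nesting problems instead of trying to recover from them. The nesting check is deliberately crude and is no substitute for a real validator: HTML legally allows some end tags (</p>, </li>, and so on) to be omitted, and this sketch would flag those.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PageChecker(HTMLParser):
    """Hypothetical checker: gathers <a href> links and nesting problems."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.problems = []
        self._open_tags = []  # stack of (tag name, (line, column))

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self._open_tags.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [name for name, _pos in self._open_tags]:
            self.problems.append("line %d: stray </%s>" % (self.getpos()[0], tag))
            return
        # Pop back to the matching start tag, flagging anything left open inside it.
        while self._open_tags:
            name, (line, _col) = self._open_tags.pop()
            if name == tag:
                break
            self.problems.append(
                "line %d: <%s> not closed before </%s>" % (line, name, tag))

    def close(self):
        super().close()
        # Anything still on the stack at end of input was never closed.
        for name, (line, _col) in self._open_tags:
            self.problems.append("line %d: <%s> never closed" % (line, name))
        self._open_tags = []


def check_page(html_text):
    """Return (links, problems) for one page of HTML text."""
    checker = PageChecker()
    checker.feed(html_text)
    checker.close()
    return checker.links, checker.problems
```

The problems list is what would go back to the page's author; the links list is what a link checker would go on to fetch.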
What you said sounds to me entirely correct for your application. The OP
who asked for as much information as possible didn't give a whole lot to
Whole-site HTML validation, link checking and more