Web Crawler - Python or Perl?

Sebastian "lunar" Wiesner basti.wiesner at gmx.net
Mon Jun 9 22:06:23 CEST 2008


 subeen <tamim.shahriar at gmail.com> at Monday 09 June 2008 20:21:

> On Jun 10, 12:15 am, Stefan Behnel <stefan... at behnel.de> wrote:
>> subeen wrote:
>> > can use urllib2 module and/or beautiful soup for developing crawler
>>
>> Not if you care about a) speed and/or b) memory efficiency.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>>
>> Stefan
> 
> ya, beautiful soup is slower. so it's better to use urllib2 for
> fetching data and regular expressions for parsing data.

BeautifulSoup is itself implemented on top of regular expressions.  I doubt
that you can achieve a great performance gain by using plain regular
expressions, and even if you could, the gain would certainly not be worth
the effort.  Parsing markup with regular expressions is hard, and the result
will most likely be neither as fast nor as memory-efficient as lxml.html.
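
To illustrate why hand-written regular expressions get hard quickly, here
is a rough sketch in Python 3.  The sample markup, the naive pattern and
the helper names (links_by_regex, links_by_parser, LinkCollector) are made
up for this illustration, and the standard library's html.parser only
stands in as "a real parser" so that the example runs without extra
packages; it is not lxml.html.

```python
import re
from html.parser import HTMLParser

# Made-up sample markup with the kind of variation real pages contain.
SAMPLE = """
<a href="/one">one</a>
<A HREF='/two'>two</A>
<a class="x"
   href="/three">three</a>
<!-- <a href="/commented-out">hidden</a> -->
"""

# A naive pattern of the kind one tends to write first: lowercase tag,
# href as the first attribute, double quotes only.
NAIVE_HREF = re.compile(r'<a href="([^"]*)"')


def links_by_regex(markup):
    """Return whatever the naive pattern happens to match."""
    return NAIVE_HREF.findall(markup)


class LinkCollector(HTMLParser):
    """Collect href values of <a> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(markup):
    """Return href values found by an actual HTML parser."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


print(links_by_regex(SAMPLE))
print(links_by_parser(SAMPLE))
```

The naive pattern misses the uppercase tag and the link whose href is not
the first attribute, and it wrongly picks up the link inside the comment.
Each of these can be patched in the pattern, but that is exactly the effort
I mean.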

I personally am absolutely happy with lxml.html.  It's fast and
memory-efficient, yet powerful and easy to use.
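
As a minimal sketch of what link extraction for a crawler might look like
with lxml.html (Python 3, lxml installed): the function name extract_links,
the sample page and the example.com URLs are assumptions made for this
illustration, and fetching the page is left out entirely.

```python
import lxml.html


def extract_links(markup, base_url):
    """Return absolute href targets of all <a> elements in the markup."""
    doc = lxml.html.fromstring(markup)
    # Resolve relative links against the URL the page was fetched from.
    doc.make_links_absolute(base_url)
    # iterlinks() yields (element, attribute, link, pos) tuples for every
    # link-like attribute; keep only the href of anchor elements.
    return [
        link
        for element, attribute, link, pos in doc.iterlinks()
        if element.tag == "a" and attribute == "href"
    ]


# Hypothetical page content, standing in for a fetched document.
PAGE = (
    '<html><body><p>'
    '<a href="/about">About</a> '
    '<a href="http://example.org/x">X</a>'
    '</p></body></html>'
)

for url in extract_links(PAGE, "http://example.com/"):
    print(url)
```

A crawler would feed the returned URLs back into its queue of pages to
visit; duplicate detection and politeness rules are deliberately not shown.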

-- 
Freedom is always the freedom of dissenters.
                                      (Rosa Luxemburg)



More information about the Python-list mailing list