Web Crawler - Python or Perl?
Ray Cote
rgacote at AppropriateSolutions.com
Mon Jun 9 15:48:41 EDT 2008
At 11:21 AM -0700 6/9/08, subeen wrote:
>On Jun 10, 12:15 am, Stefan Behnel <stefan... at behnel.de> wrote:
>> subeen wrote:
>> > can use urllib2 module and/or beautiful soup for developing crawler
>>
>> Not if you care about a) speed and/or b) memory efficiency.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>>
>> Stefan
>
>ya, beautiful soup is slower. so it's better to use urllib2 for
>fetching data and regular expressions for parsing data.
>
>
>regards,
>Subeen.
>http://love-python.blogspot.com/
Beautiful Soup is a bit slower, but it will actually parse some of
the bizarre HTML you'll download off the web. We've written a couple
of crawlers to run over specific clients' sites (I note that we did
_not_ create the content on those sites).
Expect to find HTML that looks like this:
<ul>
<li>
<form>
</li>
</form>
</ul>
[from a real example, and yes, it did indeed render in IE.]
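As a rough sketch (assuming the BeautifulSoup 3.x package and the
fragment above), Beautiful Soup will still hand you a tree you can
walk:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# The fragment above: the <form> opens inside the <li> but closes
# outside it.
fragment = """
<ul>
<li>
<form>
</li>
</form>
</ul>
"""

soup = BeautifulSoup(fragment)
print soup.prettify()          # Beautiful Soup builds a tree anyway
for li in soup.findAll('li'):  # and you can still pull out the tags
    print li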
I don't know whether the quicker parsers discussed here require
well-formed HTML, since I've not used them. You may want to consider
using one of the quicker HTML parsers and, when it throws a fit on
the downloaded HTML, dropping back to Beautiful Soup -- which usually
gets _something_ useful off the page.
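Something like this sketch is what I have in mind -- parse_page is
just a made-up name, and lxml.html is only one candidate for the
"quicker parser":

from lxml import etree, html             # fast C-based parser
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

def parse_page(raw_html):
    """Try the fast parser first; fall back to Beautiful Soup."""
    try:
        # lxml tolerates a fair amount of broken markup on its own,
        # so the fallback only fires on the really hopeless pages.
        return html.fromstring(raw_html)
    except etree.LxmlError:
        # Hand whatever lxml couldn't stomach to Beautiful Soup,
        # which usually gets _something_ useful out of it.
        return BeautifulSoup(raw_html)

The two branches hand back different tree objects, of course, so the
calling code needs to know which one it got (or you wrap them behind
a common interface).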
--Ray
--
Raymond Cote
Appropriate Solutions, Inc.
PO Box 458 ~ Peterborough, NH 03458-0458
Phone: 603.924.6079 ~ Fax: 603.924.8668
rgacote(at)AppropriateSolutions.com
www.AppropriateSolutions.com