Something faster than sgmllib for sucking out URLs

damien morton morton at dennisinter.com
Thu Jun 13 00:33:16 EDT 2002


Alex Polite <m2 at plusseven.com> wrote in message news:<mailman.1023913936.26808.python-list at python.org>...
> I'm working on a webspider to fit my sick needs. The profiler
> tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> solely for extracting URLs. I'm looking for a faster way of doing
> this. Regular expressions, string searches? What's the way to go? I'm
> not a python purist. Calling some fast C program with the html as
> argument and getting back a list of URLs would be fine by me.

I've got a little webspider that can max out my cable connection.

You can download it from www.bitfurnace.com/python/crawler.py

It uses regexes to grab anything that looks like a URL.
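
For anyone who just wants the idea without downloading anything, here is a minimal sketch of the regex approach. It is not the code from crawler.py; the pattern, the name ATTR_URL_RE, and the function extract_urls are all made up for illustration, and it assumes a current Python 3 with urllib.parse. It only looks at href and src attributes, so it will miss URLs in scripts or plain text, and it will happily pick up URLs inside comments.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern (an assumption, not taken from crawler.py):
# match href= or src= followed by a double-quoted, single-quoted,
# or unquoted attribute value.
ATTR_URL_RE = re.compile(
    r'''(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))''',
    re.IGNORECASE,
)


def extract_urls(html, base_url=""):
    """Return the URLs found in href/src attributes, in document order.

    Relative URLs are resolved against base_url; duplicates are dropped.
    """
    urls = []
    seen = set()
    for match in ATTR_URL_RE.finditer(html):
        # Exactly one of the three groups participates in each match.
        raw = next(g for g in match.groups() if g is not None).strip()
        if not raw:
            continue  # skip empty values such as href=""
        url = urljoin(base_url, raw)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    page = '<a href="http://example.com/a">a</a> <img SRC=\'/img.png\'>'
    for url in extract_urls(page, "http://example.com/dir/"):
        print(url)
```

There is no tag parsing or state machine here, just one pass of a compiled regex over the page. That is presumably where the speedup over sgmllib comes from, at the price of the occasional false positive.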


