Open source web crawler with mysql integration

Philip Semanchuk philip at semanchuk.com
Fri Apr 10 11:09:58 EDT 2009


On Apr 10, 2009, at 10:28 AM, Support Desk wrote:

> Sounds interesting. When it's done, would you care to share it?

Hi Michael,
The coding has been done (as much as software is ever "done") for a
couple of years. It's mothballed, sitting on my hard drive. The problem
with open sourcing it isn't that the code is incomplete; it's that the
code is insufficiently documented, has a byzantine install procedure,
and contains a lot of code and assumptions that were relevant to my
business but would not interest most people looking to download a
general-purpose spider. I'd love to open source it, and if someone wants
to pay me to make it open source-able, let's talk! But if I have to do
it on my own time for free, it will be a while (maybe never, although I
hope not) before I can make the time.

Regards
Philip




> -----Original Message-----
> From: Philip Semanchuk [mailto:philip at semanchuk.com]
> Sent: Thursday, April 09, 2009 9:46 PM
> To: Python
> Subject: Re: Open source web crawler with mysql integration
>
>
> On Apr 9, 2009, at 7:37 PM, Daniel Fetchinson wrote:
>
>>> I'm looking for a crawler that can spider my site and toss the results
>>> into mysql so, in turn, that database can be indexed by Sphinx Search.
>>>
>>> Since I don't want to reinvent the wheel, is anyone aware of any open
>>> source projects or code snippets that can already handle this?
>>
>> Have a look at http://nikitathespider.com/python/
>
>
> As the author of Nikita, I can say that (a) she used Postgres and (b)
> the code wasn't open sourced except for a couple of small parts. The
> service is now defunct. It wasn't making money. Ideally I'd like to
> open source the code one day, but it would take a lot of documentation
> work to make it installable by others, and I won't have the time to do
> that for the foreseeable future.
>
> At the URL provided there's a nice module for parsing robots.txt files
> (better than the one in the standard library IMHO) but that's about it.
>
> FYI, I wrote my spider in Python because I couldn't find a decent one
> written in Python. There's Nutch, but that's not Python (Java I think).
>
> Good luck
> Philip
>
>
>
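
For readers who only need the robots.txt handling mentioned in the quoted
message, here is a minimal sketch using the standard library's
urllib.robotparser. This is not the Nikita module (its API isn't shown in
this thread). The robots.txt content, the "ExampleSpider" user agent, and
the example.com URLs are all made up for illustration, and the rules are
parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical candidate URLs a spider might be about to visit.
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
]
print(allowed_urls(ROBOTS_TXT, "ExampleSpider", candidates))
```

A real spider would typically call set_url() and read() to fetch the
site's own robots.txt, then run the same can_fetch() check before each
request.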



