Open source web crawler with mysql integration

Philip Semanchuk philip at semanchuk.com
Fri Apr 10 19:21:53 EDT 2009


On Apr 10, 2009, at 12:33 PM, bruce wrote:

> phillip...
>
> lots of code is open sourced "as is"!!!
>
> when you get right down to it, a good deal of "open source" code from
> sourceforge/hotscripts/freshmeat/etc.. is pretty poor, but it is open
> sourced.
>
> you could simply toss your code out into the open source pool, and not be
> worried about supporting it, or even touching it again...

You're right, and I like your enthusiasm. But I don't want to invite  
people to use my code if it's just going to be frustrating to 90% of  
them. It's bad for my reputation. And for the reputation of open  
source in general, but I'm more concerned about me. ;)




> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.net at python.org
> [mailto:python-list-bounces+bedouglas=earthlink.net at python.org] On Behalf
> Of Philip Semanchuk
> Sent: Friday, April 10, 2009 8:10 AM
> To: Python (General)
> Subject: Re: Open source web crawler with mysql integration
>
>
>
> On Apr 10, 2009, at 10:28 AM, Support Desk wrote:
>
>> Sounds interesting. When it's done would you care to share it?
>
> Hi Michael,
> The coding has been done (as much as software is ever "done") for a
> couple of years now. It's mothballed now, sitting on my hard drive.
> The problem with open sourcing it isn't that the code is incomplete,
> the problem is that it's insufficiently documented, features a
> byzantine install procedure and contains a lot of code & assumptions
> that were relevant to my business but would not be of interest to most
> people looking to download a general-purpose spider. I'd love to open
> source it and if someone wants to pay me to make it open source-able,
> let's talk! But if I have to do it on my own time for free it will be
> a while (maybe never, although I hope not) before I can make the time.
>
> Regards
> Philip
>
>
>
>
>> -----Original Message-----
>> From: Philip Semanchuk [mailto:philip at semanchuk.com]
>> Sent: Thursday, April 09, 2009 9:46 PM
>> To: Python
>> Subject: Re: Open source web crawler with mysql integration
>>
>>
>> On Apr 9, 2009, at 7:37 PM, Daniel Fetchinson wrote:
>>
>>>> I'm looking for a crawler that can spider my site and toss the results
>>>> into mysql so, in turn, that database can be indexed by Sphinx Search.
>>>>
>>>> Since I don't want to reinvent the wheel, is anyone aware of any open
>>>> source projects or code snippets that can already handle this?
>>>
>>> Have a look at http://nikitathespider.com/python/
>>
>>
>> As the author of Nikita, I can say that (a) she used Postgres and (b)
>> the code wasn't open sourced except for a couple of small parts. The
>> service is now defunct. It wasn't making money. Ideally I'd like to
>> open source the code one day, but it would take a lot of documentation
>> work to make it installable by others, and I won't have the time to do
>> that for the foreseeable future.
>>
>> At the URL provided there's a nice module for parsing robots.txt files
>> (better than the one in the standard library IMHO) but that's about it.
>>
>> FYI, I wrote my spider in Python because I couldn't find a decent one
>> written in Python. There's Nutch, but that's not Python (Java I think).
>>
>> Good luck
>> Philip
>>
>>
>>
>
>



