Open source web crawler with mysql integration
philip at semanchuk.com
Sat Apr 11 01:21:53 CEST 2009
On Apr 10, 2009, at 12:33 PM, bruce wrote:
> lots of code is open sourced "as is"!!!
> when you get right down to it, a good deal of "open source" code from
> sourceforge/hotscripts/freshmeat/etc. is pretty poor, but it is open
> you could simply toss your code out into the open source pool, and not be
> worried about supporting it, or even touching it again...
You're right, and I like your enthusiasm. But I don't want to invite
people to use my code if it's just going to be frustrating to 90% of
them. It's bad for my reputation. And for the reputation of open
source in general, but I'm more concerned about me. ;)
> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.net at python.org
> [mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On
> Of Philip Semanchuk
> Sent: Friday, April 10, 2009 8:10 AM
> To: Python (General)
> Subject: Re: Open source web crawler with mysql integration
> On Apr 10, 2009, at 10:28 AM, Support Desk wrote:
>> Sounds interesting. When it's done would you care to share it?
> Hi Michael,
> The coding has been done (as much as software is ever "done") for a
> couple of years now. It's mothballed now, sitting on my hard drive.
> The problem with open sourcing it isn't that the code is incomplete,
> the problem is that it's insufficiently documented, features a
> byzantine install procedure and contains a lot of code & assumptions
> that were relevant to my business but would not be of interest to most
> people looking to download a general-purpose spider. I'd love to open
> source it and if someone wants to pay me to make it open source-able,
> let's talk! But if I have to do it on my own time for free it will be
> a while (maybe never, although I hope not) before I can make the time.
>> -----Original Message-----
>> From: Philip Semanchuk [mailto:philip at semanchuk.com]
>> Sent: Thursday, April 09, 2009 9:46 PM
>> To: Python
>> Subject: Re: Open source web crawler with mysql integration
>> On Apr 9, 2009, at 7:37 PM, Daniel Fetchinson wrote:
>>>> I'm looking for a crawler that can spider my site and toss the
>>>> into mysql so, in turn, that database can be indexed by Sphinx
>>>> Since I don't want to reinvent the wheel, is anyone aware of any open
>>>> source projects or code snippets that can already handle this?
>>> Have a look at http://nikitathespider.com/python/
>> As the author of Nikita, I can say that (a) she used Postgres and (b)
>> the code wasn't open sourced except for a couple of small parts. The
>> service is now defunct. It wasn't making money. Ideally I'd like to
>> open source the code one day, but it would take a lot of
>> work to make it installable by others, and I won't have the time to do
>> that for the foreseeable future.
>> At the URL provided there's a nice module for parsing robots.txt
>> (better than the one in the standard library IMHO) but that's about it.
>> FYI, I wrote my spider in Python because I couldn't find a decent one
>> written in Python. There's Nutch, but that's not Python (Java, I think).
>> Good luck
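For anyone following the robots.txt pointer in the quoted message: below is a minimal sketch of the standard library parser it is being compared against, not the module from the Nikita site. It assumes Python 3, where the parser lives in urllib.robotparser (in the Python 2 of this thread it was the top-level robotparser module). The robots.txt content, the "ExampleSpider" user agent and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Return a RobotFileParser loaded from a robots.txt string."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than a single string.
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A spider would ask this before fetching each URL.
for url in ("http://example.com/index.html",
            "http://example.com/private/data.html"):
    print(url, parser.can_fetch("ExampleSpider", url))
```

A real spider would more likely call set_url() and read() to fetch the live file, and would cache one parser per host.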