[BangPypers] Announcing HarvestMan 2.0 and Hget 1.0 alpha

Thu Feb 7 13:00:09 CET 2008

Hi,

  HarvestMan has been under development since Jul 2003. However, the
last public release was in Sep 2005. Now, after a gap of more than
two years, I am announcing the initial (alpha) release of version 2.0
of HarvestMan and its companion program, Hget.

Version 2.0 is still under development and a lot of things will change
down the line. I had been thinking of making a single announcement after
everything was done; however, it looks like the complete work will take
a long time, so I have decided to make intermediate alpha and beta
releases until the final version is ready.

There are lots of changes in HarvestMan, the main one being a new
plugin feature which allows you to modify program behaviour by writing
small pieces of Python code as plugins (say, akin to Firefox extensions).
As of now, plugins exist for integration with Lucene and Swish-e.
(HarvestMan + plugins is currently being used by students at a
university in Europe to write custom web crawling applications.)

The changes are not complete yet. The program is still a single process.
As development progresses, I will be changing this first to a
client/server split and then to a p2p architecture for better scaling.

The highlight is actually another application named "Hget" which is
built on top of HarvestMan as a framework. Hget can be considered
as wget on steroids, and can be used as a download manager to
perform HTTP downloads in pieces from the web. It can perform
HTTP multipart downloading, mirror search and download, HTTP
resuming and failover, and has built-in support for sourceforge.net
mirrors.
More features are getting added daily.
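
To illustrate what "downloading in pieces" involves, here is a minimal
sketch of how a file of known size could be divided into byte ranges,
each of which would be requested separately with an HTTP Range header.
This is not taken from the Hget source; the function names
split_byte_ranges and range_header are hypothetical and exist only for
this example.

```python
def split_byte_ranges(total_size, num_parts):
    """Split a file of total_size bytes into contiguous, inclusive
    (start, end) byte ranges, one per download piece.

    Hypothetical helper for illustration; not part of Hget.
    """
    if total_size <= 0 or num_parts <= 0:
        raise ValueError("total_size and num_parts must be positive")
    # Never create more pieces than there are bytes.
    num_parts = min(num_parts, total_size)
    base, extra = divmod(total_size, num_parts)
    ranges = []
    start = 0
    for i in range(num_parts):
        # The first `extra` pieces carry one additional byte each.
        length = base + (1 if i < extra else 0)
        end = start + length - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start, end):
    """Build the HTTP request header asking for bytes start..end inclusive."""
    return {"Range": "bytes=%d-%d" % (start, end)}


if __name__ == "__main__":
    for start, end in split_byte_ranges(10, 3):
        print(range_header(start, end))
```

Each piece can then be fetched independently (possibly from different
mirrors) and written at its own offset in the output file. Resuming an
interrupted download works the same way, by requesting only the byte
range that is still missing.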

Hget and HarvestMan are packaged together. The URL is

http://www.harvestmanontheweb.com/packages/2.0/HarvestMan-2.0alpha.tar.gz

The setup.py script can be used to install both programs. I have improved
setup.py a lot and it now does a very good job of pulling in the required
dependencies and doing a clean install. HarvestMan depends on pyparsing,
so this is pulled in automatically if not found.

The current version of HarvestMan also includes a rudimentary JavaScript
parser (two, in fact). There is a pure Python parser written using
pyparsing which can extract JavaScript from HTML and do basic processing
(like document.write and JavaScript redirection). Then there is another
one, a pure Python re-implementation of RbNarcissus, a pure Ruby parser
for JavaScript.
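
As a rough illustration of what "extracting JavaScript from HTML and
handling redirection" means, here is a small sketch. Note that it is
not the HarvestMan parser: it uses only the standard library
(html.parser and re) instead of pyparsing, handles only simple literal
assignments to location, and the names ScriptCollector and
find_js_redirects are made up for this example.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element.

    Hypothetical class for illustration; not part of HarvestMan.
    """

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


# Matches assignments such as: window.location = "url" or location.href = 'url'
REDIRECT_RE = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def find_js_redirects(html):
    """Return the URLs assigned to location inside <script> blocks."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    urls = []
    for script in collector.scripts:
        urls.extend(REDIRECT_RE.findall(script))
    return urls


if __name__ == "__main__":
    page = ('<html><head><script type="text/javascript">'
            'window.location = "http://example.com/next.html";'
            '</script></head><body></body></html>')
    print(find_js_redirects(page))
```

A crawler can add the URLs found this way to its queue, just like
ordinary links. A real parser has to cope with much more (computed
URLs, document.write output that itself contains links, and so on),
which a regular expression like the one above cannot handle.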

Since this is an alpha version, there will be bugs. Also, this
release is meant for a limited audience, so I am announcing it here only.
There is no cheeseshop package yet and no announcement on the larger
Python mailing lists (like c.l.py).

If you are interested in the program, or in web crawling in
general, do download it and give it a try. Even if you
are not interested in web crawling, I think the Hget application
would be very useful to you.

Please report bugs, preferably at

http://developer.berlios.de/bugs/?group_id=1873

or email them straight to me.

For anyone interested in development, the project is currently
hosted on the server http://svn.eiao.net . The trunk can be
checked out at http://svn.eiao.net/robacc/experimental/HarvestMan-2.0 .
Kindly note that the trunk is under development and may not be
stable. I don't yet have the notion of nightly drops etc., since this
is mostly a single-person project :)

Thanks & regards,

-- 
-Anand