ANN: HarvestMan 1.3.9
15 Jun 2004 01:55:48 -0700
HarvestMan is a multithreaded, highly customizable web crawler (offline
browser) written in Python. It features thread control, download
control using multiple rules, support for the robot exclusion protocol,
multiple 'fetch levels', URL filters, and more. HarvestMan has a
modular, object-oriented architecture.
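
For readers unfamiliar with the robot exclusion protocol mentioned above,
here is a minimal sketch of how a crawler can honour robots.txt rules. It
uses the Python 3 standard library (urllib.robotparser) and is not taken
from HarvestMan's source; the robots.txt content, the agent name
"ExampleCrawler" and the helper names are assumptions made for
illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so that the
# example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="ExampleCrawler"):
    """Return True if the given agent may fetch the url."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "http://example.com/index.html"))
print(allowed(parser, "http://example.com/private/data.html"))
```

A crawler would typically run such a check before queueing each url and
skip the ones that are disallowed.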
HarvestMan is hosted at http://harvestman.freezope.org, an interactive
Zope-based web site. The web site provides a bug tracker.
HarvestMan 1.3.9 is the latest release of HarvestMan. The following
features have been added:
1. URL and web site priorities, customizable by the user.
2. Support for HTML Tidy to clean up web pages before parsing.
   This prevents parser errors, so web sites whose HTML pages
   contain errors can still be downloaded.
3. Reusable download thread groups.
4. Mixed Intranet/Internet downloads in the same project.
5. A modified URL caching algorithm based on the last modification
   time of the URL file.
6. URL generations and priorities based on them.
7. Many bugfixes.
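
To illustrate the idea behind item 5, here is a minimal sketch of a cache
check based on last modification time. It is not HarvestMan's actual
algorithm; the function name needs_download and the assumption that both
timestamps are HTTP-date strings (as in a Last-Modified header) are
hypothetical choices made for this example.

```python
from email.utils import parsedate_to_datetime


def needs_download(cached_last_modified, server_last_modified):
    """Decide whether a url must be fetched again.

    Both arguments are HTTP-date strings or None. A missing timestamp
    on either side forces a download, since freshness cannot be checked.
    """
    if cached_last_modified is None or server_last_modified is None:
        return True
    cached = parsedate_to_datetime(cached_last_modified)
    server = parsedate_to_datetime(server_last_modified)
    # Re-download only when the server copy is strictly newer.
    return server > cached


# Hypothetical timestamps for a cached file and the server copy.
print(needs_download("Mon, 14 Jun 2004 08:00:00 GMT",
                     "Tue, 15 Jun 2004 08:55:48 GMT"))
```

With a check of this kind, files that have not changed since the last
crawl can be skipped, which saves bandwidth on repeated runs of the same
project.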
HarvestMan is free to use and is released under the Open Software
Latest source code can be obtained from
A comprehensive list of changes is at
FAQ: http://harvestman.freezope.org/faq.html .
Direct Link (for the impatient):
-Anand B Pillai