HarvestMan is a multithreaded, highly customizable web crawler (offline browser) written in Python. It features thread control, download control using multiple rules, support for the robot exclusion protocol, multiple 'fetch levels', URL filters, etc. HarvestMan has a modular, object-oriented architecture.

HarvestMan is hosted at http://harvestman.freezope.org, an interactive Zope-based web site. The website provides a bug tracker.

HarvestMan 1.3.9 is the latest release. The following features have been added:

1. URL and web site priorities, customizable by the user.
2. Support for HTML Tidy to clean up web pages and prevent parser errors, so that web sites whose HTML pages contain errors can still be downloaded.
3. Reusable download thread groups.
4. Mixed intranet/Internet downloads in the same project.
5. A modified URL caching algorithm based on the last modification time of the URL file.
6. URL generations, and priorities based on them.
7. Many bugfixes.

HarvestMan is free to use and is released under the Open Software License.

Latest source code: http://harvestman.freezope.org/download.html
Comprehensive list of changes: http://harvestman.freezope.org/files/Changelog.txt
FAQ: http://harvestman.freezope.org/faq.html
Direct link (for the impatient): http://harvestman.freezope.org/files/download/HarvestMan-1.3.9.tar.gz

Thank you!

-Anand B Pillai