Fredrik, if you would like to help move this all forward, great; I would appreciate the help. You could write a page scraper to get the data out of SF.
challenge accepted ;-)
http://effbot.python-hosting.com/browser/stuff/sandbox/sourceforge/
contains three basic tools: getindex, which grabs index information from a Python tracker; getpages, which fetches "raw" XHTML versions of the item pages; and getfiles, which downloads the attached files.
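a rough sketch of what the fetching side of such a scraper might look like (this is not the actual getindex/getpages code; the URL layout, parameter names, and the group_id 5470 for the Python project are assumptions based on the old SF tracker interface):

```python
# sketch of a SF tracker scraper: build paged index URLs and per-item
# detail URLs, then fetch them one by one.  the query parameters here
# (group_id, atid, offset, limit, func, aid) are assumed, not verified.
import urllib.request

SF_BASE = "https://sourceforge.net/tracker/"

def index_url(group_id, atid, offset=0, limit=50):
    """build a paged index URL for one tracker (atid) in one project (group_id)."""
    return (f"{SF_BASE}?group_id={group_id}&atid={atid}"
            f"&offset={offset}&limit={limit}")

def item_url(group_id, atid, item_id):
    """build the URL for a single tracker item's detail page."""
    return f"{SF_BASE}?func=detail&group_id={group_id}&atid={atid}&aid={item_id}"

def fetch(url):
    """fetch a page; real scraping code should throttle and retry
    so the server doesn't ban you."""
    with urllib.request.urlopen(url) as f:
        return f.read()
```

the point of splitting index/page/file fetching into separate passes, as the tools above do, is that each pass can be re-run independently when the dataset needs updating.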
I'm currently downloading a tracker snapshot that could be useful for testing; it'll take a few more hours before all the data is downloaded (provided that SF doesn't ban me, and I don't stumble upon more cases where a certain rhettinger has pasted binary gunk into an iso-8859-1 form ;-).
alright, it took my poor computer nearly eight hours to grab all the data, and some tracker items needed special treatment to work around some interesting SF bugs, but I've finally managed to download *all* items available via the SF tracker index, and *all* data files available via the item pages:

    tracker-105470 (bugs)               6682 items   6682 pages (100%)   1912 files
    tracker-305470 (patches)            3610 items   3610 pages (100%)   4663 files
    tracker-355470 (feature requests)    430 items    430 pages (100%)     80 files

the complete data set is about 300 megabytes uncompressed, and ~85 megabytes zipped. the scripts are designed to make it easy to update the dataset; adding new items and files only takes a couple of minutes; refreshing the item information may take a few hours.

I've also added a basic "extract" module which parses the XHTML pages and the data files. this module can be used by import scripts, or be used to convert the dataset into other formats (e.g. a single XML file) for further processing.

the source code is available via the above link; I'll post the ZIP file somewhere tomorrow (drop me a line if you want the URL). </F>
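one nice property of saving the pages as well-formed XHTML is that an extract module can walk them with a plain XML parser instead of a tag-soup parser. a minimal sketch of that idea (the definition-list field layout below is invented for illustration; the real SF page structure may differ):

```python
# sketch of XHTML field extraction: parse a saved page with the standard
# XML parser and collect label/value pairs from <dl> definition lists.
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"  # namespace prefix for tag lookups

def extract_fields(page_text):
    """pull (label, value) pairs out of definition lists in an XHTML page."""
    root = ET.fromstring(page_text)
    fields = {}
    for dl in root.iter(XHTML + "dl"):
        labels = [dt.text for dt in dl.iter(XHTML + "dt")]
        values = [dd.text for dd in dl.iter(XHTML + "dd")]
        fields.update(zip(labels, values))
    return fields

# toy example of a saved item page fragment
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<dl><dt>status</dt><dd>open</dd></dl>"
    "</body></html>"
)
print(extract_fields(sample))  # prints {'status': 'open'}
```

the same parsed structure can then feed import scripts directly, or be serialized back out as a single XML file for further processing.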