[Python-Dev] I'm not getting email from SF when assigned a bug/patch

Fredrik Lundh fredrik at pythonware.com
Mon Apr 3 00:28:29 CEST 2006


> > Fredrik, if you would like to help move this all forward, great; I
> > would appreciate the help.  You can write a page scraper to get the
> > data out of SF
>
> challenge accepted ;-)
>
> http://effbot.python-hosting.com/browser/stuff/sandbox/sourceforge/
>
> contains three basic tools; getindex to grab index information from a
> python tracker, getpages to get "raw" xhtml versions of the item pages,
> and getfiles to get attached files.
>
> I'm currently downloading a tracker snapshot that could be useful for
> testing; it'll take a few more hours before all data are downloaded
> (provided that SF doesn't ban me, and I don't stumble upon more
> cases where a certain rhettinger has pasted binary gunk into an
> iso-8859-1 form ;-).

alright, it took my poor computer nearly eight hours to grab all the
data, and some tracker items needed special treatment to work around
some interesting SF bugs, but I've finally managed to download *all*
items available via the SF tracker index, and *all* data files available
via the item pages:

    tracker-105470 (bugs)
        6682 items
        6682 pages (100%)
        1912 files
    tracker-305470 (patches)
        3610 items
        3610 pages (100%)
        4663 files
    tracker-355470 (feature requests)
        430 items
        430 pages (100%)
        80 files

the complete data set is about 300 megabytes uncompressed, and ~85
megabytes zipped.
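
for reference, the per-tracker counts above can be tallied with a few lines of python. this is only a sketch: the numbers are copied from the summary, and the dictionary layout and variable names are illustrative assumptions, not part of the sandbox scripts.

```python
# counts copied from the summary above: (items, pages, files)
trackers = {
    "tracker-105470 (bugs)": (6682, 6682, 1912),
    "tracker-305470 (patches)": (3610, 3610, 4663),
    "tracker-355470 (feature requests)": (430, 430, 80),
}

total_items = sum(items for items, _, _ in trackers.values())
total_pages = sum(pages for _, pages, _ in trackers.values())
total_files = sum(files for _, _, files in trackers.values())

# approximate sizes in megabytes, as quoted in the text
zip_ratio = 85 / 300

print(total_items, "items,", total_pages, "pages,", total_files, "files")
print("zipped size is roughly %.0f%% of the uncompressed size" % (zip_ratio * 100))
```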

the scripts are designed to make it easy to update the dataset; adding
new items and files only takes a couple of minutes; refreshing the item
information may take a few hours.
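
one plausible reason for the difference in cost, sketched below: adding new items only needs to fetch ids that are in the index but not yet on disk, while a refresh has to fetch every item page again. the function name and arguments are hypothetical and chosen for illustration; the actual scripts may work differently.

```python
def plan_update(index_ids, downloaded_ids, refresh=False):
    """Return the sorted list of item ids that need to be fetched.

    index_ids      -- ids currently listed in the tracker index
    downloaded_ids -- ids already present in the local dataset
    refresh        -- if true, refetch everything listed in the index
    """
    index_ids = set(index_ids)
    if refresh:
        # full refresh: every item page is downloaded again
        return sorted(index_ids)
    # incremental update: only items we have not seen before
    return sorted(index_ids - set(downloaded_ids))
```

with an index of `[1, 2, 3]` and items `[1, 2]` already on disk, an incremental update fetches only item 3, while `refresh=True` fetches all three.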

:::

I've also added a basic "extract" module which parses the XHTML
pages and the data files.  this module can be used by import scripts,
or be used to convert the dataset into other formats (e.g. a single
XML file) for further processing.
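
as an illustration of the "single XML file" idea, here is a minimal sketch using the standard library's xml.etree.ElementTree. it does not use the extract module's real interface: the item dictionaries (with "id", "summary" and "files" keys), the element names and the items_to_xml function are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def items_to_xml(tracker_id, items):
    """Serialize a list of item dictionaries into one XML string.

    Each item is assumed to look like
    {"id": 123, "summary": "...", "files": ["name", ...]};
    this layout is hypothetical.
    """
    root = ET.Element("tracker", id=str(tracker_id))
    for item in items:
        node = ET.SubElement(root, "item", id=str(item["id"]))
        ET.SubElement(node, "summary").text = item["summary"]
        # attached files are optional
        for name in item.get("files", []):
            ET.SubElement(node, "file", name=name)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    sample = [{"id": 1, "summary": "example item", "files": ["patch.diff"]}]
    print(items_to_xml(105470, sample))
```

ElementTree escapes markup characters in text and attribute values when serializing, so summaries containing "<" or "&" survive the round trip.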

the source code is available via the above link; I'll post the ZIP file
somewhere tomorrow (drop me a line if you want the URL).

</F>




