[Tracker-discuss] Import work begun.

Erik Forsberg forsberg at efod.se
Tue Nov 7 20:51:17 CET 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Barry Warsaw <barry at python.org> writes:

> On Nov 7, 2006, at 11:51 AM, Erik Forsberg wrote:
>
>> Now working on getting the importer to work again. It's been a while
>> since I ran it, and there's been a new release of the sourceforge
>> tools from effbot due to changes on the sourceforge web site.
>
> Let me know how that goes.  I've been using r428 of /F's stuff and
> ran into several problems getting a clean export of the Mailman
> trackers.  I contacted Fredrik about that but I think he and I have
> both been too busy.

Well, it didn't work for me either: it failed to find both the
description and the comments. Here's a patch to fix that:

- --snip--
Index: extract.py
===================================================================
- --- extract.py	(revision 428)
+++ extract.py	(working copy)
@@ -95,13 +95,13 @@
     table = elem.find("table")
 
     # locate the description
- -    for tr in table:
+    for tr in table[1:]:
         if len(tr) == 1 and tr[0].get("colspan") == "2":
             # map <br> to newlines
             for br in tr.findall(".//br"):
                 br.text = chr(0) # temporarily use NULL as line terminator
- -                if br.tail and br.tail.startswith("\n"):
- -                    br.tail = br.tail[1:] # trip extra newlines
+                if br.tail and br.tail.startswith("\r\n"):
+                    br.tail = br.tail[2:] # trip extra newlines
             text = gettext(tr)
             if text.startswith("\n\n\t\t\t"):
                 text = text[5:]
@@ -128,7 +128,7 @@
         elif td and td[0].tag == "h3":
             key = gettext(td[0]).strip()
             if key == "Followups:":
- -                for i, e in enumerate(td.findall("table/tr/td")):
+                for i, e in enumerate(td.findall("p/table/tr/td")):
                     if i:
                         data = getcomment(e)
                         result.setdefault("comments", []).append(data)
- --snap--

I'm not sure this solves all possible scraping trouble, but at least
it's a start.
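
To illustrate two of the changes in the patch (the "\r\n" after <br> and the
extra <p> around the followups table), here is a minimal sketch. It assumes
Python 3 and the standard xml.etree.ElementTree module rather than whatever
the effbot tools actually import. The helper names strip_newline_after_br and
followup_cells and the sample markup are hypothetical, and accepting both the
old and the new page layout is an addition for illustration, not something
the patch itself does.

```python
import xml.etree.ElementTree as ET


def strip_newline_after_br(elem):
    """Remove one line break (CRLF or LF) directly after each <br>."""
    for br in elem.findall(".//br"):
        tail = br.tail or ""
        if tail.startswith("\r\n"):
            br.tail = tail[2:]
        elif tail.startswith("\n"):
            br.tail = tail[1:]


def followup_cells(td):
    """Return follow-up cells, trying the <p>-wrapped layout first."""
    cells = td.findall("p/table/tr/td") or td.findall("table/tr/td")
    return cells[1:]  # skip the first cell, as "if i:" does in the patch


# Hypothetical markup, built by hand so that the "\r\n" survives
# (an XML parser would normalise it to "\n" while reading).
row = ET.Element("tr")
cell = ET.SubElement(row, "td", colspan="2")
cell.text = "first line"
br = ET.SubElement(cell, "br")
br.tail = "\r\nsecond line"
strip_newline_after_br(row)

# The same followups cell in the assumed new and old page layouts.
new_layout = ET.fromstring(
    "<td><h3>Followups:</h3><p><table>"
    "<tr><td>header</td></tr><tr><td>a comment</td></tr>"
    "</table></p></td>"
)
old_layout = ET.fromstring(
    "<td><h3>Followups:</h3><table>"
    "<tr><td>header</td></tr><tr><td>a comment</td></tr>"
    "</table></td>"
)
```

Checking for "\r\n" before "\n" means a page served with either line ending
is handled, and trying both paths keeps the scraper working if only some
pages carry the extra <p> wrapper.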

Cc'ing Fredrik so he can update his repo. And please spell my surname
correctly in the commit message this time ;-).

Cheers,
\EF
- -- 
Erik Forsberg                 http://efod.se
GPG/PGP Key: 1024D/0BAC89D9
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8+ <http://mailcrypt.sourceforge.net/>

iD8DBQFFUOO1rJurFAusidkRAqhxAKCRHDFxLnj2a6rncWjHpkG3nsIbNQCgiCRF
EIiB5y3i8iWebF9WomI9KAA=
=GguG
-----END PGP SIGNATURE-----