[Python-Dev] Accessing mailing list archives

Cameron Simpson cs at cskk.id.au
Tue Jul 31 20:46:39 EDT 2018


On 30Jul2018 13:40, Bob Purvy <bpurvy at gmail.com> wrote:
>I've been trying to figure out how to access the archives programmatically.
>I'm sure this is easy once you know, but googling various things hasn't
>worked.  What I want to do is graph the number of messages about PEP 572 by
>time.  (or has someone already done that?)
>
>I installed GNU Mailman, and downloaded the gzip'ed archives for a number
>of months and unzipped them, and I suspect that there's some way to get
>them all into a single database, but it hasn't jumped out at me.  If I
>count the "Message-ID" lines, the "Subject:" lines, and the "\nFrom " lines
>in one of those text files, I get slightly different numbers for each.
>
>Alternatively, they're maybe *already* in a database, and I just need API
>access to do the querying?  Can someone help me out?

Like Victor, I download mailing list archives. Between importing those and
subscribing to the list, I should end up with a complete history in my
"python" mail folder. I do the same for other lists.

The Mailman archives are gzip-compressed UNIX mbox files with a bit of header
munging (to make address harvesting harder). You can concatenate them,
uncompress them, and reverse the munging like this:

  cat *.gz | gunzip | fix-mail-dates --mbox | un-at-

where fix-mail-dates is here:

  https://bitbucket.org/cameron_simpson/css/src/tip/bin/fix-mail-dates

and un-at- is here:

  https://bitbucket.org/cameron_simpson/css/src/tip/bin/un-at-

and the output is a nice UNIX mbox file.
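
If you would rather not fetch those scripts, the same idea can be sketched in
plain Python. This is not a port of fix-mail-dates or un-at-; it only
concatenates the .gz files and rewrites "user at host" to "user@host" on
lines beginning with "From " or "From:", which is my assumption about the
munged form. The names unmunge_line and combine_archives are made up for this
example, and it does no date fixing.

```python
import gzip
import re
from pathlib import Path

# Assumed munged form: "From user at host ..." and "From: user at host ...".
# A body line that happens to look the same would be rewritten too.
_MUNGED = re.compile(r"^(From:? +\S+) at (\S+)")


def unmunge_line(line: str) -> str:
    """Rewrite 'From[:] user at host' as 'From[:] user@host'."""
    return _MUNGED.sub(r"\1@\2", line)


def combine_archives(src_dir: str, dest: str) -> int:
    """Concatenate every *.gz in src_dir into one mbox; return lines written."""
    count = 0
    # latin-1 maps every byte to one character, so odd bytes survive the trip;
    # newline="" leaves the original line endings untouched.
    with open(dest, "w", encoding="latin-1", newline="") as out:
        # Same ordering as the shell glob: by file name, not by date.
        for path in sorted(Path(src_dir).glob("*.gz")):
            with gzip.open(path, "rt", encoding="latin-1", newline="") as f:
                for line in f:
                    out.write(unmunge_line(line))
                    count += 1
    return count
```

Calling combine_archives("archives", "python-dev.mbox") would then leave a
single mbox file to work from; both paths are placeholders.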

You can load that into most mail readers or parse it with Python's mailbox
and email modules (in the stdlib). It should be easy enough to scan such a
file and count header contents etc. Ignore the content of the "From " line
and prefer the "From:" header. (Separate messages on "From " of course, just
don't grab email addresses from it.)
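
For the PEP 572 graph, a minimal sketch along those lines might look like
this. It uses mailbox.mbox to split on the "From " lines and reads only real
headers ("Subject:" and "Date:") from each message. The function name
count_by_month and the choice to match on the subject alone are my
assumptions; messages whose Date header won't parse are skipped.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def count_by_month(mbox_path: str, needle: str = "PEP 572") -> Counter:
    """Count messages whose Subject contains needle, keyed by 'YYYY-MM'."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # Folded headers keep their line breaks; collapse the whitespace.
            subject = " ".join(str(msg.get("Subject", "")).split())
            if needle.lower() not in subject.lower():
                continue
            try:
                when = parsedate_to_datetime(str(msg.get("Date", "")))
            except (TypeError, ValueError):
                continue  # missing or unparseable Date header
            counts[when.strftime("%Y-%m")] += 1
    finally:
        box.close()
    return counts
```

Printing sorted(count_by_month("python-dev.mbox").items()) gives month/count
pairs that can go straight into whatever plotting tool you like. Counting
parsed messages this way also avoids the mismatch you get from counting raw
"Subject:" or "Message-ID" lines, which can turn up inside message bodies.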

Cheers,
Cameron Simpson <cs at cskk.id.au>

