Download pipermail archives, convert to mbox file (script)
Hi,
Sometimes before subscribing to a list I like to download the archives and convert them into an mbox file for nice threaded browsing & searching using familiar tools (for me, Mutt and mairix). Not finding an automated way to do this [1], I put together the following shell script. Simple & rough, but seems to do the job:
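#!/bin/sh
# automated retrieval of pipermail archives & conversion to mbox file
# Last edit: 2012/10/09 Tue 23:16 PDT

# list name = last path component of the archive URL (trailing slash required)
listname=$(echo "$1" | sed 's:^\(http.*\)/\([^/]*\)/$:\2:')

cd /tmp || exit 1
# quote the pattern so the shell cannot expand it before wget sees it
wget -r -l 1 -nH -A '*.txt.gz' "$1"

# with -nH, wget saves into ./pipermail/<listname>/ for a /pipermail/ URL
cd "/tmp/pipermail/$listname" || exit 1
touch "$listname.mbox"
chmod 600 "$listname.mbox"

# loop over the archives only; a bare ls would also pick up the mbox itself
for f in *.txt.gz
do
    zcat "$f" | iconv -f iso8859-15 -t utf-8 | sed 's/\(^From.*\)\ at\ /\1@/' >> "$listname.mbox"
done

rm "/tmp/pipermail/$listname"/*.gz
mutt -f "/tmp/pipermail/$listname/$listname.mbox"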
I call this script piperget, and by doing:
piperget http://example.tld/pipermail/somelistname/
the file /tmp/pipermail/somelistname/somelistname.mbox is created and opened by mutt. If I like what I see, I move the mbox file to an appropriate location in my Mail directory, subscribe to the list, and filter the list traffic into that mbox.
This could be made more robust and tweaked to better suit varying needs. Being able to specify a range of archive dates would be nice. Another thought is to have the option of leaving the last few *.txt.gz files lying around (somewhere other than in /tmp), checking against them so as to wget only new archives or archives with a newer time-stamp, then concatenating the newer messages onto the existing mbox: a sort of pseudo-subscription to a list. Repeatedly re-downloading an entire monthly/quarterly archive as it changes would be rather bandwidth-wasteful, though; better to subscribe and update the *.mbox via SMTP. Not sure if there's some rsync way to incrementally download only the parts of an archive that have changed...

Anyhow, mostly I just use this to catch up on a list at the moment of deciding whether or not to subscribe to it. Any thoughts or suggestions are welcome.
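The date-range idea above could be sketched roughly as follows. This is only an illustration, not part of the original script: it assumes monthly archives named like 2012-October.txt.gz (weekly, quarterly or yearly archives would need other patterns), and archive_key and in_range are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: pick pipermail archives that fall inside a YYYYMM date range.
# Assumption: monthly archives named YYYY-Month.txt.gz.

# archive_key NAME -> prints YYYYMM; returns 1 if NAME does not match
archive_key() {
    name=${1##*/}            # strip any directory part
    name=${name%.txt.gz}     # strip the extension
    year=${name%%-*}         # text before the first "-"
    month=${name#*-}         # text after the first "-"
    case $year in
        [0-9][0-9][0-9][0-9]) ;;
        *) return 1 ;;
    esac
    case $month in
        January) mm=01 ;;   February) mm=02 ;;  March) mm=03 ;;
        April) mm=04 ;;     May) mm=05 ;;       June) mm=06 ;;
        July) mm=07 ;;      August) mm=08 ;;    September) mm=09 ;;
        October) mm=10 ;;   November) mm=11 ;;  December) mm=12 ;;
        *) return 1 ;;
    esac
    printf '%s%s\n' "$year" "$mm"
}

# in_range NAME FROM TO -> succeeds if FROM <= key(NAME) <= TO
in_range() {
    key=$(archive_key "$1") || return 1
    [ "$key" -ge "$2" ] && [ "$key" -le "$3" ]
}

# Possible use inside the conversion loop (first half of 2012 only):
#   for f in *.txt.gz; do
#       in_range "$f" 201201 201206 && zcat "$f" >> "$listname.mbox"
#   done
```

Sorting the files by the printed key would also put them in chronological order, which a plain alphabetical sort of month names does not.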
[1] After writing this script I did find https://github.com/wesleyd/pipermail-archive-to-maildir which could be another option for those interested in the maildir format. I prefer mbox for mailing lists.
John
-- John Magolske http://B79.net/contact
John Magolske wrote:
> Sometimes before subscribing to a list I like to download the archives and convert them into an mbox file for nice threaded browsing & searching using familiar tools (for me, Mutt and mairix).
If the list's archive is public and you are not a subscriber, your script is probably fine (I didn't look in detail), but if you are willing to subscribe first, whether the archives are private or public, you can get the list's entire cumulative mbox archive with something like
wget 'http://www.example.com/mailman/private/LIST.mbox/LIST.mbox?username=U&password=P'
where LIST is the list name, U is a list member's address and P is that member's list password. This has the advantage of getting all the message's headers as processed by Mailman with the exception of those added by SMTPDirect.py (Sender: and Errors-To:), not just those few that are in the periodic .txt or .txt.gz files.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
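Since U is an email address (so it contains "@") and P may contain characters such as "&", the query string in the wget URL above generally needs percent-encoding. A minimal sketch, assuming ASCII-only credentials; urlencode and mbox_url are hypothetical helper names, and only the URL layout and the username/password parameters come from the message above.

```shell
#!/bin/sh
# Sketch: build the private-archive mbox URL with percent-encoded credentials.
# Assumption: credentials are plain ASCII.

# urlencode STRING -> prints STRING with unsafe characters as %XX
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# mbox_url BASE LIST USER PASS -> prints the download URL
mbox_url() {
    printf '%s/mailman/private/%s.mbox/%s.mbox?username=%s&password=%s\n' \
        "${1%/}" "$2" "$2" "$(urlencode "$3")" "$(urlencode "$4")"
}

# Possible use (LIST, U and P as in the message above):
#   wget -O LIST.mbox "$(mbox_url http://www.example.com LIST U P)"
```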
- Mark Sapiro <mark@msapiro.net> [121017 16:57]:
> If the list's archive is public and you are not a subscriber, your script is probably fine (I didn't look in detail), but if you are willing to subscribe first, whether the archives are private or public, you can get the list's entire cumulative mbox archive with something like
> wget 'http://www.example.com/mailman/private/LIST.mbox/LIST.mbox?username=U&password=P'
> where LIST is the list name, U is a list member's address and P is that member's list password. This has the advantage of getting all the message's headers as processed by Mailman with the exception of those added by SMTPDirect.py (Sender: and Errors-To:), not just those few that are in the periodic .txt or .txt.gz files.
Thanks, this is great for catching up on subscribed-to lists. I just used this to download the entire history of mailman-users into one 247MB mbox file. The only post-processing required involved removing the first line (which was blank) of the file.
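The post-processing step just mentioned (dropping a blank first line) can be done as a stream, so a file of that size never needs to be opened in an editor. A small sketch; strip_leading_blank is a hypothetical helper name, and the file names in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch: drop the first line of an mbox only if that line is empty.
# Reads stdin, writes stdout; all other lines pass through unchanged.
strip_leading_blank() {
    sed '1{/^$/d;}'
}

# Possible use:
#   strip_leading_blank < LIST.mbox.download > LIST.mbox
```

Because the deletion only fires when line 1 is empty, running it on an mbox that already starts with a "From " line leaves the file as it was.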
Question -- does that comprehensive mbox file exist on the server somewhere (i.e., not generated per request)? I'm wondering if it'd be possible to set up rsync to do incremental updates and mirror backups of an archive to other locations. I'm guessing rsync's delta-transfer algorithm would use roughly the same amount of bandwidth as SMTP... though it would re-write the entire mbox file at the destination with each sync.
But also, I was thinking this could be used to fill gaps in list traffic (when away from the net for extended periods and the inbox exceeds its allowed number of messages, when the mail server goes down for some reason, etc.), offering a way to sync up without re-downloading a potentially huge file. But maybe in this case a scheme for limiting the download to a certain date range, similar to how Gmane allows setting a range of message numbers in a download URL [1], would make more sense. Is there such functionality in Mailman?
[1] http://gmane.org/export.php
Regards,
John
-- John Magolske http://B79.net/contact
John Magolske wrote:
> Question -- does that comprehensive mbox file exist on the server somewhere (ie, not generated per request)?
Yes. It is archives/private/LISTNAME.mbox/LISTNAME.mbox.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
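Given that the cumulative mbox is a real file on the server, the rsync mirroring asked about earlier in the thread might look something like the sketch below. It rests on assumptions not stated in the thread: shell (ssh) access to the Mailman host, rsync 3.0.0 or later on both ends for --append-verify, and a local copy that is an unmodified prefix of the server file. The helper names are hypothetical, and the data-directory prefix in the usage comment is a placeholder for wherever the site keeps Mailman's archives/ directory.

```shell
#!/bin/sh
# Sketch: mirror a list's cumulative mbox, transferring only the new tail.
# Assumptions: ssh access to the Mailman host, rsync >= 3.0.0.

# mbox_remote_path PREFIX LIST -> path of the mbox under PREFIX
mbox_remote_path() {
    printf '%s/archives/private/%s.mbox/%s.mbox\n' "${1%/}" "$2" "$2"
}

# sync_mbox HOST PREFIX LIST DEST
# --append-verify appends the missing tail to DEST instead of rewriting
# it, then checksums the whole file and resends if the check fails.
sync_mbox() {
    rsync -t --append-verify "$1:$(mbox_remote_path "$2" "$3")" "$4"
}

# Possible use (host and prefix are placeholders):
#   sync_mbox user@lists.example.com /var/lib/mailman LIST "$HOME/Mail/LIST.mbox"
```

Appending suits an mbox because new messages are normally added at the end; if the local copy has been edited, the verification step should fail and trigger a resend.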