
Stephen J. Turnbull wrote:
> On Tue, Apr 24, 2012 at 12:31 PM, Mark Sapiro <mark@msapiro.net> wrote:
>> There already is a plain text mbox format archive at archives/private/LISTNAME.mbox/LISTNAME.mbox that contains the entire list archive.
> I forget, have headers been cleaned out? Patricia might want to nuke all the "Received" headers from the "very old" mbox file, for example, which would save considerable space. This would require splitting the current mbox file and rebuilding the archives (thus changing message sequence numbers and URLs).
No. The LISTNAME.mbox file contains the entire message. Some headers may have been removed by pipeline handlers such as Cleanse and CleanseDKIM, or manipulated or added by CookHeaders, but except in the case of anonymous lists with fairly recent Mailman, Received: headers are still there.
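For anyone who does want to drop Received: headers from a copy of the mbox, here is a minimal sketch using Python's standard mailbox module. The function name strip_headers and the file paths are made up for illustration; this is not part of Mailman, it assumes you are working on a copy of the file rather than the live LISTNAME.mbox, and you would still need to rebuild the archive afterwards.

```python
import mailbox


def strip_headers(src_path, dst_path, names=("Received",)):
    """Copy every message from src_path to a new mbox at dst_path,
    dropping all occurrences of the named headers.

    Hypothetical helper for illustration; run it on a copy of the
    list's mbox, never on the live file.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        for msg in src:
            for name in names:
                # Deleting a header removes every occurrence and is a
                # no-op if the header is not present.
                del msg[name]
            dst.add(msg)
            count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

Called as strip_headers("LISTNAME.mbox.copy", "LISTNAME.mbox.stripped"), it returns the number of messages copied.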
If you want a fairly sanitized mailbox with most non-essential headers removed, you can always concatenate the periodic .txt (or .txt.gz, but you can't directly concatenate those) files, or you can just keep those files and nuke the rest of the pipermail structure for the unwanted periods.
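A rough sketch of that concatenation in Python, assuming the default monthly archive with English month names in the file names. It prefers a period's .txt file and falls back to decompressing the .txt.gz when only that exists. The names period_files and concatenate are invented for this example.

```python
import gzip
import re
from pathlib import Path

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
PERIOD = re.compile(r"^(\d{4})-([A-Za-z]+)\.txt(\.gz)?$")


def period_files(archive_dir):
    """Return one text file per monthly period, in chronological order."""
    chosen = {}
    for path in Path(archive_dir).iterdir():
        m = PERIOD.match(path.name)
        if not m or m.group(2) not in MONTHS:
            continue
        key = (int(m.group(1)), MONTHS.index(m.group(2)))
        # Prefer the uncompressed file when both exist for a period.
        if key not in chosen or path.suffix == ".txt":
            chosen[key] = path
    return [chosen[k] for k in sorted(chosen)]


def concatenate(archive_dir, out_path):
    """Write all periodic text files into one file; return the files used."""
    used = period_files(archive_dir)
    with open(out_path, "wb") as out:
        for path in used:
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                data = src.read()
            out.write(data)
            # Keep files from running together if one lacks a final newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return used
```

Write the output somewhere outside the archive directory so it is not mistaken for part of the archive.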
E.g. a typical archive for a list with the default monthly archive may have an archives/private/LISTNAME/ directory, and in it for a typical month a yyyy-Month directory containing the HTMLized messages and monthly indices, a yyyy-Month.txt file and possibly a yyyy-Month.txt.gz file containing messages with minimal headers after scrubbing of attachments.
There will also be an archives/private/LISTNAME/attachments/ directory with subdirectories by date of the form yyyymmdd containing scrubbed attachments.
One could simply remove the unwanted archives/private/LISTNAME/yyyy-Month directories, and optionally the corresponding archives/private/LISTNAME/attachments/yyyymmdd/ directories, leaving the text files, which are already linked from the TOC page. This leaves dead links for the removed months' Pipermail archives, but that may be OK.
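As an illustration only, the removal of old yyyy-Month directories could be sketched like this. The cutoff by year, the function names, and the dry-run default are all assumptions made for the example. It touches only directories whose names look like yyyy-Month and leaves the .txt and .txt.gz files and the attachments/ directory alone.

```python
import re
import shutil
from pathlib import Path

PERIOD_DIR = re.compile(r"^(\d{4})-[A-Za-z]+$")


def old_period_dirs(archive_dir, before_year):
    """Return yyyy-Month directories whose year is earlier than before_year."""
    found = []
    for path in sorted(Path(archive_dir).iterdir()):
        m = PERIOD_DIR.match(path.name)
        if m and path.is_dir() and int(m.group(1)) < before_year:
            found.append(path)
    return found


def prune(archive_dir, before_year, dry_run=True):
    """Remove old HTML period directories; with dry_run, only report them."""
    dirs = old_period_dirs(archive_dir, before_year)
    for path in dirs:
        if dry_run:
            print("would remove", path)
        else:
            shutil.rmtree(path)
    return dirs
```

Run it first with dry_run=True and check the list before deleting anything.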
Another potential space saver applies if your lists have a lot of scrubbed attachments, the list is digestable, and its scrub_nondigest setting is the default No. In that case there will generally be two copies of each scrubbed attachment in the attachments/ directory, one scrubbed as the message was archived and one scrubbed from the digest. These have names like name.ext and name-0001.ext. The tricky part is that you don't know which of these names is referenced by the archived message and which by the plain format digest. Typically, the archived message will reference the name.ext file and the digest the name-0001.ext file, but when a digest is triggered on size, it may reference a name.ext file and the archived message the name-0001.ext file.
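To get a feel for how much space such duplicates take, a sketch like the following could list candidate pairs. It assumes the second copy always carries a four-digit suffix, as in name-0001.ext, and sits in the same directory as the first. It only reports pairs with identical content and deletes nothing, since you can't tell from the names which copy the archive references. The name duplicate_pairs is invented for this example.

```python
import filecmp
import os
import re

# Matches names such as name-0001.ext (assumed four-digit counter).
SUFFIXED = re.compile(r"^(?P<stem>.+)-\d{4}(?P<ext>\.[^.]*)?$")


def duplicate_pairs(attachments_dir):
    """Return (original, suffixed) path pairs with byte-identical content."""
    pairs = []
    for root, _dirs, files in os.walk(attachments_dir):
        names = set(files)
        for name in sorted(names):
            m = SUFFIXED.match(name)
            if not m:
                continue
            original = m.group("stem") + (m.group("ext") or "")
            if original not in names:
                continue
            first = os.path.join(root, original)
            second = os.path.join(root, name)
            # shallow=False forces a comparison of the file contents.
            if filecmp.cmp(first, second, shallow=False):
                pairs.append((first, second))
    return pairs
```

Summing os.path.getsize() over the second element of each pair gives an estimate of the space a rebuild might reclaim.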
Note that simply rebuilding the archive with bin/arch --wipe will eliminate duplicate stored attachments.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan