[Mailman-Users] Exporting archives to a text file

Mark Sapiro mark at msapiro.net
Tue Apr 24 19:20:19 CEST 2012


Stephen J. Turnbull wrote:

>On Tue, Apr 24, 2012 at 12:31 PM, Mark Sapiro <mark at msapiro.net> wrote:
>>
>> There already is a plain text mbox format archive at
>> archives/private/LISTNAME.mbox/LISTNAME.mbox that contains the entire
>> list archive.
>
>I forget, have headers been cleaned out?  Patricia might want to nuke
>all the "Received" headers from the "very old" mbox file, for example,
>which would save considerable space.  This would require splitting the
>current mbox file and rebuilding the archives (thus changing message
>sequence numbers and URLs).


No. The LISTNAME.mbox file contains each message in its entirety. Some
headers may have been removed by pipeline handlers such as Cleanse and
CleanseDKIM, or manipulated or added by CookHeaders, but except in the
case of anonymous lists with fairly recent Mailman, Received: headers
are still there.

If you want a fairly sanitized mailbox with most non-essential headers
removed, you can always concatenate the periodic .txt files (the
.txt.gz files have to be uncompressed before they can be combined into
plain text), or you can just keep those files and nuke the rest of the
pipermail structure for the unwanted periods.
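
The concatenation described above could be sketched in Python roughly
as follows. This is only an illustration, not something shipped with
Mailman: the function names are made up, and it assumes monthly
archives with English yyyy-Month names. When both a .txt and a .txt.gz
exist for a month, it uses the .txt and ignores the .gz.

```python
import gzip
import re
from pathlib import Path

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# Matches yyyy-Month.txt and yyyy-Month.txt.gz (assumed naming).
PERIOD_RE = re.compile(r"^(\d{4})-([A-Z][a-z]+)\.txt(\.gz)?$")


def period_files(list_dir):
    """Return one periodic text file per month, oldest first."""
    chosen = {}
    for path in Path(list_dir).iterdir():
        m = PERIOD_RE.match(path.name)
        if not m or m.group(2) not in MONTHS or not path.is_file():
            continue
        key = (int(m.group(1)), MONTHS.index(m.group(2)))
        # Prefer the uncompressed file when both variants exist.
        if key not in chosen or path.suffix == ".txt":
            chosen[key] = path
    return [chosen[k] for k in sorted(chosen)]


def concatenate(list_dir, out_path):
    """Write all periodic text files into one plain text file."""
    with open(out_path, "wb") as out:
        for path in period_files(list_dir):
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                data = src.read()
            out.write(data)
            # Keep the next file's first line on a line of its own.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
```

It would be called with something like
concatenate("archives/private/LISTNAME", "LISTNAME-sanitized.txt"),
where the output name is arbitrary.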

E.g., a typical archive for a list with the default monthly archive may
have an archives/private/LISTNAME/ directory. In it, for a typical
month, there is a yyyy-Month directory containing the HTMLized messages
and monthly indices, plus a yyyy-Month.txt file and possibly a
yyyy-Month.txt.gz file containing the messages with minimal headers
after scrubbing of attachments.

There will also be an archives/private/LISTNAME/attachments/ directory
with subdirectories by date of the form yyyymmdd containing scrubbed
attachments.

One could simply remove the unwanted
archives/private/LISTNAME/yyyy-Month directories, and possibly also
the archives/private/LISTNAME/attachments/yyyymmdd/ directories,
leaving the text files which are already linked from the TOC page.
This leaves dead links for the removed months' Pipermail archives, but
that may be OK.
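
As a rough sketch of that cleanup, the following Python lists (and
optionally removes) the yyyy-Month directories older than a cutoff
year while leaving the yyyy-Month.txt files alone. The function names
and the before_year cutoff are invented for illustration, and it
assumes monthly archives with English month names. It defaults to a
dry run, since removal is not reversible.

```python
import re
import shutil
from pathlib import Path

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
# Matches only the per-month HTML directories, not the .txt files.
PERIOD_DIR_RE = re.compile(r"^(\d{4})-([A-Z][a-z]+)$")


def old_period_dirs(list_dir, before_year):
    """Return yyyy-Month directories whose year is before before_year."""
    found = []
    for path in sorted(Path(list_dir).iterdir()):
        m = PERIOD_DIR_RE.match(path.name)
        if (m and m.group(2) in MONTHS and path.is_dir()
                and int(m.group(1)) < before_year):
            found.append(path)
    return found


def prune(list_dir, before_year, dry_run=True):
    """Remove old period directories; with dry_run, only report them."""
    names = []
    for path in old_period_dirs(list_dir, before_year):
        if not dry_run:
            shutil.rmtree(path)
        names.append(path.name)
    return names
```

Running prune("archives/private/LISTNAME", 2010) first and checking
the returned names before repeating it with dry_run=False is the
safer order.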

Another potential space saver applies if your lists have a lot of
scrubbed attachments. If the list is digestable and its
scrub_nondigest setting is the default No, there will generally be two
copies of each scrubbed attachment in the attachments/ directory, one
scrubbed as the message was archived and one scrubbed from the digest.
These have names like name.ext and name-0001.ext. The tricky part is
you don't know which of these names is referenced by the archived
message and which is referenced in the plain format digest. Typically,
the archived message will reference the name.ext file and the digest
the name-0001.ext file, but when a digest is triggered on size, the
digest may reference the name.ext file and the archived message the
name-0001.ext file.
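
To get a feel for how much space such duplicates take, something like
the following Python could report name.ext / name-0001.ext pairs with
identical content. It is a sketch under assumptions: the function
names are invented, it walks the attachments/ tree recursively rather
than relying on an exact layout, and it only reports pairs. It does
not delete anything, because it cannot tell which name the archived
message references.

```python
import hashlib
import re
from pathlib import Path

# A stem ending in a dash and four digits, e.g. "name-0001".
NUMBERED_RE = re.compile(r"^(?P<stem>.+)-\d{4}$")


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_pairs(attachments_dir):
    """Return (name.ext, name-NNNN.ext) pairs with identical content."""
    pairs = []
    for path in sorted(Path(attachments_dir).rglob("*")):
        if not path.is_file():
            continue
        m = NUMBERED_RE.match(path.stem)
        if not m:
            continue
        # The unnumbered sibling in the same directory, if any.
        original = path.with_name(m.group("stem") + path.suffix)
        if original.is_file() and file_digest(original) == file_digest(path):
            pairs.append((original, path))
    return pairs
```

Summing path.stat().st_size over the second element of each pair
gives an estimate of the reclaimable space.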

Note that simply rebuilding the archive with bin/arch --wipe will
eliminate duplicate stored attachments.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


