Exporting archives to a text file
Good morning,
I am using Mailman version 2.1.9 with Namazu version 2.0.15 as my search engine. It is running on a Linux server running RedHat 4.0.
This system has been in place for quite a number of years. I am running out of space and don't have the option to upgrade my hardware. I would like to move the very old archives off of the server and into a text file, or perhaps a pdf, to clear space on my server. I can not just delete the older archives. The clients that use the mailman lists want to be able to see/read the older archives, for reference purposes; hence the need for a text or pdf file.
I have been through the FAQ and searchable archives. I saw a lot of help for exporting the member/subscriber lists to a text file, but not much about the older archives themselves.
Can someone tell me whether this is possible and, if so, how? Thanks much.
PATI MOSS CSC
On Mon, Apr 23, 2012 at 9:43 AM, Patricia A Moss <pmoss4@csc.com> wrote:
Good morning,
I am using Mailman version 2.1.9 with Namazu version 2.0.15 as my search engine. It is running on a Linux server running RedHat 4.0.
This system has been in place for quite a number of years. I am running out of space and don't have the option to upgrade my hardware. I would like to move the very old archives off of the server and into a text file, or perhaps a pdf, to clear space on my server. I can not just delete the older archives. The clients that use the mailman lists want to be able to see/read the older archives, for reference purposes; hence the need for a text or pdf file.
Where would you store these PDF or text files so that clients can read them? If they are accessible on that server wouldn't they take up about as much room as they do now? (PDF might take up more room.)
I'm not the best person to answer this, but maybe you should look at other ways to save space on the server (e.g., indexes). Another option might be compressing your mailbox files. (Maybe you can even enable compression on an entire partition?)
Again, I'm not an expert on this, but my approach would be to leave the archives in place since clients require access to them and find a way to compress the files or delete *other* files that clients don't require access to. Hope that helps.
Patricia A Moss wrote:
This system has been in place for quite a number of years. I am running out of space and don't have the option to upgrade my hardware. I would like to move the very old archives off of the server and into a text file, or perhaps a pdf, to clear space on my server.
There already is a plain text mbox format archive at archives/private/LISTNAME.mbox/LISTNAME.mbox that contains the entire list archive.
Let's say this contains 10,000 messages for SOMELIST and you would like to keep the Pipermail archive for only the more recent half. You have two choices. If you want to preserve the URLs for the messages you are keeping in the Pipermail archive, simply delete the older archives/private/SOMELIST/xxx periodic directories. The drawback of this approach is that the deleted entries will still be listed in the TOC, and they will return even if you edit the TOC and remove them.
If you don't care about preserving URLs, you could instead do
bin/arch --wipe --start=5000 SOMELIST
to rebuild the Pipermail archive skipping the first 5000 messages.
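For anyone wondering how to pick the number for --start: here is a minimal Python sketch, not part of Mailman, that scans the mbox and reports the 0-based position of the first message dated on or after a cutoff. The function name and the cutoff are made up for illustration, and it assumes the messages carry usable Date: headers and that bin/arch counts messages in mbox order starting from 0. Check the result against the archive before wiping anything.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def first_index_on_or_after(mbox_path, cutoff):
    """Return the 0-based index of the first message dated on or after cutoff.

    Returns None if no message qualifies. Messages whose Date: header is
    missing or unparseable are skipped rather than guessed at.
    (Hypothetical helper, not a Mailman API.)
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            raw = message["Date"]
            if raw is None:
                continue
            try:
                sent = parsedate_to_datetime(str(raw))
            except (TypeError, ValueError):
                continue
            # A date without a usable zone comes back naive; treat it as UTC.
            if sent.tzinfo is None:
                sent = sent.replace(tzinfo=timezone.utc)
            if sent >= cutoff:
                return index
    finally:
        box.close()
    return None
```

Called as first_index_on_or_after("archives/private/SOMELIST.mbox/SOMELIST.mbox", datetime(2008, 1, 1, tzinfo=timezone.utc)), it returns a candidate value for --start. Because mbox order follows arrival rather than the Date: header, a message with a badly wrong date can throw the answer off, so treat the number as a starting point.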
I can not just delete the older archives. The clients that use the mailman lists want to be able to see/read the older archives, for reference purposes; hence the need for a text or pdf file.
The file is there. It can be accessed via the private archive URL like
http://www.example.com/mailman/private/LISTNAME.mbox/LISTNAME.mbox
whether the archive is private or public, but this requires authentication as a list member. You can make it accessible for a public (or private) archive via a link on the TOC by putting
PUBLIC_MBOX = Yes
in mm_cfg.py.
I don't know how any of this will impact namazu. Of course any search engine could find older messages in the LISTNAME.mbox file, but you already know all messages are there. The users could just visit the file in their browsers and use their browser search capabilities to find what they're looking for.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California. The highway is for gamblers, better use your sense - B. Dylan
On Tue, Apr 24, 2012 at 12:31 PM, Mark Sapiro <mark@msapiro.net> wrote:
Patricia A Moss wrote:
I would like to move the very old archives off of the server and into a text file, or perhaps a pdf, to clear space on my server.
There already is a plain text mbox format archive at archives/private/LISTNAME.mbox/LISTNAME.mbox that contains the entire list archive.
I forget, have headers been cleaned out? Patricia might want to nuke all the "Received" headers from the "very old" mbox file, for example, which would save considerable space. This would require splitting the current mbox file and rebuilding the archives (thus changing message sequence numbers and URLs).
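A rough sketch of the header-stripping idea, using only Python's standard mailbox module rather than anything from Mailman. The function name is invented for illustration. It writes a new mbox and leaves the original alone; it assumes the source parses cleanly as mbox, and re-serializing may change details such as "From " line escaping, so compare message counts before replacing anything.

```python
import mailbox


def copy_without_received(src_path, dst_path):
    """Copy every message from src_path to dst_path, dropping Received: headers.

    Returns the number of messages copied. If dst_path already exists,
    messages are appended to it. (Hypothetical helper, not a Mailman API.)
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path, create=True)
    copied = 0
    try:
        for message in src:
            # Deleting a header removes every occurrence and is a no-op
            # when the header is absent.
            del message["Received"]
            dst.add(message)
            copied += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return copied
```

The space saved depends on how many relay hops each message went through, so it is worth trying this on a copy of one list's mbox first and comparing file sizes.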
I don't know how any of this will impact namazu. Of course any search engine could find older messages in the LISTNAME.mbox file, but you already know all messages are there.
I think Namazu can probably be configured to look in the mbox file, and this would be worth doing. The search instructions should remark that the mbox file is a last resort (since if you just keep the original mbox as you suggest, it will hit every time if anything hits!)
The users could just visit the file in their browsers and use their browser search capabilities to find what they're looking for.
Yes, they would have to do that, but they might prefer to investigate other, more precise, hits first, as well as being informed that "that key doesn't exist AFAIK" on a failed search.
Stephen J. Turnbull wrote:
On Tue, Apr 24, 2012 at 12:31 PM, Mark Sapiro <mark@msapiro.net> wrote:
There already is a plain text mbox format archive at archives/private/LISTNAME.mbox/LISTNAME.mbox that contains the entire list archive.
I forget, have headers been cleaned out? Patricia might want to nuke all the "Received" headers from the "very old" mbox file, for example, which would save considerable space. This would require splitting the current mbox file and rebuilding the archives (thus changing message sequence numbers and URLs).
No. The LISTNAME.mbox file contains the entire message. Some headers may have been removed by pipeline handlers such as Cleanse and CleanseDKIM, or manipulated or added by CookHeaders, but except in the case of anonymous lists with fairly recent Mailman, Received: headers are still there.
If you want a fairly sanitized mailbox with most non-essential headers removed, you can always concatenate the periodic .txt (or .txt.gz, but you can't directly concatenate those) files, or you can just keep those files and nuke the rest of the pipermail structure for the unwanted periods.
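Concatenating the periodic text files could look something like the Python sketch below. It is only an illustration with an invented function name: it takes the period names from the caller in the order wanted (names like 2011-January do not sort chronologically), prefers NAME.txt when present, and otherwise decompresses NAME.txt.gz on the fly, which gets around not being able to concatenate the .gz files directly.

```python
import gzip
import shutil
from pathlib import Path


def concatenate_periodic_text(archive_dir, period_names, output_path):
    """Append each period's text archive to output_path, in the order given.

    For each name, NAME.txt is used if it exists; otherwise NAME.txt.gz is
    decompressed while copying. Periods with neither file are skipped.
    Returns the list of source files actually used.
    (Hypothetical helper, not a Mailman API.)
    """
    archive_dir = Path(archive_dir)
    used = []
    with open(output_path, "wb") as out:
        for name in period_names:
            plain = archive_dir / (name + ".txt")
            zipped = archive_dir / (name + ".txt.gz")
            if plain.is_file():
                with open(plain, "rb") as src:
                    shutil.copyfileobj(src, out)
                used.append(plain)
            elif zipped.is_file():
                with gzip.open(zipped, "rb") as src:
                    shutil.copyfileobj(src, out)
                used.append(zipped)
    return used
```

For example, concatenate_periodic_text("archives/private/LISTNAME", ["2005-January", "2005-February"], "LISTNAME-2005.txt") would produce one text file for those two months, which can then be stored or indexed wherever is convenient.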
E.g., a typical archive for a list with the default monthly archive may have an archives/private/LISTNAME/ directory containing, for a typical month, a yyyy-Month directory with the HTMLized messages and monthly indices, a yyyy-Month.txt file, and possibly a yyyy-Month.txt.gz file. The text files contain the messages with minimal headers after scrubbing of attachments.
There will also be an archives/private/LISTNAME/attachments/ directory with subdirectories by date of the form yyyymmdd containing scrubbed attachments.
One could simply remove the unwanted archives/private/LISTNAME/yyyy-Month directories and maybe or maybe not the archives/private/LISTNAME/attachments/yyyymmdd/ directories leaving the text files which are already linked from the TOC page. This leaves dead links for the removed Months' Pipermail archives, but that may be OK.
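Before removing anything, it may help to list which period directories would go. The Python sketch below is a dry run only and deletes nothing. It assumes the default monthly layout described above, with English month names in the yyyy-Month directory names; the function name and the cutoff arguments are invented for illustration.

```python
import re
from pathlib import Path

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Matches directory names such as 2011-January (assumed monthly layout).
PERIOD_RE = re.compile(r"^(\d{4})-(" + "|".join(MONTHS) + r")$")


def old_period_dirs(archive_dir, before_year, before_month):
    """Return yyyy-Month directories strictly older than the cutoff, oldest first.

    Only directories are returned, so the yyyy-Month.txt files and the
    attachments/ directory are never listed.
    (Hypothetical helper, not a Mailman API.)
    """
    found = []
    for entry in Path(archive_dir).iterdir():
        match = PERIOD_RE.match(entry.name)
        if not match or not entry.is_dir():
            continue
        key = (int(match.group(1)), MONTHS.index(match.group(2)) + 1)
        if key < (before_year, before_month):
            found.append((key, entry))
    found.sort(key=lambda item: item[0])
    return [entry for _, entry in found]
```

Printing the result of old_period_dirs("archives/private/LISTNAME", 2008, 1) shows every monthly directory from before January 2008, which can then be reviewed and removed by hand.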
Another potential space saver applies if your lists have a lot of scrubbed attachments, the list is digestable, and its scrub_nondigest setting is the default No. In that case there will generally be two copies of each scrubbed attachment in the attachments/ directory, one scrubbed as the message was archived and one scrubbed from the digest. These have names like name.ext and name-0001.ext. The tricky part is that you don't know which of these names is referenced by the archived message and which was referenced in the plain format digest. Typically, the archived message will reference the name.ext file and the digest the name-0001.ext file, but when a digest is triggered on size, it may reference a name.ext file and the archived message the name-0001.ext file.
Note that simply rebuilding the archive with bin/arch --wipe will eliminate duplicate stored attachments.
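To get a feel for how much space the duplicates take before deciding on a rebuild, something like the following Python sketch could report them. It does not rely on the name.ext / name-0001.ext naming at all; instead it assumes the two copies are byte-identical, which the thread does not state, and groups files in the same directory by content hash. The function name is invented, and nothing is deleted, since there is no safe way to tell from the names alone which copy the archived message links to.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_groups(attachments_dir):
    """Return lists of files that share a directory and have identical content.

    Each returned list has at least two paths. Files are read whole, which
    is fine for a sketch but may be slow for very large attachments.
    (Hypothetical helper, not a Mailman API.)
    """
    by_key = defaultdict(list)
    for path in Path(attachments_dir).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_key[(path.parent, digest)].append(path)
    return [sorted(paths) for paths in by_key.values() if len(paths) > 1]
```

Summing the sizes of all but one file in each group gives a rough upper bound on what a rebuild with bin/arch --wipe might reclaim from attachments.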
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California. The highway is for gamblers, better use your sense - B. Dylan
On Wed, Apr 25, 2012 at 2:20 AM, Mark Sapiro <mark@msapiro.net> wrote:
Stephen J. Turnbull wrote:
I forget, have headers been cleaned out?
No, The LISTNAME.mbox file contains the entire message. Some headers may have been removed by pipeline handlers [...], but [...] Received: headers are still there.
If you want a fairly sanitized mailbox with most non-essential headers removed, you can always concatenate the periodic .txt (or .txt.gz, but you can't directly concatenate those) files, or you can just keep those files and nuke the rest of the pipermail structure for the unwanted periods.
"Just keep those files" and configure Namazu to index them sounds like the best idea yet. Although they'll cut the thread (but ISTR the pipermail archive does too), that localizes search results (and uses bandwidth!) better than using the whole mbox.
participants (4): David, Mark Sapiro, Patricia A Moss, Stephen J. Turnbull