
This is not a request for help but a report of experience in case someone else finds it helpful.
I recently migrated some old mailing lists into Mailman. They had previously run on different software (my own), and at first I assumed I'd need to keep two sets of archives, putting the old ones on my regular website (not the "lists." subdomain created by Mailman).
Then I saw in the FAQ that it was possible to edit list archives. The emphasis there was on deleting posts, but I thought, if this works for deleting posts it should also work for adding them.
Fortunately my old archives were already in mbox format. Or rather, almost in mbox format. The old incarnation of my lists had been on a server where I had a low usage quota, so I had been downloading all archives over a year old and storing them on my home computer. In doing so, I had passed them through a word processor macro to do some minimal cleanup, which was chiefly to remove the ">" that mbox files put in front of body lines beginning with "From " ("From the historian's viewpoint," one subscriber wrote).
Undoing that change was easy enough, but what I didn't notice was that word wrap had gotten imposed on some very long header lines (such as "DomainKey-Signature:"). This damaged the headers and made them appear to end sooner, with some of their data falling through into the message body.
Usually, when this happened, the "Date:" line would be in the part that fell through. Mailman seems to rely on this line when sorting posts by date (it does _not_ rely on the physical order of messages in the mbox file). In the absence of a "Date:" line in the header, Mailman seems to use the current time (when it is indexing the archive).
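A quick way to spot affected messages before indexing is to list those whose header block has no "Date:" field. The following is only a sketch using Python's standard mailbox module; the function name is made up for illustration, and it assumes the file is otherwise a readable mbox.

```python
import mailbox

def messages_missing_date(mbox_path):
    """Return (index, subject) for each message with no Date: header field."""
    missing = []
    # create=False: do not silently create an empty file if the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            # A Date: line that fell through into the body is invisible here,
            # which is exactly the damage we want to detect.
            if message["Date"] is None:
                missing.append((index, message["Subject"]))
    finally:
        box.close()
    return missing
```

The index is the message's position in the file, counting from zero, so it can be used to find the message again in an editor.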
To fix this I had to go back through the imported mbox files and clean up the headers. Since I was doing this in vi over an SSH connection and couldn't see clearly whether there was a newline character or only a line that was too long for the screen, I decided the safest method was just to delete all those overlong headers. They shouldn't be needed in the archive anyway. (The "Received:" and "Delivered-To:" lines had long since been removed by my program, when it saved out a week's files and started a new archive.)
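For anyone who would rather not hunt for these by eye in vi, here is a rough sketch of a scanner that reports the first line in each header block that is neither a "Name: value" field nor an indented continuation line. The function name and the regular expression are my own assumptions, not anything Mailman provides, and the sketch assumes "From " lines in message bodies are already escaped.

```python
import re

# A header field name: printable ASCII except the colon, followed by a colon.
HEADER_FIELD = re.compile(r"^[!-9;-~]+:")

def find_broken_header_lines(mbox_path):
    """Yield (line_number, text) for the first malformed line in each header block."""
    in_headers = False
    # latin-1 decodes any byte sequence, so odd encodings cannot raise here.
    with open(mbox_path, encoding="latin-1") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            if text.startswith("From "):
                in_headers = True          # separator line: a header block follows
            elif in_headers:
                if text == "":
                    in_headers = False     # blank line ends the header block
                elif text[0] in " \t" or HEADER_FIELD.match(text):
                    continue               # well-formed field or folded continuation
                else:
                    yield line_number, text
                    in_headers = False     # a parser would treat the rest as body
```

A wrapped fragment that happens to look like "word:" at the start of a line would slip past this check, so it is a first pass rather than a guarantee.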
I also found some "Date:" lines that had been mistaken from the beginning. One of my subscribers wrote that he had just switched to a Mac in order to clear a Windows-based virus out of his mailbox. Somehow his Macintosh had its system date set to August 27, 1956! Mailman made this the first post on the list, followed by a silence of over 40 years. I went back and corrected the date as well as I could and then indexed the archive all over again.
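Dates like that one can be caught ahead of time by flagging any "Date:" value outside the period the list was actually active. This is a sketch only: the function name and the two boundary arguments are hypothetical, and the caller has to supply timezone-aware boundaries that make sense for the list.

```python
import mailbox
from datetime import timezone
from email.utils import parsedate_to_datetime

def dates_outside_range(mbox_path, earliest, latest):
    """Return (index, raw_date) for messages dated outside [earliest, latest]."""
    suspects = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            raw = message["Date"]
            if raw is None:
                continue                       # missing dates are a separate problem
            raw = str(raw)
            try:
                when = parsedate_to_datetime(raw)
            except (TypeError, ValueError):
                suspects.append((index, raw))  # unparsable counts as suspect
                continue
            if when.tzinfo is None:
                # Treat a zone-less date as UTC so the comparison is well defined.
                when = when.replace(tzinfo=timezone.utc)
            if not earliest <= when <= latest:
                suspects.append((index, raw))
    finally:
        box.close()
    return suspects
```

For a list that ran from the mid-1990s onward, passing `datetime(1995, 1, 1, tzinfo=timezone.utc)` as `earliest` would have flagged the 1956 post.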
Moral: You can import old mbox files to a Mailman archive, but be sure to clean up the headers before you generate the index.
-- Larry Kuenning larry@qhpress.org

On 05/10/2013 08:27 AM, Larry Kuenning wrote:
> Moral: You can import old mbox files to a Mailman archive, but be sure to clean up the headers before you generate the index.
Yes, that's what bin/arch was designed for.
Also, there is a bin/cleanarch tool, but all it does is look for unescaped "From " lines in message bodies.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California. "The highway is for gamblers, better use your sense" - B. Dylan

On 5/10/2013 10:27 AM, Larry Kuenning wrote, in part:
The ">" in front of "From " in message bodies IS REQUIRED. The separator for individual mail messages in an mbox file is the "From ....." line that contains a date and maybe a sender e-mail address. Any other line that begins in column one with the five-character string "From " will be treated as a message separator.
Technically, any character placed in the first column, ahead of "From ", will work; the ">" character is the one that was chosen a long time ago.
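As an illustration of the escaping rule, here is a minimal sketch that operates on the body text of a single message (not on a whole mbox file, where the real separator lines must be left alone). The function names are invented for the example.

```python
import re

UNESCAPED_FROM = re.compile(r"^From ", re.MULTILINE)
ESCAPED_FROM = re.compile(r"^>From ", re.MULTILINE)

def escape_from_lines(body):
    """Prefix '>' to every body line that starts with 'From '."""
    return UNESCAPED_FROM.sub(">From ", body)

def unescape_from_lines(body):
    """Remove the '>' again, e.g. for human-readable display."""
    return ESCAPED_FROM.sub("From ", body)
```

Note that with this simple scheme the round trip is lossy: a body line that was originally written as ">From " comes back as "From " after unescaping.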
--Barry Finkel

On 5/10/2013 3:18 PM, Barry S. Finkel wrote:
> The ">" in front of "From " in message bodies IS REQUIRED.
Of course it's required -- as long as the file is serving as a real mbox. That's why I put the ">" back in before having Mailman index the files for the archives of the new lists. ("Undoing that change was easy enough" is how I put it in my first post about this.)
I had started taking out the ">" long ago when I thought the only future use for these files would be as human-readable text files, where the ">" would be perceived as clutter or as a mistaken attempt to indicate quotation. There's no law that says what was once an mbox file has to remain an mbox file forever if nobody's going to read it except in a notepad or word processor. Which is what I expected at the time.
-- Larry Kuenning larry@qhpress.org

Barry S. Finkel writes:
> The ">" in front of "From " in message bodies IS REQUIRED.
Only by the archive builder.
Specifically, AFAIK you are correct: Pipermail will split an mbox into messages on any line matching "^From ", and leave any ">From " lines in the resulting archive. There are two ways to improve on this.
Generic: Leave the ">" in the mbox file, and use the macro afterward on the split HTML. (I think this is what the cleanarch script does.)
Site-specific: use a more accurate regexp to identify the message separator, possibly augmented by looking for an empty line before and an RFC 822 header afterward. Then you can clean up the mbox file.
The generic method is actually more accurate (in some contexts people actually do post headers in message bodies :), so I recommend it.
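For what it's worth, the site-specific approach might look something like the sketch below. The separator pattern (an envelope sender followed by an asctime-style date), the header pattern, and the function name are all assumptions for illustration; this is not what Pipermail does.

```python
import re

# "From sender Www Mmm dd hh:mm:ss yyyy" as written by typical mbox producers.
SEPARATOR = re.compile(
    r"^From \S+ +"
    r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ \d]\d \d\d:\d\d:\d\d \d{4}$"
)
HEADER_FIELD = re.compile(r"^[!-9;-~]+:")

def split_messages(lines):
    """Split mbox lines (no trailing newlines) into a list of messages.

    A line starts a new message only if it matches SEPARATOR, follows a
    blank line (or is the first line), and is followed by a header field.
    """
    messages = []
    current = None
    for position, line in enumerate(lines):
        previous_blank = position == 0 or lines[position - 1] == ""
        next_is_header = (
            position + 1 < len(lines)
            and HEADER_FIELD.match(lines[position + 1]) is not None
        )
        if SEPARATOR.match(line) and previous_blank and next_is_header:
            current = [line]
            messages.append(current)
        elif current is not None:
            current.append(line)
    return messages
```

A body that quotes a complete separator line followed by headers would still fool this, which is the reason given above for preferring the generic method.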

Participants (4): Barry S. Finkel, Larry Kuenning, Mark Sapiro, Stephen J. Turnbull