
Ivan Van Laningham wrote:
But I have one list for which I used archives from two previous incarnations of the list, plus the current archive mbox, as input to arch. I made sure that the previous archives were in mbox format and that they contained only one "From " line per message.
Are you sure? Did you run bin/cleanarch against the .mbox file to check it?
Once I was convinced they were all ready, I combined the old archive mbox with the current archive mbox using cat, and ran arch.
It worked perfectly, creating archive pages going all the way back to 1999, except that in the archive page for the month in which I ran arch (May) for the day on which I ran it (May 7), I have in the vicinity of 5000 entries for messages with "No subject" and no body. The index page for May looks like this:
# [Guppies] Malice 2008 Suzanne Williams
# No subject
# No subject
# No subject
... 5000 entries
# No subject
# No subject
# [Guppies] harsh words for cheating peg908 at aol.com
# [Guppies] harsh words for cheating Vwright
This usually results from a message containing an embedded "From " line somewhere in the message body. The message is archived properly under its correct date and subject, but that entry is truncated at the line that begins with "From ". The rest of the message is then archived as a separate message. Since it has no From:, Subject: or Date: headers, it is archived with the current date and no subject. Also, text following the "From " line up to the first totally empty (not just blank) line is considered part of the header and is not archived with this 'second' message.
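To make the splitting behaviour concrete, here is a minimal Python sketch. It is not Mailman's archiver code; split_naive and subject_of are hypothetical helpers, and the sketch assumes every line beginning with "From " is treated as a separator, which is a simplification of the real separator matching.

```python
def split_naive(mbox_text):
    """Split mbox text at every line that begins with 'From ' (simplified)."""
    messages, current = [], None
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            if current is not None:
                messages.append(current)
            current = []          # the separator line itself is not kept
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(current)
    return messages


def subject_of(lines):
    """Read headers up to the first totally empty line; return the Subject."""
    for line in lines:
        if line == "":
            break
        if line.lower().startswith("subject:"):
            return line[len("subject:"):].strip()
    return "No subject"


# Made-up sample: one real message whose body contains an unescaped "From ".
SAMPLE = "\n".join([
    "From alice@example.com Mon May  5 10:00:00 2008",
    "From: alice@example.com",
    "Subject: hello",
    "Date: Mon, 5 May 2008 10:00:00 -0700",
    "",
    "Some text.",
    "From here on the text is cut off from the first entry.",
    "This line is swallowed as a 'header' of the bogus entry.",
    "",
    "Only this line would show up as the bogus entry's body.",
    "",
])

subjects = [subject_of(m) for m in split_naive(SAMPLE)]
print(subjects)
```

In this sketch the single real message comes out as two entries: the first keeps its subject but loses everything from the embedded "From " line onward, and the second has no headers at all.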
I tried to find these mysterious entries in the current archive mbox, but they don't appear.
If there is any message body text in the 'No subject' archived entry, you should be able to find that in the .mbox.
The _only_ thing I can see, in the current mbox, is that the end of the last message from the old archives ends on one line and the "From " line for the next message begins on the very next line, with no blank lines between,
That shouldn't cause this.
and everywhere else there are either one or more blank lines or one of those message separator lines from AOL:
"----------MB_8C9379FAFA8ECEC_DAC_6C2A_WEBMAIL-MC05.sysops.aol.com--"
These bogus entries aren't really hurting anything, I suppose, but they are annoying and it is irritating to have to scroll down 5000 lines to get to the next real message.
Actually, they are hurting something, because they represent missing pieces of other messages.
What is causing this? And is there anything I can do to get rid of the problem? I am willing to live with it if I have to, but I would prefer having a fix.
I think you have unescaped "From " lines in the bodies of messages. Run bin/cleanarch (with the -n/--dry-run option) to check.
Another possibility is that you have real-looking but extraneous (duplicate?) "From " lines that are not followed by a real message with Subject: and Date: headers before the next "From " line.
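If you want to look for both kinds of problem lines by hand, a rough Python sketch along these lines may help. It is not a replacement for bin/cleanarch; find_suspect_separators is a hypothetical helper that flags any "From " line whose following header block (up to the first empty line or the next "From " line) lacks a Subject: or Date: header.

```python
def find_suspect_separators(lines):
    """Return 1-based line numbers of 'From ' lines not followed by
    both Subject: and Date: headers before an empty line or next 'From '."""
    suspects = []
    for i, line in enumerate(lines):
        if not line.startswith("From "):
            continue
        headers = set()
        j = i + 1
        while j < len(lines):
            nxt = lines[j]
            if nxt == "" or nxt.startswith("From "):
                break
            # Collect header names; skip folded (indented) continuation lines.
            if ":" in nxt and not nxt[0].isspace():
                headers.add(nxt.split(":", 1)[0].lower())
            j += 1
        if not {"subject", "date"} <= headers:
            suspects.append(i + 1)
    return suspects


# Made-up sample with an embedded "From " (line 7) and a duplicate
# separator (line 10) in front of an otherwise normal message.
sample = [
    "From a@example.com Mon May  5 10:00:00 2008",
    "From: a@example.com",
    "Subject: one",
    "Date: Mon, 5 May 2008 10:00:00 -0700",
    "",
    "body",
    "From the start this was a bad idea.",
    "more body",
    "",
    "From b@example.com Tue May  6 11:00:00 2008",
    "From b@example.com Tue May  6 11:00:00 2008",
    "From: b@example.com",
    "Subject: two",
    "Date: Tue, 6 May 2008 11:00:00 -0700",
    "",
    "body two",
]

print(find_suspect_separators(sample))
```

To run it against a real archive, you could read the file with something like `lines = [l.rstrip("\n") for l in open(path, errors="replace")]`, where path is the location of your list's .mbox file. Each reported line number is a place to look for an unescaped or extraneous "From " line.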