Mark Sapiro wrote:
Ivan Van Laningham wrote:
But I have one list for which I used archives from two previous incarnations of the list, plus the current archive mbox, as input to arch. I made sure that the previous archives were in mbox format and that they contained only one "From " line per message.
Are you sure? Did you run bin/cleanarch against the .mbox file to check it?
I ran cleanarch, yes, but all it did was to escape every single "From " line, which would make arch think there was only one message.
This usually results from a message containing an embedded "From " somewhere in the message body. The message is archived properly under its correct date and subject, but that entry is truncated at the line that begins with "From ". Then the rest of the message is archived as a separate message. Since it has no From:, Subject: or Date: headers, it is archived with the current date and no subject. Also, text following the "From " up to the first totally empty (not just blank) line is considered part of the header and is not archived with this 'second' message.
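To make that concrete, here is a minimal sketch of the splitting behaviour described above. It is a simplified model, not the archiver's actual code: the function names split_on_from and split_headers and the sample message are made up for illustration.

```python
SAMPLE = (
    "From alice@example.com Mon Jan  1 10:00:00 2007\n"
    "From: alice@example.com\n"
    "Subject: Trip report\n"
    "Date: Mon, 1 Jan 2007 10:00:00 +0000\n"
    "\n"
    "We drove north.\n"
    "From there we took the ferry.\n"   # unescaped "From " in the body
    "It was cold.\n"
    "\n"
    "The end.\n"
)


def split_on_from(text):
    """Start a new chunk at every line that begins with 'From '."""
    chunks = []
    for line in text.splitlines():
        if line.startswith("From "):
            chunks.append([line])
        elif chunks:
            chunks[-1].append(line)
    return chunks


def split_headers(chunk):
    """Treat everything up to the first totally empty line as the header block."""
    rest = chunk[1:]  # chunk[0] is the "From " line itself
    end = rest.index("") if "" in rest else len(rest)
    headers = {}
    for line in rest[:end]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    body = "\n".join(rest[end + 1:])
    return headers, body


for chunk in split_on_from(SAMPLE):
    headers, body = split_headers(chunk)
    print(headers.get("subject", "No subject"), "|", repr(body))
```

In this model the first entry keeps its subject but loses everything after "We drove north.", and the second entry has no subject and only "The end." as its body, because "It was cold." was swallowed as a header line.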
That would describe what I'm seeing, except that--
If there is any message body text in the 'No subject' archived entry, you should be able to find that in the .mbox.
Right, but there are 5,000 entries with "No subject" and no body, not a hint of a body.
The _only_ thing I can see, in the current mbox, is that the last message from the old archives ends on one line and the "From " line for the next message begins on the very next line, with no blank lines between,
That shouldn't cause this.
Good to know.
and everywhere else there are either one or more blank lines or one of those message separator lines from AOL:
These bogus entries aren't really hurting anything, I suppose, but they are annoying and it is irritating to have to scroll down 5000 lines to get to the next real message.
Actually, they are hurting something, because they represent missing pieces of other messages.
How to track them down?
What is causing this? And is there anything I can do to get rid of the problem? I am willing to live with it if I have to, but I would prefer having a fix.
I think you have unescaped "From " lines in the bodies of messages. Run bin/cleanarch (with the -n/--dry-run option) to check.
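If you want to eyeball the candidates yourself, a rough sketch along these lines lists every line that starts with "From " but does not look like a real separator. This is not cleanarch's code. The SEPARATOR pattern is my own assumption about what a typical separator looks like (address, weekday, month, day, time, optional zone, year), and suspicious_from_lines is a made-up name.

```python
import re

# Assumed shape of a genuine separator, e.g.
# "From alice@example.com Mon Jan  1 10:00:00 2007"
SEPARATOR = re.compile(
    r"^From \S+\s+"
    r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+"
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+"
    r"\d{1,2}\s+\d{2}:\d{2}(:\d{2})?\s+"
    r"(\S+\s+)?\d{4}\s*$"
)


def suspicious_from_lines(lines):
    """Return (line_number, text) for 'From ' lines that fail the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("From ") and not SEPARATOR.match(line):
            hits.append((number, line.rstrip("\r\n")))
    return hits
```

It takes any iterable of lines, so you could pass it an open file, for example open("listname.mbox", encoding="latin-1"), where listname.mbox stands in for your own file. If it flags every separator in your mbox, the separators themselves are in a form this pattern does not recognize, which is worth knowing too.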
Another possibility is that you have real-looking but extraneous (duplicate?) "From " lines that are not followed by a real message with Subject: and Date: headers prior to the next "From ".
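To track those down, something like the following sketch reports the line number of every "From " line whose header block lacks Subject: or Date:. It is only an illustration under my own assumptions: the name separators_missing_headers is made up, the header block is taken to end at the first totally empty line, and only lines with "From " in column 0 are examined.

```python
def separators_missing_headers(lines, required=("subject", "date")):
    """Return line numbers of 'From ' lines not followed by the required headers."""
    blocks = []          # one (line_number, header_names) pair per "From " line
    in_headers = False
    for number, line in enumerate(lines, start=1):
        if line.startswith("From "):
            blocks.append((number, set()))
            in_headers = True
        elif in_headers:
            if line.rstrip("\r\n") == "":
                in_headers = False          # totally empty line ends the headers
            elif not line[0].isspace():     # skip folded continuation lines
                name, sep, _ = line.partition(":")
                if sep:
                    blocks[-1][1].add(name.strip().lower())
    return [n for n, names in blocks if not set(required) <= names]
```

Each number it returns is a place to look in the .mbox: either an unescaped "From " inside a message body or a stray separator with no message behind it.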
Do lines beginning with whitespace before a From count? There are about a hundred of those in the input mbox.