Ivan Van Laningham wrote:
Ah. Now we're getting somewhere. Here are some sample "From " lines:
- From the current list.mbox (leading '> ' not part of actual line):
From Lizzelvin@aol.com Sun Mar 18 18:17:56 2007
This is a normal Unix From_
- From the old mbox which I want to incorporate (leading '> ' inserted):
From "robyn m. fritz" email@example.com
From Mochie@webtv.net (C Ryplansky)
These are non-standard separators
And here is the _fromlinepattern:
_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
Now, I don't understand much of this pattern, but it looks to me as if a) there's no provision for matching " or < or > characters; and b) some sort of date/time mark is required.
This pattern is used by cleanarch to distinguish a standard Unix From_ separator from other lines that just happen to begin with "From ". It matches "From " followed by whitespace-delimited fields containing:
- any non-whitespace - email address
- 3 alphanumerics - day of week
- 3 alphanumerics - month
- 1 or 2 digits - day of month
- 1 or 2 digits, colon, 2 digits, optional colon and 2 digits - hh:mm(:ss)
- optional any non-whitespace - time zone offset
- 4 digits - year
So yes, it looks for a single email address and a date in a specific format. The email address can be bracketed (e.g. firstname.lastname@example.org) and doesn't really have to look like a valid email address, but it can't contain whitespace. Thus it can't include a 'real name' unless the name itself contains no whitespace, such as email@example.com.
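To see what the pattern accepts, here is a minimal Python sketch run against the three sample lines quoted earlier. The first half of the expression is the pattern as quoted. The second half (time, optional time zone, year) is reconstructed from the field description, so treat the full expression as an assumption rather than a verbatim copy of cleanarch. The name is_unix_from is hypothetical.

```python
import re

# First half as quoted above; the second half (time, optional zone, year)
# is an assumption reconstructed from the field-by-field description.
FROMLINE = re.compile(
    r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
)

samples = [
    "From Lizzelvin@aol.com Sun Mar 18 18:17:56 2007",
    'From "robyn m. fritz" email@example.com',
    "From Mochie@webtv.net (C Ryplansky)",
]


def is_unix_from(line):
    """Return True if line looks like a standard Unix From_ separator."""
    return FROMLINE.match(line) is not None


for line in samples:
    print(is_unix_from(line), line)
```

Only the first sample matches. The other two fail as soon as the field after the first non-whitespace token is not a three-letter day of week.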
This is only used by cleanarch. Pipermail doesn't care about the contents of the From_ separator. It assumes any line that begins with "From " is a separator and ignores the rest of the line.
All the "From " lines are terminated with a \n, and all are followed immediately by what look like valid message header lines, so I don't think those are problems. There do appear to be 1006 unescaped "From " lines in the old mbox:
$ grep '^From ' guppies-out.mbox | wc
46295 163728 1800087
$ grep '^From: ' guppies-out.mbox | wc
45289 159710 1803623
This seems to indicate a problem, but still doesn't account for 5000 spurious archive entries.
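The same comparison (46295 - 45289 = 1006) can be sketched in Python. This is only an illustration: the helper names are hypothetical, the sample lines are made up, and the estimate assumes every message carries exactly one "From: " header at the start of a line.

```python
def count_prefix(lines, prefix):
    """Count lines that start with the given prefix."""
    return sum(1 for line in lines if line.startswith(prefix))


def unescaped_from_estimate(lines):
    """Lines starting with "From " minus "From: " headers, as in the grep comparison."""
    return count_prefix(lines, "From ") - count_prefix(lines, "From: ")


# Tiny made-up mbox fragment, for illustration only.
sample = [
    "From alice@example.com Sun Mar 18 18:17:56 2007",
    "From: alice@example.com",
    "Subject: guppies",
    "",
    "From what I have seen, they like warm water.",
]
print(unescaped_from_estimate(sample))
```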
So, if I process the old mbox and convert the "From " lines without dates into "From " lines without " and <> and add a date/time stamp, and THEN run cleanarch, cleanarch should escape only the 1006 non-matching "From " lines, and I should end up with an mbox I can combine with March, April and May of 2007 from the current list. Is that a correct assessment?
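The conversion described here could be sketched as below. This is a rough illustration under stated assumptions: normalize_from is a hypothetical helper, the address is taken to be the first token containing "@", and the date must be supplied by the caller (for example from the message's Date: header), since the old separators carry none.

```python
import re
import time


def normalize_from(line, when):
    """Rebuild a non-standard "From " line as "From addr asctime-date".

    `when` is a time.struct_time; where it comes from (for example the
    message's Date: header) is left to the caller.
    """
    # Assumption: the address is the first token containing "@",
    # with any quotes, angle brackets and parentheses excluded.
    match = re.search(r"[^\s<>\"()]+@[^\s<>\"()]+", line)
    if match is None:
        return None  # no address found; leave the line for manual review
    return "From %s %s" % (match.group(0), time.asctime(when))


stamp = time.strptime("2007-03-18 18:17:56", "%Y-%m-%d %H:%M:%S")
print(normalize_from("From Mochie@webtv.net (C Ryplansky)", stamp))
```

time.asctime produces the "Sun Mar 18 18:17:56 2007" layout used by the normal separator quoted at the top.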
That is correct, but if you can process the old mbox and identify which "From " lines without dates are actually message separators, then you should be able to identify which ones are not message separators and just escape those. I.e. create your own archive cleaner specific to this situation.
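A situation-specific cleaner along those lines might look like the following sketch. The heuristic is an assumption, not cleanarch's rule: a "From " line is treated as a separator only if it follows a blank line (or starts the file) and is immediately followed by something shaped like a header line. The function name is hypothetical.

```python
import re

# A header field name: printable ASCII except ":", followed by a colon.
HEADER = re.compile(r"^[!-9;-~]+:")


def escape_non_separators(lines):
    """Prefix ">" to "From " lines that do not look like message separators.

    Heuristic (an assumption, not cleanarch's rule): a separator is preceded
    by a blank line or the start of the file and followed by a header line.
    """
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            after_blank = i == 0 or lines[i - 1].strip() == ""
            before_header = i + 1 < len(lines) and HEADER.match(lines[i + 1])
            if not (after_blank and before_header):
                line = ">" + line
        out.append(line)
    return out


old_mbox = [
    'From "robyn m. fritz" email@example.com',
    "Subject: guppies",
    "",
    "From what I have seen, they like warm water.",
    "",
    "From Mochie@webtv.net (C Ryplansky)",
    "Subject: Re: guppies",
]
for line in escape_non_separators(old_mbox):
    print(line)
```

A body line that begins with "From ", follows a blank line, and is followed by text such as "Note: ..." would still be mistaken for a separator, so the result is worth spot-checking against the expected message count.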