Hi All-- I have managed to recover and restore all the archives, covering eight or nine years, for all my mailing lists, following the excellent advice and pointers given by members of this list.
But I have one list for which I used archives from two previous incarnations of the list, plus the current archive mbox, as input to arch. I made sure that the previous archives were in mbox format and that they contained only one "From " line per message. Once I was convinced they were all ready, I combined the old archive mbox with the current archive mbox using cat, and ran arch.
It worked perfectly, creating archive pages going all the way back to 1999, except that in the archive page for the month in which I ran arch (May) for the day on which I ran it (May 7), I have in the vicinity of 5000 entries for messages with "No subject" and no body. The index page for May looks like this:
# [Guppies] Malice 2008 Suzanne Williams
# No subject
# No subject
# No subject
... 5000 entries
# No subject
# No subject
# [Guppies] harsh words for cheating peg908 at aol.com
# [Guppies] harsh words for cheating Vwright
I tried to find these mysterious entries in the current archive mbox, but they don't appear. The _only_ thing I can see, in the current mbox, is that the end of the last message from the old archives ends on one line and the "From " line for the next message begins on the very next line, with no blank lines between, and everywhere else there are either one or more blank lines or one of those message separator lines from AOL:
"----------MB_8C9379FAFA8ECEC_DAC_6C2A_WEBMAIL-MC05.sysops.aol.com--"
These bogus entries aren't really hurting anything, I suppose, but they are annoying and it is irritating to have to scroll down 5000 lines to get to the next real message.
What is causing this? And is there anything I can do to get rid of the problem? I am willing to live with it if I have to, but I would prefer having a fix.
Thanks!
Metta, Ivan
Ivan Van Laningham God N Locomotive Works http://www.pauahtun.org/ http://www.python.org/workshops/1998-11/proceedings/papers/laningham/laningh... Army Signal Corps: Cu Chi, Class of '70 Author: Teach Yourself Python in 24 Hours
Ivan Van Laningham wrote:
But I have one list for which I used archives from two previous incarnations of the list, plus the current archive mbox, as input to arch. I made sure that the previous archives were in mbox format and that they contained only one "From " line per message.
Are you sure? Did you run bin/cleanarch against the .mbox file to check it?
Once I was convinced they were all ready, I combined the old archive mbox with the current archive mbox using cat, and ran arch.
It worked perfectly, creating archive pages going all the way back to 1999, except that in the archive page for the month in which I ran arch (May) for the day on which I ran it (May 7), I have in the vicinity of 5000 entries for messages with "No subject" and no body. The index page for May looks like this:
# [Guppies] Malice 2008 Suzanne Williams
# No subject
# No subject
# No subject
... 5000 entries
# No subject
# No subject
# [Guppies] harsh words for cheating peg908 at aol.com
# [Guppies] harsh words for cheating Vwright
This usually results from a message containing an embedded "From " somewhere in the message body. The message is archived properly under its correct date and subject, but that entry is truncated at the line that begins with "From ". Then the rest of the message is archived as a separate message. Since it has no From:, Subject: or Date: headers, it is archived with the current date and no subject. Also, text following the "From " up to the first totally empty (not just blank) line is considered part of the header and is not archived with this 'second' message.
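The splitting described above can be illustrated with a minimal Python sketch. This is not Pipermail's actual parser: `split_on_from` and the sample text are hypothetical, and the only assumption carried over from the thread is that any line beginning with "From " is treated as a message separator.

```python
def split_on_from(text):
    """Split mbox-style text on every line that starts with "From ".

    Hypothetical helper for illustration only; not Pipermail's parser.
    """
    messages = []
    current = None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # A separator closes the message in progress and opens a new one.
            if current is not None:
                messages.append("".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("".join(current))
    return messages


# One real message whose body contains an unescaped line starting with "From ".
SAMPLE = (
    "From alice@example.com Mon May  7 10:00:00 2007\n"
    "From: alice@example.com\n"
    "Subject: [Guppies] fry\n"
    "Date: Mon, 7 May 2007 10:00:00 -0600\n"
    "\n"
    "The fry are doing well.\n"
    "From now on I will feed them twice a day.\n"
    "\n"
    "They seem to like it.\n"
)

parts = split_on_from(SAMPLE)
print(len(parts))                # the single message was cut in two
print("Subject:" in parts[1])    # the second piece carries no headers
```

The second piece has no From:, Subject: or Date: header of its own, which is the kind of fragment that would be filed under the current date with "No subject".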
I tried to find these mysterious entries in the current archive mbox, but they don't appear.
If there is any message body text in the 'No subject' archived entry, you should be able to find that in the .mbox.
The _only_ thing I can see, in the current mbox, is that the end of the last message from the old archives ends on one line and the "From " line for the next message begins on the very next line, with no blank lines between,
That shouldn't cause this.
and everywhere else there are either one or more blank lines or one of those message separator lines from AOL:
"----------MB_8C9379FAFA8ECEC_DAC_6C2A_WEBMAIL-MC05.sysops.aol.com--"
These bogus entries aren't really hurting anything, I suppose, but they are annoying and it is irritating to have to scroll down 5000 lines to get to the next real message.
They are actually, because they represent missing pieces of other messages.
What is causing this? And is there anything I can do to get rid of the problem? I am willing to live with it if I have to, but I would prefer having a fix.
I think you have unescaped "From " lines in the bodies of messages. Run bin/cleanarch (with the -n/--dry-run option) to check.
Another possibility is you have real looking but extraneous (duplicate?) "From " lines not followed by a real message with Subject: and Date: headers prior to the next "From ".
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Hi All--
Mark Sapiro wrote:
Ivan Van Laningham wrote:
But I have one list for which I used archives from two previous incarnations of the list, plus the current archive mbox, as input to arch. I made sure that the previous archives were in mbox format and that they contained only one "From " line per message.
Are you sure? Did you run bin/cleanarch against the .mbox file to check it?
I ran cleanarch, yes, but all it did was to escape every single "From " line, which would make arch think there was only one message.
This usually results from a message containing an embedded "From " somewhere in the message body. The message is archived properly under its correct date and subject, but that entry is truncated at the line that begins with "From ". Then the rest of the message is archived as a separate message. Since it has no From:, Subject: or Date: headers, it is archived with the current date and no subject. Also, text following the "From " up to the first totally empty (not just blank) line is considered part of the header and is not archived with this 'second' message.
That would describe what I'm seeing, except that--
If there is any message body text in the 'No subject' archived entry, you should be able to find that in the .mbox.
Right, but there are 5,000 entries with "No subject" and no body, not a hint of a body.
The _only_ thing I can see, in the current mbox, is that the end of the last message from the old archives ends on one line and the "From " line for the next message begins on the very next line, with no blank lines between,
That shouldn't cause this.
Good to know.
and everywhere else there are either one or more blank lines or one of those message separator lines from AOL:
"----------MB_8C9379FAFA8ECEC_DAC_6C2A_WEBMAIL-MC05.sysops.aol.com--"
These bogus entries aren't really hurting anything, I suppose, but they are annoying and it is irritating to have to scroll down 5000 lines to get to the next real message.
They are actually, because they represent missing pieces of other messages.
How to track them down?
What is causing this? And is there anything I can do to get rid of the problem? I am willing to live with it if I have to, but I would prefer having a fix.
I think you have unescaped "From " lines in the bodies of messages. Run bin/cleanarch (with the -n/--dry-run option) to check.
Another possibility is you have real looking but extraneous (duplicate?) "From " lines not followed by a real message with Subject: and Date: headers prior to the next "From ".
Do lines beginning with whitespace before a From count? There are about a hundred of those in the input mbox.
Metta, Ivan
Ivan Van Laningham wrote:
I ran cleanarch, yes, but all it did was to escape every single "From " line, which would make arch think there was only one message.
Then either the From line doesn't match the pattern mailbox.UnixMailbox._fromlinepattern or it is not followed immediately (with no intervening lines or maybe even '\r') by a line that looks like a message header.
If there is intervening whitespace between the "From " line and the message headers, that may cause the spurious archived empty messages.
Do lines beginning with whitespace before a From count? There are about a hundred of those in the input mbox.
They shouldn't be a problem.
Hi All--
Mark Sapiro wrote:
Ivan Van Laningham wrote:
I ran cleanarch, yes, but all it did was to escape every single "From " line, which would make arch think there was only one message.
Then either the From line doesn't match the pattern mailbox.UnixMailbox._fromlinepattern or it is not followed immediately (with no intervening lines or maybe even '\r') by a line that looks like a message header.
If there is intervening whitespace between the "From " line and the message headers, that may cause the spurious archived empty messages.
Ah. Now we're getting somewhere. Here are some sample "From " lines:
- From the current list.mbox (leading '> ' not part of actual line):
> From Lizzelvin@aol.com Sun Mar 18 18:17:56 2007
- From the old mbox which I want to incorporate (leading '> ' inserted):
> From "robyn m. fritz" <rfritz@nwlink.com>
or
> From Mochie@webtv.net (C Ryplansky)
And here is the _fromlinepattern:
_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
Now, I don't understand much of this pattern, but it looks to me as if a) there's no provision for matching " or < or > characters; and b) some sort of date/time mark is required.
All the "From " lines are terminated with a \n, and all are followed immediately by what look like valid message header lines, so I don't think those are problems. There do appear to be 1006 unescaped "From " lines in the old mbox:
$ grep '^From ' guppies-out.mbox | wc
46295 163728 1800087
$ grep '^From: ' guppies-out.mbox | wc
45289 159710 1803623
So, if I process the old mbox and convert the "From " lines without dates into "From " lines without " and <> and add a date/time stamp, and THEN run cleanarch, cleanarch should escape only the 1006 non-matching "From " lines, and I should end up with an mbox I can combine with March, April and May of 2007 from the current list. Is that a correct assessment?
Metta, Ivan
Ivan Van Laningham wrote:
Ah. Now we're getting somewhere. Here are some sample "From " lines:
- From the current list.mbox (leading '> ' not part of actual line):
> From Lizzelvin@aol.com Sun Mar 18 18:17:56 2007
This is a normal Unix From_
- From the old mbox which I want to incorporate (leading '> ' inserted):
> From "robyn m. fritz" <rfritz@nwlink.com>
or
> From Mochie@webtv.net (C Ryplansky)
These are non-standard separators
And here is the _fromlinepattern:
_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
Now, I don't understand much of this pattern, but it looks to me as if a) there's no provision for matching " or < or > characters; and b) some sort of date/time mark is required.
This pattern is used by cleanarch to try to separate a standard Unix From_ separator from other lines that just happen to begin with "From ". It matches "From " followed by whitespace-delimited fields containing:
- any non-whitespace (email address)
- 3 alphanumerics (day of week)
- 3 alphanumerics (month)
- 1 or 2 digits (day of month)
- 1 or 2 digits, colon, 2 digits, optional colon and 2 digits (hh:mm(:ss))
- optional any non-whitespace (time zone offset)
- 4 digits (year)
So yes, it looks for a single email address and a date in a specific format. The email address can be bracketed - <user@example.com> and doesn't really have to look like a valid email address, but it can't contain whitespace, thus it can't have a 'real name' unless it has no whitespace such as johnsmith<jsmith@example.com>.
This is only used by cleanarch. Pipermail doesn't care about the contents of the From_ separator. It assumes any line that begins with "From " is a separator and ignores the rest of the line.
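As a quick check of the field-by-field reading above, the pattern quoted in the thread can be run against the three sample lines with Python's `re` module. This is only a sketch: `FROMLINE`, `SAMPLES`, and `is_unix_from` are names made up for the example, and it exercises the pattern on its own rather than calling cleanarch.

```python
import re

# Pattern as quoted in the thread from mailbox.UnixMailbox._fromlinepattern.
FROMLINE = re.compile(
    r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
)

# The three sample separator lines quoted in the thread.
SAMPLES = [
    "From Lizzelvin@aol.com Sun Mar 18 18:17:56 2007",
    'From "robyn m. fritz" <rfritz@nwlink.com>',
    "From Mochie@webtv.net (C Ryplansky)",
]


def is_unix_from(line):
    """Return True if the line matches the standard Unix From_ pattern."""
    return FROMLINE.match(line) is not None


results = [is_unix_from(s) for s in SAMPLES]
for line, ok in zip(SAMPLES, results):
    print("match   " if ok else "no match", line)
```

Only the first sample matches. The two lines from the old mbox fail because the address field contains whitespace and no date follows it.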
All the "From " lines are terminated with a \n, and all are followed immediately by what look like valid message header lines, so I don't think those are problems. There do appear to be 1006 unescaped "From " lines in the old mbox:
$ grep '^From ' guppies-out.mbox | wc
46295 163728 1800087
$ grep '^From: ' guppies-out.mbox | wc
45289 159710 1803623
This seems to indicate a problem, but still doesn't account for 5000 spurious archive entries.
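The same comparison can be sketched in Python, for anyone who prefers it to grep. `count_from_lines` is a hypothetical helper, and the in-memory sample stands in for the real mbox. The difference between the two counts (46295 - 45289 = 1006 in the figures above) is only a rough indicator, since a From: header can also appear inside a forwarded message body.

```python
def count_from_lines(lines):
    """Count "From " separator candidates and "From: " header lines.

    Mirrors grep '^From ' and grep '^From: ' from the thread.
    """
    separators = 0
    from_headers = 0
    for line in lines:
        if line.startswith("From "):
            separators += 1
        elif line.startswith("From: "):
            from_headers += 1
    return separators, from_headers


# Small in-memory sample; for a real file, pass the open file object instead,
# e.g. open("guppies-out.mbox", errors="replace").
sample = [
    "From alice@example.com Mon May  7 10:00:00 2007\n",
    "From: alice@example.com\n",
    "Subject: test\n",
    "\n",
    "From here on the body continues.\n",
]

separators, from_headers = count_from_lines(sample)
print(separators - from_headers, "candidate stray From_ line(s)")
```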
So, if I process the old mbox and convert the "From " lines without dates into "From " lines without " and <> and add a date/time stamp, and THEN run cleanarch, cleanarch should escape only the 1006 non-matching "From " lines, and I should end up with an mbox I can combine with March, April and May of 2007 from the current list. Is that a correct assessment?
That is correct, but if you can process the old mbox and identify which "From " lines without dates are actually message separators, then you should be able to identify which ones are not message separators and just escape those. I.e. create your own archive cleaner specific to this situation.
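One way such a situation-specific cleaner might look is sketched below in Python. It is not the script actually used in this thread. `escape_non_separators` and `HEADER_RE` are hypothetical, and the heuristic is an assumption: a "From " line is kept as a separator only when the next line looks like a message header, otherwise it is escaped with a leading ">".

```python
import re

# Assumed heuristic: a header line is a field name, a colon, then whitespace
# or end of line. This is stricter than the mail standards require.
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:(\s|$)")


def escape_non_separators(lines):
    """Escape "From " lines that are not followed by a header-like line."""
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            next_line = lines[i + 1] if i + 1 < len(lines) else ""
            if not HEADER_RE.match(next_line):
                # Not followed by headers, so treat it as body text.
                line = ">" + line
        out.append(line)
    return out


sample = [
    'From "robyn m. fritz" <rfritz@nwlink.com>\n',
    "Subject: [Guppies] fry\n",
    "\n",
    "From now on I will feed them twice a day.\n",
    "They seem to like it.\n",
]

cleaned = escape_non_separators(sample)
print("".join(cleaned), end="")
```

A forwarded message quoted in a body ("From " followed by real-looking headers) would still pass this test, so the output is worth checking with cleanarch's dry-run option afterwards.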
Hi All--
Mark Sapiro wrote:
So, if I process the old mbox and convert the "From " lines without dates into "From " lines without " and <> and add a date/time stamp, and THEN run cleanarch, cleanarch should escape only the 1006 non-matching "From " lines, and I should end up with an mbox I can combine with March, April and May of 2007 from the current list. Is that a correct assessment?
That is correct, but if you can process the old mbox and identify which "From " lines without dates are actually message separators, then you should be able to identify which ones are not message separators and just escape those. I.e. create your own archive cleaner specific to this situation.
Which is exactly what I did. I ran cleanarch on the result, and it found four instances of bad email addresses, as in "foo bar"@spam.org (the " were part of the address), but luckily, those four instances were forwarded messages, and did indeed need to be escaped.
OK. Now I have a large inbox to re-process (110 MB), but before I do that, I have to remove all the previously processed messages from the current archive. The FAQ ("3.3. How can I remove a post from the list archive / remove an entire archive?") says to "edit the raw archive".
Editing 122 MB of raw archive is going to take some time, since I have to throw away 110 MB of it. I'd like to prevent new messages from coming into the system while I'm editing it, and I seem to be overlooking instructions on how to lock the list. I find that the help message for withlist tells me how to lock the list while I operate on it using withlist, but is that what I want? Can I vi/emacs the mbox while it is locked with withlist?
Am I simply obtuse, or is there no way to lock the list while I'm editing? Or do I throw caution to the winds and blithely edit without concern for incoming messages?
Thanks for all your help, and patience.
Metta, Ivan
On May 25, 2007, at 9:30 AM, Ivan Van Laningham wrote:
Editing 122 MB of raw archive is going to take some time, since I have to throw away 110 MB of it. I'd like to prevent new messages from coming into the system while I'm editing it, and I seem to be overlooking instructions on how to lock the list. I find that the help message for withlist tells me how to lock the list while I operate on it using withlist, but is that what I want? Can I vi/emacs the mbox while it is locked with withlist?
Am I simply obtuse, or is there no way to lock the list while I'm editing? Or do I throw caution to the winds and blithely edit without concern for incoming messages?
Hi Ivan,
Can you just turn off mailmanctl while you're editing the inbox?
Okay, this will shut down your Mailman system globally, which you
might not want to do, but it's as safe as it gets.
If you specifically want to block messages just to the list you're
editing, then yes, lock it with bin/withlist and just edit the mbox.
Another option is to restart mailmanctl but temporarily disable the
ArchRunner. You should be safe to edit the mbox file then, and just
delay updating all your archives (including the mbox archives) until
you've finished your surgery, while the rest of the system continues
to churn away.
Cheers,
-Barry
Hi All-- Ah, thanks, Barry. I will try one of these methods, probably the withlist one, and report back later today. If all goes well, I'll be able to update two FAQ entries, the one about removing archive entries and the one about importing messages/archives into your mailing list.
Metta, Ivan
Barry Warsaw wrote:
Can you just turn off mailmanctl while you're editing the inbox? Okay, this will shut down your Mailman system globally, which you might not want to do, but it's as safe as it gets.
If you specifically want to block messages just to the list you're editing, then yes, lock it with bin/withlist and just edit the mbox.
Another option is to restart mailmanctl but temporarily disable the ArchRunner. You should be safe to edit the mbox file then, and just delay updating all your archives (including the mbox archives) until you've finished your surgery, while the rest of the system continues to churn away.