Folks,
I've just had to rebuild an archive for a list, thus causing pipermail to regenerate message numbers and breaking all the links that previously used to work. I'd like to try to figure out a way to avoid that problem.
Looking at pipermail.py in processUnixMailbox, around lines 565-568, the critical code appears to be:
msgid = m.get('message-id', 'n/a')
self.message(_('#%(counter)05d %(msgid)s'))
a = self._makeArticle(m, self.sequence)
self.sequence += 1
Am I missing something here? Could we simply rip out the references to "self.sequence" and instead drop in a call to "md5(m)"? Of course, we'd also have to adjust the initialization of "sequence" to be the empty string and make sure that all other references to it are as a string instead of a number, but I don't think that would be too hard.
Certainly, it seems that this would be an easy way to avoid having to break all existing links because the sequence number arbitrarily changed when the archive was regenerated.
Or does Python not have an internal implementation of MD5, or even a shim to a library function that does MD5?
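For what it's worth, Python does ship MD5 in its standard library (the md5 module at the time of this thread, hashlib in current versions). The following is only a minimal sketch of the idea, not a patch to pipermail.py. It hashes the Message-ID header rather than the whole message object, on the assumption that the header is present and unique, and falls back to the full message text otherwise. The helper name stable_article_id and the example address are hypothetical.

```python
import hashlib
from email.message import Message


def stable_article_id(msg: Message) -> str:
    """Return an identifier that depends only on the message itself.

    Assumption: the Message-ID header is present and unique. When it is
    missing, fall back to hashing the full message text.
    """
    msgid = msg.get('message-id')
    source = msgid.strip() if msgid else msg.as_string()
    return hashlib.md5(source.encode('utf-8', 'replace')).hexdigest()


# Hypothetical message, for illustration only.
m = Message()
m['Message-ID'] = '<example-1@list.example.org>'
m.set_payload('body text')
print(stable_article_id(m))  # 32 hex characters, identical on every rebuild
```

Because the identifier is derived from the message rather than from its position in the mbox, deleting or reordering other messages would not change it.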
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
Brad Knowles wrote:
I've just had to rebuild an archive for a list, thus causing pipermail to regenerate message numbers and breaking all the links that previously used to work. I'd like to try to figure out a way to avoid that problem.
Looking at pipermail.py in processUnixMailbox, around lines 565-568, the critical code appears to be:
msgid = m.get('message-id', 'n/a')
self.message(_('#%(counter)05d %(msgid)s'))
a = self._makeArticle(m, self.sequence)
self.sequence += 1
Am I missing something here? Could we simply rip out the references to "self.sequence" and instead drop in a call to "md5(m)"? Of course, we'd also have to adjust the initialization of "sequence" to be the empty string and make sure that all other references to it are as a string instead of a number, but I don't think that would be too hard.
Certainly, it seems that this would be an easy way to avoid having to break all existing links because the sequence number arbitrarily changed when the archive was regenerated.
This is not a comment on Brad's suggested hack, which I actually think is a good idea that could save much grief. Rather, this is an attempt to suggest a recovery from the immediate situation.
In my admittedly limited experience, I have not seen message sequence numbers "arbitrarily" changed. Sequence numbers are assigned to messages in the order that they are read from the mbox file(s). Thus they change only if:
1) messages are added to or deleted from the mbox files. This can be avoided by adding only to the end of the last (current) mbox if necessary and, instead of totally deleting messages, just deleting most of the body and perhaps changing the subject. There is a caveat about deleting messages in http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq03.003.htp
2) there is more than one mbox file and they are not processed in the original sequence.
3) there is more than one mbox file and when the archive was last built, the current listname.mbox/listname.mbox file was not processed last.
Note that even case 3 can be recovered from: move the messages that existed when the archive was last rebuilt into a separate mbox which is processed in the original sequence, leave the recent messages in the listname.mbox/listname.mbox file, and process it last.
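To illustrate the point about read order, here is a small sketch (not pipermail's actual code) that numbers messages the way described above: one counter, incremented in the order the messages are read. Each mbox is modelled as a plain list, and the Message-IDs and the number_messages helper are made up for the example.

```python
def number_messages(mboxes):
    """Assign sequence numbers in read order across a list of mboxes.

    Each mbox is modelled as a plain list of Message-ID strings.
    """
    numbers = {}
    sequence = 0
    for mbox in mboxes:
        for msgid in mbox:
            numbers[msgid] = sequence
            sequence += 1
    return numbers


old = ['<a@x>', '<b@x>', '<c@x>']
current = ['<d@x>', '<e@x>']

original = number_messages([old, current])

# Case 1: deleting <b@x> shifts every later message down by one.
after_delete = number_messages([['<a@x>', '<c@x>'], current])

# Cases 2 and 3: reading the mboxes in a different order renumbers everything.
reordered = number_messages([current, old])

print(original['<c@x>'], after_delete['<c@x>'], reordered['<c@x>'])  # 2 1 4
```

The same message ends up with three different numbers, which is why any saved link to it breaks.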
-- Mark Sapiro <msapiro@value.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
At 2:49 PM -0800 2005-02-05, Mark Sapiro wrote:
In my admittedly limited experience, I have not seen message sequence numbers "arbitrarily" changed. Sequence numbers are assigned to messages in the order that they are read from the mbox file(s). Thus they change only if:
No, they're not arbitrarily changed. But they are changed if you have to rebuild the archives, something that I have had to do several times this past week. And something that I've had to do in the past, and on which I filed a bug report which seems to have disappeared. And something that will be an issue for anyone who follows the instructions in FAQ 3.3, and has to delete an old message in the archives.
1) messages are added to or deleted from the mbox files. This can be avoided by adding only to the end of the last (current) mbox if necessary and, instead of totally deleting messages, just deleting most of the body and perhaps changing the subject. There is a caveat about deleting messages in http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq03.003.htp
At least a couple of times this year we have gotten requests to remove messages from the mailman-users and mailman-developers archives, which would have broken all the existing links over the last couple of years to those lists if we had implemented the instructions in FAQ 3.3.
We're still trying to work out how we can find a way to comply with the requests with regards to the publicly accessible version of the archives we maintain, without actually editing the source mbox files.
Note that even case 3 can be recovered from: move the messages that existed when the archive was last rebuilt into a separate mbox which is processed in the original sequence, leave the recent messages in the listname.mbox/listname.mbox file, and process it last.
A fairly complex process to go through, and I'm still not convinced that it would work if you had to delete a message in the archives from a couple of years ago.
I'd like to try to find a better way to solve this problem once and for all, so that the sequence id would never be changed.
-- Brad Knowles, <brad@stop.mail-abuse.org>
Brad Knowles wrote:
At 2:49 PM -0800 2005-02-05, Mark Sapiro wrote:
In my admittedly limited experience, I have not seen message sequence numbers "arbitrarily" changed. Sequence numbers are assigned to messages in the order that they are read from the mbox file(s). Thus they change only if:
No, they're not arbitrarily changed. But they are changed if you have to rebuild the archives, something that I have had to do several times this past week. And something that I've had to do in the past, and on which I filed a bug report which seems to have disappeared. And something that will be an issue for anyone who follows the instructions in FAQ 3.3, and has to delete an old message in the archives.
Are you referring to <http://sourceforge.net/tracker/index.php?func=detail&aid=1059566&group_id=103&atid=350103>?
1) messages are added to or deleted from the mbox files. This can be avoided by adding only to the end of the last (current) mbox if necessary and, instead of totally deleting messages, just deleting most of the body and perhaps changing the subject. There is a caveat about deleting messages in http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq03.003.htp
At least a couple of times this year we have gotten requests to remove messages from the mailman-users and mailman-developers archives, which would have broken all the existing links over the last couple of years to those lists if we had implemented the instructions in FAQ 3.3.
We're still trying to work out how we can find a way to comply with the requests with regards to the publicly accessible version of the archives we maintain, without actually editing the source mbox files.
I think if you follow the suggestion in the last paragraph of this caveat which I added to http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq03.003.htp last July, you can edit the source mbox without changing the numbering when you rebuild.
CAVEAT: If you delete entire messages from the archive two side effects occur:
Threading may be broken - if C is In-Reply-To: B which is In-Reply-To: A, and B is deleted, C will no longer be threaded with A.
Messages will be renumbered - this may be important if there are saved links to archive messages. Since the message number is part of the URI, the saved link will no longer work or will retrieve the wrong message.
To avoid these problems, instead of deleting the entire message, leave the headers intact and replace the body with "Message deleted" or some other meaningful text.
Please let me know if this is not correct.
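As a rough sketch of what the caveat describes, the following uses Python's standard mailbox module to replace one message's body while leaving its headers and its position in the mbox untouched. It assumes nothing else is writing to the file (work on a copy, or with the list stopped) and that the message is single-part; a multipart message would still carry its old Content-Type header and would need fixing by hand. The helper name blank_message and the demo data are hypothetical.

```python
import mailbox
import os
import tempfile
from email.message import Message


def blank_message(mbox_path, message_id, notice='Message deleted\n'):
    """Replace the body of one message in an mbox, keeping its headers.

    Returns True if a message with that Message-ID was found.
    """
    box = mailbox.mbox(mbox_path)
    found = False
    try:
        for key, msg in box.items():
            if (msg.get('message-id') or '').strip() == message_id:
                msg.set_payload(notice)
                box[key] = msg  # same slot, so read order is unchanged
                found = True
                break
    finally:
        box.close()  # close() also flushes the change to disk
    return found


# Demo on a throwaway mbox with three messages.
path = os.path.join(tempfile.mkdtemp(), 'demo.mbox')
box = mailbox.mbox(path)
for i in range(3):
    m = Message()
    m['From'] = 'someone@example.org'
    m['Message-ID'] = '<%d@example.org>' % i
    m['Subject'] = 'message %d' % i
    m.set_payload('original body %d\n' % i)
    box.add(m)
box.close()

found = blank_message(path, '<1@example.org>')
remaining = [msg['message-id'] for msg in mailbox.mbox(path)]
print(found, remaining)
```

All three messages are still present and in the same order, so a rebuild should assign them the same numbers as before.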
I'd like to try to find a better way to solve this problem once and for all, so that the sequence id would never be changed.
I absolutely agree.
-- Mark Sapiro <msapiro@value.net>
At 5:40 PM -0800 2005-02-05, Mark Sapiro wrote:
Are you referring to <http://sourceforge.net/tracker/index.php?func=detail&aid=1059566&group_id=103&atid=350103>?
Dang. I had searched for it as a bug, and had forgotten that I'd filed it as an RFE instead. Sigh....
To avoid these problems, instead of deleting the entire message, leave the headers intact and replace the body with "Message deleted" or some other meaningful text.
Please let me know if this is not correct.
No, you've got it right.
That would help us on mailman-users and mailman-developers, for the requests that we've gotten to delete some old messages from the archives. Except that there has been a policy decision that we don't want to edit the source mailbox itself; we want that to be kept as a permanent record of what was sent to the list. We're still trying to figure out how we're going to make that work.
However, the problem I had on the other list was that there had been a problem on the system, and I only had February 2005 archives available, with message numbers starting at the beginning. How do you backfill the older archives (from the same source mbox file) without regenerating the message numbers?
I guess if I'd been thinking about it, I could have broken the mbox file into two parts, one that had already been processed and the rest, and avoided blowing away the old archives while importing the previously unprocessed mbox file.
But what do you do afterwards? Do you re-stitch the mbox files together so that you can re-create the archives in the future if there is a catastrophic failure, or do you leave the mbox files permanently separated? If you do leave them permanently separated, where do you leave them and how do you make sure that the right thing happens to the right ones, as new messages are processed?
At the time, the only thing I could figure was to follow the instructions in FAQ 3.3.
-- Brad Knowles, <brad@stop.mail-abuse.org>
Brad Knowles wrote:
But what do you do afterwards? Do you re-stitch the mbox files together so that you can re-create the archives in the future if there is a catastrophic failure, or do you leave the mbox files permanently separated? If you do leave them permanently separated, where do you leave them and how do you make sure that the right thing happens to the right ones, as new messages are processed?
What I did with several lists for which I imported archives from prior versions of the list is this. The current mbox for the Mailman list is <list>.mbox/<list>.mbox. In addition there are older, imported archives in files in the same <list>.mbox/ directory named <list>.mbox/<list>-topica.mbox and <list>.mbox/<list>-yahoo.mbox. Then there are instructions in a Wiki detailing the archive rebuilding process and processing order.
With any luck, whoever has to rebuild the archive will refer to the Wiki.
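One way to make the processing order harder to get wrong than a Wiki page alone is to write it down in a small script. This is only a sketch under assumptions: the archive directory path is a placeholder, the order "topica, then yahoo" is a guess at the original import order, and rebuild_order is a hypothetical helper. It only computes the list of files, with the live mbox always last; feeding them to the archiver in that order is left to whoever does the rebuild.

```python
import os


def rebuild_order(archive_dir, list_name, imported=('topica', 'yahoo')):
    """Return mbox paths in the order they should be fed to the archiver.

    Imported (older) mboxes come first, in the order given; the live
    <list>.mbox/<list>.mbox file always comes last.
    """
    mbox_dir = os.path.join(archive_dir, list_name + '.mbox')
    paths = [os.path.join(mbox_dir, '%s-%s.mbox' % (list_name, name))
             for name in imported]
    paths.append(os.path.join(mbox_dir, list_name + '.mbox'))
    return paths


# Placeholder directory and list name, for illustration only.
for path in rebuild_order('/path/to/archives/private', 'mylist'):
    print(path)
```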
-- Mark Sapiro <msapiro@value.net>