Archives corruption (was: [PythonInfo Wiki] Update of "tftp" by 79.132.252.94)
That's unacceptable. Why there is no "critical issue" about that? Where PSF is looking into? We've broken all web links from manually collected Python knowledge about 6 months ago and nobody cares.
I think PSF is able to organize a dedicated sprint if its not too late to recover links provided that somebody is able describe the problem with Pipermail in sufficient details, so that people with no background can pickup and see what can be done.
There is also closed `python-dev` archive on Google. Perhaps if PSF can figure out who is the owner of this archive (and if there is any content at all) - some linking information still can be recovered.
-- anatoly t.
On Mon, Jul 5, 2010 at 12:56 AM, Paul Boddie <paul@boddie.org.uk> wrote:
On Sunday 04 July 2010 22:41:12 anatoly techtonik wrote:
Are mail archives broken or were reindexed? I see that old reference is now dead and I doubt somebody typed the old page name by hand to make a mistake.
At some point the archives were regenerated and the identifiers in the URLs changed. Take a look at the January archives for python-dev to see some side-effects:
http://mail.python.org/pipermail/python-dev/2010-January/date.html
Here's a thread discussing the problem (or perhaps a related problem):
http://mail.python.org/pipermail/python-dev/2010-January/097388.html
Paul
CC'ing Postmasters because really, this is something they at least need to be aware of, and they are in the best position to help with. Remember, they are dedicated and busy volunteers who keep critical python.org infrastructure humming away mostly care free.
On Jul 05, 2010, at 10:07 AM, anatoly techtonik wrote:
That's unacceptable. Why there is no "critical issue" about that? Where PSF is looking into? We've broken all web links from manually collected Python knowledge about 6 months ago and nobody cares.
I think PSF is able to organize a dedicated sprint if its not too late to recover links provided that somebody is able describe the problem with Pipermail in sufficient details, so that people with no background can pickup and see what can be done.
There is also closed `python-dev` archive on Google. Perhaps if PSF can figure out who is the owner of this archive (and if there is any content at all) - some linking information still can be recovered.
Here's the issue.
Pipermail has never maintained a database between message-ids and the urls. This is true even before Pipermail was bolted into Mailman and that's never changed, despite being high on my wish list for a decade. In any case, the problem occurs because Pipermail messages are numbered sequentially, and there is a difference between generating the archive on the fly (i.e. as messages arrive) and as a regenerated whole. This is complicated by the fact that there was a bug in Mailman years ago that broke the mbox separator so that regens couldn't be done reproducibly. This is why Mailman has a cleanarch script.
The best way to regenerate a clean archive is to take the mbox file, run cleanarch over it, then run 'arch --wipe'. The urls will probably be broken, so if the original urls can be retrieved then I think the easiest way to "fix" them is to write some alias rules for Apache to do permanent redirects to the new urls. This is not a trivial amount of work, which is probably why it hasn't been done yet. Who wants to - and can - volunteer to see this through to the end?
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
-Barry
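
The "alias rules for Apache" idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `old_to_new` mapping and the `redirect_rules` helper are hypothetical (recovering that mapping is the hard part discussed in this thread), and the generated lines use the `Redirect permanent` directive from Apache's mod_alias.

```python
BASE_URL = "http://mail.python.org/pipermail"


def redirect_rules(list_name, old_to_new):
    """Build Apache 'Redirect permanent' lines for one mailing list.

    old_to_new maps an old archive path such as
    '2010-January/097388.html' to its path in the regenerated archive.
    The mapping is hypothetical input; nothing here recovers it.
    """
    rules = []
    for old_path, new_path in sorted(old_to_new.items()):
        if old_path == new_path:
            continue  # URL survived the regeneration, nothing to redirect
        rules.append(
            "Redirect permanent /pipermail/%s/%s %s/%s/%s"
            % (list_name, old_path, BASE_URL, list_name, new_path)
        )
    return rules


# Made-up example data; real values would come from a recovered index.
example = {
    "2010-January/097388.html": "2010-January/094100.html",
    "2010-January/097389.html": "2010-January/097389.html",
}
for line in redirect_rules("python-dev", example):
    print(line)
```

Generating the rules from a table keeps the Apache configuration reproducible if the mapping is later corrected.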
On Tue, Jul 06, 2010 at 10:39:36AM -0400, Barry Warsaw wrote:
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
Coincidentally, LWN has a similar item today: http://lwn.net/SubscriberLink/394660/a94a29378609d81a/
In 2005, the Debian project voted to declassify messages on the debian-private mailing list after a period of three years. That is easier said than done, apparently. The General Resolution (GR) calls for volunteers to do the work of declassification, and few Debian Developers seem eager to do the work required to make it happen.
They can't find volunteers either, because tidying historical data is much less useful than forward-looking tasks.
--amk
On Tue, Jul 6, 2010 at 5:39 PM, Barry Warsaw <barry@python.org> wrote:
Here's the issue.
Pipermail has never maintained a database between message-ids and the urls. This is true even before Pipermail was bolted into Mailman and that's never changed, despite being high on my wish list for a decade. In any case, the problem occurs because Pipermail messages are numbered sequentially, and there is a difference between generating the archive on the fly (i.e. as messages arrive) and as a regenerated whole. This is complicated by the fact that there was a bug in Mailman years ago that broke the mbox separator so that regens couldn't be done reproducibly. This is why Mailman has a cleanarch script.
So, the bug is fixed, but archives still need to be repaired with `cleanarch` script.
1. Is that right?
2. If the bug is fixed - how come that Python archives become corrupted?
3. If they were not corrupted - why they were regenerated?
The best way to regenerate a clean archive is to take the mbox file, run cleanarch over it, then run 'arch --wipe'. The urls will probably be broken, so if the original urls can be retrieved then I think the easiest way to "fix" them is to write some alias rules for Apache to do permanent redirects to the new urls. This is not a trivial amount of work, which is probably why it hasn't been done yet. Who wants to - and can - volunteer to see this through to the end?
We need to clearly define the problem first.
Given:
- mbox file - some kind of binary file with messages inside
- archive site served by Apache with some content
  1. What is this site?
  2. How .html files are generated (statically/dynamically)?
  3. What are linking rules?
  4. What are name generating formulas?
- de-facto information loss
- symptoms - broken URL links, broken thread chains
Need to find out:
- source of information loss
- if the recovery is possible
- recovery scenarios
Finally:
- implement recovery scenario
- run implementation
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It will be expensive to get me for the whole project. The only thing I can promise is to put some effort into this specific data transformation scenario. -- anatoly t.
anatoly techtonik wrote:
On Tue, Jul 6, 2010 at 5:39 PM, Barry Warsaw <barry@python.org> wrote:
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It will be expensive to get me for the whole project. The only thing I can promise is to put some effort into this specific data transformation scenario.
Perhaps you misunderstand the meaning of the term "volunteer".
regards
Steve
On 06/07/2010 18:51, Steve Holden wrote:
[snip..]
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It will be expensive to get me for the whole project. The only thing I can promise is to put some effort into this specific data transformation scenario.
Perhaps you misunderstand the meaning of the term "volunteer".
That is clear beyond doubt... Michael
regards Steve
-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
On Tue, Jul 6, 2010 at 9:16 PM, Michael Foord <mfoord@python.org> wrote:
[snip..]
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It will be expensive to get me for the whole project. The only thing I can promise is to put some effort into this specific data transformation scenario.
Perhaps you misunderstand the meaning of the term "volunteer".
That is clear beyond doubt...
In Russian words "have" and "own" are the same. Nobody wants to be pwnd. That's the meaning. -- anatoly t.
On 06.07.2010 23:55, anatoly techtonik wrote:
On Tue, Jul 6, 2010 at 9:16 PM, Michael Foord <mfoord@python.org> wrote:
[snip..]
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It will be expensive to get me for the whole project. The only thing I can promise is to put some effort into this specific data transformation scenario.
Perhaps you misunderstand the meaning of the term "volunteer".
That is clear beyond doubt...
In Russian words "have" and "own" are the same. Nobody wants to be pwnd. That's the meaning.
I'm afraid that this explanation doesn't make things any clearer -- for me at least...
Georg
On Wednesday 07 July 2010 00:49:11 Georg Brandl wrote:
I'm afraid that this explanation doesn't make things any clearer -- for me at least...
I would guess that what Anatoly is saying is that he doesn't want to be fully committed to the Mailman project or something, and that he has to earn a living doing other stuff, but then don't we all?
(T-shirt suggestion: Pwned by Python?)
I've noticed that the archive numbering for the problematic python-list archives does start at 000000 and 000001 but then skips around to 619310 and 627807:
http://mail.python.org/pipermail/python-list/1999-February/date.html
There's a good mixture of various ranges in subsequent months.
I've been looking at the Mailman code and the Mailman.Archiver code in particular, although I'm still not sure whether it makes sense to take the gzipped archives from mail.python.org and try and process them in some way.
Any suggestions?
Paul
* Paul Boddie <paul@boddie.org.uk>:
On Wednesday 07 July 2010 00:49:11 Georg Brandl wrote:
I'm afraid that this explanation doesn't make things any clearer -- for me at least...
I would guess that what Anatoly is saying is that he doesn't want to be fully committed to the Mailman project or something, and that he has to earn a living doing other stuff, but then don't we all?
(T-shirt suggestion: Pwned by Python?)
It could be misunderstood :)
-- Ralf Hildebrandt
On Wed, Jul 7, 2010 at 2:12 AM, Paul Boddie <paul@boddie.org.uk> wrote:
(T-shirt suggestion: Pwned by Python?)
(Pthnd)
I've noticed that the archive numbering for the problematic python-list archives does start at 000000 and 000001 but then skips around to 619310 and 627807:
http://mail.python.org/pipermail/python-list/1999-February/date.html
There's a good mixture of various ranges in subsequent months.
I've been looking at the Mailman code and the Mailman.Archiver code in particular, although I'm still not sure whether it makes sense to take the gzipped archives from mail.python.org and try and process them in some way.
Any suggestions?
Before anything else: Is Pipermail a separate project from Mailman? Where to read about it? Search does nothing.
If I understand correctly, the messages in mbox are stored in the order they were received.
What about URL generation? Logically I would make site generator that reads one message at a time and assigns message number sequentially according to message order. Then it should analyze timestamp and thread linking attributes to understand where to put the messages. As it probably can not generate html incrementally (like inserting message that arrived later into the middle of thread html page) - it need to build some indexes.
So some possible cases to test:
1. mbox somehow got sorted in different order
   [ ] get some mbox'es from backups and compare them
2. message counter overflow happened while building indexes
   [ ] check serialization/deserialization logic for message counter
   [ ] grep places where it is used
3. index limit overflow
   [ ] check limits for max messages per month/year/thread/mbox/ ... / anything else?
We need to research algorithm how site generator builds indexes, sorts messages before processing and constructs indexes. But first there must be a sanity check that mbox files are intact.
How can I quickstart with toolchain for converting archive? Can anybody send some initial data - mbox, point to generated site, the exact versions of installed toolchain and ensure me that 'diff' with actual downloaded versions of this toolchain is empty?
-- anatoly t.
On Jul 07, 2010, at 12:16 PM, anatoly techtonik wrote:
Before anything else: Is Pipermail a separate project from Mailman?
It used to be, but it was pulled into Mailman and bolted on sometime before the 1.0 release. It ceased being a separate project at that time.
Where to read about it? Search does nothing.
UTSL.
If I understand correctly, the messages in mbox are stored in the order they were received.
Correct. http://en.wikipedia.org/wiki/Mbox
What about URL generation? Logically I would make site generator that reads one message at a time and assigns message number sequentially according to message order. Then it should analyze timestamp and thread linking attributes to understand where to put the messages. As it probably can not generate html incrementally (like inserting message that arrived later into the middle of thread html page) - it need to build some indexes.
These are stored on disk as pickles.
So some possible cases to test: 1. mbox somehow got sorted in different order [ ] get some mbox'es from backups and compare them 2. message counter overflow happened while building indexes [ ] check serialization/deserialization logic for message counter [ ] grep places where it is used 3. index limit overflow [ ] check limits for max messages per month/year/thread/mbox/ ... / anything else?
We need to research algorithm how site generator builds indexes, sorts messages before processing and constructs indexes. But first there must be a sanity check that mbox files are intact.
I don't think we modified the mbox files, perhaps other than to cleanarch them. At least I don't remember doing anything like that. Theoretically, if the message sequences in the mbox file were identical to the on-the-fly generation of the html, then the sequence numbers should be the same too. The problem is that cleanarch relies on heuristics which can sometimes be incorrect. I'm also not sure whether cleanarch was run on the mbox file before the regen occurred.
How can I quickstart with toolchain for converting archive? Can anybody send some initial data - mbox, point to generated site, the exact versions of installed toolchain and ensure me that 'diff' with actual downloaded versions of this toolchain is empty?
We could make a tar of the entire private archive directory, which probably includes all the raw data you need. If anybody objects to making this available to anatoly, please let me know. -Barry
On Wed, Jul 07, 2010, Barry Warsaw wrote:
On Jul 07, 2010, at 12:16 PM, anatoly techtonik wrote:
Where to read about it? Search does nothing.
UTSL.
Just for the record because I'm unfond of obscure acronyms in serious responses: "Use The Source, Luke"
Now, if *that* isn't familiar, JFGI. ;-)
-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
"If you don't know what your program is supposed to do, you'd better not start writing it." --Dijkstra
On 07/07/2010 16:24, Aahz wrote:
On Wed, Jul 07, 2010, Barry Warsaw wrote:
On Jul 07, 2010, at 12:16 PM, anatoly techtonik wrote:
Where to read about it? Search does nothing.
UTSL.
Just for the record because I'm unfond of obscure acronyms in serious responses: "Use The Source, Luke"
Now, if *that* isn't familiar, JFGI. ;-)
"Just ForGet it"?
Michael
On Jul 07, 2010, at 01:12 AM, Paul Boddie wrote:
I've been looking at the Mailman code and the Mailman.Archiver code in particular, although I'm still not sure whether it makes sense to take the gzipped archives from mail.python.org and try and process them in some way.
Probably not by itself, since the message-ids are not embedded in the html.
I think you'll want a tar of the private archives directory, so that you can unpack the various pickles to try to work out which message-ids are assigned to which sequence numbers. The problem with that of course is that with a regenerated archive, those mappings won't be correct any more.
Maybe if we knew when the regen occurred, we could get some backups and try to reverse engineer those mappings.
yeah-it-sucks-ly y'rs,
-Barry
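
The message-id to sequence-number correspondence mentioned above could be approximated from an mbox file along these lines. This is a minimal sketch, not Pipermail's code: it assumes numbers are handed out one per message in file order from some starting value, which the thread suggests is not guaranteed for old archives. The helper names `number_messages` and `renumbering` are made up for illustration; only the standard library `mailbox` module is used.

```python
import mailbox


def number_messages(mbox_path, start=0):
    """Map Message-ID -> sequence number, numbering in mbox file order.

    Assumption: one number per message, assigned sequentially from
    'start'. Messages without a Message-ID still consume a number.
    """
    mapping = {}
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for offset, message in enumerate(box):
            msgid = message.get("Message-ID")
            if msgid is None:
                continue
            msgid = msgid.strip()
            # Keep the first occurrence if a Message-ID is repeated.
            mapping.setdefault(msgid, start + offset)
    finally:
        box.close()
    return mapping


def renumbering(old_mapping, new_mapping):
    """Pair old and new sequence numbers for Message-IDs found in both."""
    return {
        old_mapping[msgid]: new_mapping[msgid]
        for msgid in old_mapping
        if msgid in new_mapping and old_mapping[msgid] != new_mapping[msgid]
    }
```

Running `number_messages` over a backup mbox and over the current one, then feeding both results to `renumbering`, would give a candidate old-to-new table that still needs checking against known URLs.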
On Wed, Jul 07, 2010 at 09:36:20AM -0400, Barry Warsaw wrote:
Probably not by itself, since the message-ids are not embedded in the html. I think you'll want a tar of the private archives directory, so that you can unpack the various pickles to try to work out which message-ids are assigned to which sequence numbers. The problem with that of course is that with a regenerated archive, those mappings won't be correct any more.
Note that the internal threading IDs *are* embedded in the HTML for thread indexes:
    <!--0 01277935270- -->
    <LI><A HREF="101252.html">[Python-Dev] OS X buildbots: why am I skipping these tests? </A><A NAME="101252"> </A> <I>"Martin v. Löwis" </I>
    <UL>
    <!--1 01277935270-01277935581- -->
    <LI><A HREF="101253.html">[Python-Dev] OS X buildbots: why am I skipping these tests? </A><A NAME="101253"> </A> <I>Brett Cannon </I>
When quoting e-mails, Linux Weekly News includes the entire e-mail in their CMS. Maybe something similar could be done for PEPs, providing a way to store and attach the entire e-mail giving a decision.
--amk
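
The thread-index markup quoted above could be mined with a small parser like the one below. It is a sketch that assumes every entry follows exactly the shape shown in that excerpt (a comment holding a depth and a thread key, followed by the message link); the regular expression and the `parse_thread_index` name are illustrative, not part of Pipermail.

```python
import re

# One thread-index entry: "<!--DEPTH KEY -->" then the link to the message.
ENTRY = re.compile(
    r'<!--(?P<depth>\d+) (?P<key>[\d-]+) -->\s*'
    r'<LI><A HREF="(?P<seq>\d+)\.html">(?P<subject>.*?)</A>',
    re.DOTALL,
)


def parse_thread_index(html):
    """Return (depth, thread_key, sequence_number, subject) tuples."""
    return [
        (
            int(match.group("depth")),
            match.group("key"),
            int(match.group("seq")),
            " ".join(match.group("subject").split()),
        )
        for match in ENTRY.finditer(html)
    ]


sample = (
    '<!--0 01277935270- -->\n'
    '<LI><A HREF="101252.html">[Python-Dev] OS X buildbots: why am I '
    'skipping these tests? </A><A NAME="101252"> </A>\n'
    '<UL>\n'
    '<!--1 01277935270-01277935581- -->\n'
    '<LI><A HREF="101253.html">[Python-Dev] OS X buildbots: why am I '
    'skipping these tests? </A><A NAME="101253"> </A>\n'
)
for entry in parse_thread_index(sample):
    print(entry)
```

Collecting these tuples from a backup copy of the HTML and from the regenerated copy would give two tables that can be joined on the thread key, provided the keys themselves did not change in the regeneration.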
On Wednesday 07 July 2010 15:36:20 Barry Warsaw wrote:
On Jul 07, 2010, at 01:12 AM, Paul Boddie wrote:
I've been looking at the Mailman code and the Mailman.Archiver code in particular, although I'm still not sure whether it makes sense to take the gzipped archives from mail.python.org and try and process them in some way.
Probably not by itself, since the message-ids are not embedded in the html.
I was thinking of the gzipped archives linked to from the "list archives" page, which gives plain text mailbox files (the "Downloadable version"): http://mail.python.org/pipermail/python-list/ But I think you're ahead of me here...
I think you'll want a tar of the private archives directory, so that you can unpack the various pickles to try to work out which message-ids are assigned to which sequence numbers. The problem with that of course is that with a regenerated archive, those mappings won't be correct any more.
I was sort of hoping that just getting the mailbox archives and running pipermail (in some form) over them would give HTML archives with correct sequence numbers, given a suitable starting value, but I guess the various guarantees to make this feasible are just absent. For example, the ordering of the messages in the mailbox files could be different from the original processing order, and there may have been some HTML archiving of older messages after newer ones, and so on. So, yes, it may be necessary to reverse engineer the correspondence between Message-Id (or something) and sequence number, as you say...
Maybe if we knew when the regen occurred, we could get some backups and try to reverse engineer those mappings.
The problem was first noticed in January 2010, I think. Paul
On 07/06/2010 05:49 PM, Georg Brandl wrote:
On 06.07.2010 23:55, anatoly techtonik wrote:
On Tue, Jul 6, 2010 at 9:16 PM, Michael Foord <mfoord@python.org> wrote:
Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
In Russian words "have" and "own" are the same. Nobody wants to be pwnd. That's the meaning.
I'm afraid that this explanation doesn't make things any clearer -- for me at least...
Georg, the "have" is referring to the word used by Michael, as in: "If you're interested in helping, the Mailman project would love to -own- you." with a (perfectly natural) reaction of, "hey, I'm willing to contribute but -not- to take on this big task all by myself!"
Basically it's much ado about nothing, a language misunderstanding by all. Let's move on.
-Jeff
anatoly techtonik wrote:
On Tue, Jul 6, 2010 at 5:39 PM, Barry Warsaw <barry@python.org> wrote:
Here's the issue.
Pipermail has never maintained a database between message-ids and the urls. This is true even before Pipermail was bolted into Mailman and that's never changed, despite being high on my wish list for a decade. In any case, the problem occurs because Pipermail messages are numbered sequentially, and there is a difference between generating the archive on the fly (i.e. as messages arrive) and as a regenerated whole. This is complicated by the fact that there was a bug in Mailman years ago that broke the mbox separator so that regens couldn't be done reproducibly. This is why Mailman has a cleanarch script.
So, the bug is fixed, but archives still need to be repaired with `cleanarch` script. 1. Is that right? 2. If the bug is fixed - how come that Python archives become corrupted? 3. If they were not corrupted - why they were regenerated?
Note that for some Python email lists the mbox file contains messages for which we received and acted on take down notices. I'm fairly sure we removed the message from the html archive but not from the mbox. So regenerating from those will cause re-posting of those messages. Perhaps not a big deal (it's years ago) but thought I'd mention it.
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It will be expensive to get me for the whole project. The only thing I can promise is to put some effort into this specific data transformation scenario.
"Volunteer" means doing it for free as a service to the community. What is needed is actual help not attempts to flog or embarrass others into doing work that you think is important. I think many agree it's a problem but frankly it's a bit odd that in response to a call for volunteers to fix the problem you essentially say "I can help but it'll cost you a lot". Hopefully I'm just misunderstanding your email. - Stephan
On Jul 06, 2010, at 08:20 PM, anatoly techtonik wrote:
So, the bug is fixed, but archives still need to be repaired with `cleanarch` script. 1. Is that right?
Yes.
2. If the bug is fixed - how come that Python archives become corrupted? 3. If they were not corrupted - why they were regenerated?
Pipermail processes messages one at a time for on-the-fly archive updates. These were never affected. Mailman concatenates the messages into an mbox file, and the since-fixed bug broke the de-facto mbox standard message delimiter. While this was fixed in Mailman, old mbox files could still have messages separated by the bogus delimiter. Thus wiping the html archive and regenerating it from those mbox files could produce incorrect archives.
cleanarch uses heuristics to find those broken delimiters and rewrites a new mbox file with the fixed delimiters.
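
To illustrate why the delimiter matters: in an mbox file a line starting with "From " begins a new message, so a body line that happens to start the same way splits one message into two and shifts every later sequence number. The sketch below shows one possible heuristic for telling the two apart. It is not cleanarch's actual code; the `SEPARATOR` pattern and the `escape_bogus_from` name are assumptions made for this example.

```python
import re

# A plausible separator: "From addr Tue Jul  6 10:39:36 2010".
# This pattern is an assumption for illustration only.
SEPARATOR = re.compile(
    r"^From \S+\s+"
    r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ \d]\d \d\d:\d\d:\d\d \d{4}\s*$"
)


def escape_bogus_from(lines):
    """Yield lines, prefixing '>' to 'From ' lines that are not separators."""
    for line in lines:
        if line.startswith("From ") and not SEPARATOR.match(line):
            yield ">" + line  # body text, must not start a new message
        else:
            yield line


mbox_lines = [
    "From a@example.org Tue Jul  6 10:39:36 2010\n",
    "Subject: regen\n",
    "\n",
    "From what I can tell the archive was regenerated.\n",
]
for line in escape_bogus_from(mbox_lines):
    print(line, end="")
```

Because any such test is a heuristic, a real separator in an unusual format would be escaped by mistake, which is the kind of error that makes regenerated numbering hard to trust.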
Given: - mbox file - some kind or binary file with messages inside - archive site served by Apache with some content 1. What is this site? 2. How .html files are generated (statically/dynamically)? 3. What are linking rules? 4. What are name generating formulas? - de-facto information loss - symptoms - broken URL links, broken thread chains
Need to find out: - source of information loss - if the recovery is possible - recovery scenarios
Finally: - implement recovery scenario - run implementation
I don't really have time to dive into all these details. Pipermail is free software and folks on mailman-users or mailman-developers can provide lots of good help. -Barry
On Tuesday 06 July 2010 16:39:36 Barry Warsaw wrote:
Here's the issue.
Pipermail has never maintained a database between message-ids and the urls. This is true even before Pipermail was bolted into Mailman and that's never changed, despite being high on my wish list for a decade. In any case, the problem occurs because Pipermail messages are numbered sequentially, and there is a difference between generating the archive on the fly (i.e. as messages arrive) and as a regenerated whole. This is complicated by the fact that there was a bug in Mailman years ago that broke the mbox separator so that regens couldn't be done reproducibly. This is why Mailman has a cleanarch script.
Thanks for the summary! I knew it had something to do with that thread I referenced, but I didn't really put all the pieces together.
The best way to regenerate a clean archive is to take the mbox file, run cleanarch over it, then run 'arch --wipe'. The urls will probably be broken, so if the original urls can be retrieved then I think the easiest way to "fix" them is to write some alias rules for Apache to do permanent redirects to the new urls. This is not a trivial amount of work, which is probably why it hasn't been done yet. Who wants to - and can - volunteer to see this through to the end?
Is it not possible to get an old version of Mailman to generate archives which presumably have the same traits as those previously generated, record the identifier to Message-Id (or other "anchoring" property) correspondence, and then relabel the messages in the fixed archives with the old identifiers? Or does the whole "as messages arrive" thing completely prevent any possibility of reproducing the correct archived message ordering?
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It sounds like a fun project, and I'm tempted, but I also have a fair amount of other stuff to do right now, including writing a talk for EuroPython, although this might make for some interesting material for that talk. ;-) Paul
Trying to catch up on some old threads... On Jul 06, 2010, at 09:32 PM, Paul Boddie wrote:
The best way to regenerate a clean archive is to take the mbox file, run cleanarch over it, then run 'arch --wipe'. The urls will probably be broken, so if the original urls can be retrieved then I think the easiest way to "fix" them is to write some alias rules for Apache to do permanent redirects to the new urls. This is not a trivial amount of work, which is probably why it hasn't been done yet. Who wants to - and can - volunteer to see this through to the end?
Is it not possible to get an old version of Mailman to generate archives which presumably have the same traits as those previously generated, record the identifier to Message-Id (or other "anchoring" property) correspondence, and then relabel the messages in the fixed archives with the old identifiers? Or does the whole "as messages arrive" thing completely prevent any possibility of reproducing the correct archived message ordering?
I think it will be problematic with an archive as old as python-dev. It's worth a shot of course <wink>, but python-dev's mbox definitely spans the problematic region and cleanarch is just a heuristic. The on-demand archive generation is different enough (even though it uses much of the same code path) that I'm not positive it's stable even without the mbox bug, and it will be difficult to verify. I just don't have the cycles to do much testing of this, but I'll answer questions for anyone who does.
On a second level, I've been searching for volunteers to fix this wart in Pipermail for at least a decade. No one's stepped forward so far. If you're interested in helping, the Mailman project would love to have you.
It sounds like a fun project, and I'm tempted, but I also have a fair amount of other stuff to do right now, including writing a talk for EuroPython, although this might make for some interesting material for that talk. ;-)
Hope that went well! -Barry
participants (12)
- A.M. Kuchling
- Aahz
- anatoly techtonik
- Barry Warsaw
- Dirkjan Ochtman
- Georg Brandl
- Jeff Rush
- Michael Foord
- Paul Boddie
- Ralf Hildebrandt
- Stephan Deibel
- Steve Holden