[pydotorg-www] Archives corruption (was: [PythonInfo Wiki] Update of "tftp" by

Steve Holden steve at holdenweb.com
Tue Jul 6 19:51:25 CEST 2010

anatoly techtonik wrote:
> On Tue, Jul 6, 2010 at 5:39 PM, Barry Warsaw <barry at python.org> wrote:
>> Here's the issue.
>> Pipermail has never maintained a database between message-ids and the urls.
>> This is true even before Pipermail was bolted into Mailman and that's never
>> changed, despite being high on my wish list for a decade.  In any case, the
>> problem occurs because Pipermail messages are numbered sequentially, and there
>> is a difference between generating the archive on the fly (i.e. as messages
>> arrive) and as a regenerated whole.  This is complicated by the fact that
>> there was a bug in Mailman years ago that broke the mbox separator so that
>> regens couldn't be done reproducibly.   This is why Mailman has a cleanarch
>> script.
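
The renumbering effect described above can be illustrated with a small sketch. It is not Pipermail's code: `number_messages` is a hypothetical helper, the six-digit file names are only modeled on typical Pipermail URLs, and real mbox parsing is stricter than matching lines that start with "From ". It shows how a single unescaped body line shifts the number of every later message.

```python
def number_messages(mbox_text):
    """Map each Message-ID to a sequential file name (simplified model)."""
    numbering = {}
    seq = -1  # incremented at every separator, so the first message is 0
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            # Treated as an mbox separator: a new message starts here.
            seq += 1
        elif line.lower().startswith("message-id:"):
            msgid = line.split(":", 1)[1].strip()
            numbering[msgid] = "%06d.html" % seq
    return numbering


# Two messages; the body line of the first is correctly escaped as ">From".
clean = (
    "From a@example.com Tue Jul  6 10:00:00 2010\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    ">From here on it is body text\n"
    "\n"
    "From b@example.com Tue Jul  6 11:00:00 2010\n"
    "Message-ID: <2@example.com>\n"
    "\n"
    "Reply\n"
)

# Same mailbox, but the escape is missing, so the body line looks like a separator.
broken = clean.replace(">From here", "From here")

print(number_messages(clean))
print(number_messages(broken))
```

In the second call the unescaped line is counted as an extra message, so the reply gets a different number than it does in the clean mailbox, and any URL that pointed at the old number now points somewhere else.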
> So, the bug is fixed, but the archives still need to be repaired with
> the `cleanarch` script.
> 1. Is that right?
> 2. If the bug is fixed, how did the Python archives become corrupted?
> 3. If they were not corrupted, why were they regenerated?
>> The best way to regenerate a clean archive is to take the mbox file, run
>> cleanarch over it, then run 'arch --wipe'.  The urls will probably be broken,
>> so if the original urls can be retrieved then I think the easiest way to "fix"
>> them is to write some alias rules for Apache to do permanent redirects to the
>> new urls.  This is not a trivial amount of work, which is probably why it
>> hasn't been done yet.  Who wants to - and can - volunteer to see this through
>> to the end?
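
The redirect idea above could be sketched roughly as follows. This assumes that a Message-ID to URL-path mapping can be recovered for both the old and the regenerated archive, which the thread does not confirm. `redirect_rules`, the paths, and the host name are hypothetical; `Redirect permanent` is the mod_alias directive for permanent redirects.

```python
def redirect_rules(old_paths, new_paths, base_url):
    """Build Apache 'Redirect permanent' lines from old to new archive URLs.

    old_paths and new_paths map Message-ID -> URL path (assumed inputs).
    """
    rules = []
    for msgid, old in sorted(old_paths.items()):
        new = new_paths.get(msgid)
        if new is None or new == old:
            # Message missing from the new archive, or URL unchanged.
            continue
        rules.append("Redirect permanent %s %s%s" % (old, base_url, new))
    return rules


# Hypothetical mappings; the new archive lives under a different prefix.
old = {
    "<1@example.com>": "/pipermail/example/2010-July/000000.html",
    "<2@example.com>": "/pipermail/example/2010-July/000002.html",
}
new = {
    "<1@example.com>": "/archive/example/2010-July/000000.html",
    "<2@example.com>": "/archive/example/2010-July/000001.html",
}

for rule in redirect_rules(old, new, "http://mail.example.org"):
    print(rule)
```

The example serves the regenerated archive under a different prefix on purpose. If old and new URLs share the same paths, a redirect for one old URL can land on a path that is itself redirected, and the old number may now belong to a different message.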
> We need to clearly define the problem first.
> Given:
> - mbox file - some kind of binary file with messages inside
> - archive site served by Apache with some content
>     1. What is this site?
>     2. How are the .html files generated (statically/dynamically)?
>     3. What are the linking rules?
>     4. What are the name-generation formulas?
> - de-facto information loss - symptoms - broken URL links, broken thread chains
> Need to find out:
> - source of information loss
> - if the recovery is possible
> - recovery scenarios
> Finally:
> - implement recovery scenario
> - run implementation
>> On a second level, I've been searching for volunteers to fix this wart in
>> Pipermail for at least a decade.  No one's stepped forward so far.  If you're
>> interested in helping, the Mailman project would love to have you.
> It will be expensive to get me for the whole project. The only thing I
> can promise is to put some effort into this specific data
> transformation scenario.
Perhaps you misunderstand the meaning of the term "volunteer".

Steve Holden           +1 571 484 6266   +1 800 494 3119
DjangoCon US September 7-9, 2010    http://djangocon.us/
See Python Video!       http://python.mirocommunity.org/
Holden Web LLC                 http://www.holdenweb.com/
