[pydotorg-www] Archives corruption (was: [PythonInfo Wiki] Update of "tftp" by 79.132.252.94)

anatoly techtonik techtonik at gmail.com
Tue Jul 6 19:20:14 CEST 2010


On Tue, Jul 6, 2010 at 5:39 PM, Barry Warsaw <barry at python.org> wrote:
>
> Here's the issue.
>
> Pipermail has never maintained a database between message-ids and the urls.
> This is true even before Pipermail was bolted into Mailman and that's never
> changed, despite being high on my wish list for a decade.  In any case, the
> problem occurs because Pipermail messages are numbered sequentially, and there
> is a difference between generating the archive on the fly (i.e. as messages
> arrive) and as a regenerated whole.  This is complicated by the fact that
> there was a bug in Mailman years ago that broke the mbox separator so that
> regens couldn't be done reproducibly.   This is why Mailman has a cleanarch
> script.
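
To make the renumbering problem concrete, here is a minimal sketch of how one unescaped "From " line in a message body can shift every later sequential number during a regen. The splitting rule, the sample messages, and the helper names (`split_naively`, `subject_numbers`) are assumptions for illustration only; this is not Pipermail's actual code.

```python
# Illustration only: a simplified splitter, not Pipermail's real parser.

def split_naively(mbox_text):
    """Split on every line starting with 'From ', as a regen would."""
    chunks = []
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            chunks.append([])          # a new "message" starts here
        elif chunks:
            chunks[-1].append(line)
    return chunks


def subject_numbers(mbox_text):
    """Map each Subject to the sequential number its message would get."""
    result = {}
    for number, chunk in enumerate(split_naively(mbox_text)):
        for line in chunk:
            if line.startswith("Subject: "):
                result[line[len("Subject: "):]] = number
                break
    return result


# Hypothetical two-message mbox with a properly escaped body line.
CLEAN = """From a@example.org Tue Jul  6 10:00:00 2010
Subject: first

>From here on this is body text.

From b@example.org Tue Jul  6 11:00:00 2010
Subject: second

body
"""

# Same content, but the body line lost its '>' escape.
BROKEN = CLEAN.replace(">From here", "From here")

print(subject_numbers(CLEAN))
print(subject_numbers(BROKEN))
```

In the broken variant the body line is counted as a message of its own, so "second" gets a different number and any URL derived from that number changes.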

So, the bug is fixed, but the archives still need to be repaired with
the `cleanarch` script.
1. Is that right?
2. If the bug is fixed, how did the Python archives become corrupted?
3. If they were not corrupted, why were they regenerated?

> The best way to regenerate a clean archive is to take the mbox file, run
> cleanarch over it, then run 'arch --wipe'.  The urls will probably be broken,
> so if the original urls can be retrieved then I think the easiest way to "fix"
> them is to write some alias rules for Apache to do permanent redirects to the
> new urls.  This is not a trivial amount of work, which is probably why it
> hasn't been done yet.  Who wants to - and can - volunteer to see this through
> to the end?
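
If the old and new URLs can be paired up, the redirect rules themselves could be generated rather than written by hand. A minimal sketch, assuming the pairing already exists as a dictionary; the paths, the host `mail.example.org`, and the function name `redirect_rules` are hypothetical, and the output uses the `Redirect permanent` directive from Apache's mod_alias.

```python
def redirect_rules(old_to_new, base_url):
    """Build one 'Redirect permanent' line per URL path that moved.

    old_to_new: dict of old URL path -> new URL path (hypothetical input).
    base_url:   scheme and host prepended to the target (hypothetical).
    """
    rules = []
    for old, new in sorted(old_to_new.items()):
        if old != new:                      # unchanged URLs need no rule
            rules.append(f"Redirect permanent {old} {base_url}{new}")
    return rules


# Made-up example mapping; real paths would come from the archive itself.
mapping = {
    "/pipermail/somelist/2010-July/000001.html":
        "/pipermail/somelist/2010-July/000001.html",
    "/pipermail/somelist/2010-July/000002.html":
        "/pipermail/somelist/2010-July/000003.html",
}

for rule in redirect_rules(mapping, "http://mail.example.org"):
    print(rule)
```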

We need to clearly define the problem first.

Given:
- mbox file - some kind of binary file with messages inside
- archive site served by Apache with some content
    1. What is this site?
    2. How are the .html files generated (statically/dynamically)?
    3. What are the linking rules?
    4. What are the name-generating formulas?
- de-facto information loss - symptoms - broken URL links, broken thread chains

Need to find out:
- source of information loss
- if the recovery is possible
- recovery scenarios

Finally:
- implement recovery scenario
- run implementation
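
As a possible first step towards a recovery scenario, the mbox could be indexed by Message-ID with Python's standard `mailbox` module. This is only a sketch: the function name `index_message_ids` is made up, and it assumes that archive numbers follow the order of messages in the mbox and start at zero, which still has to be checked against the real archive.

```python
import mailbox


def index_message_ids(mbox_path):
    """Return {Message-ID: position of the message in the mbox}.

    Assumption: position in the mbox corresponds to the sequential
    archive number. Messages without a Message-ID header are skipped.
    """
    index = {}
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for position, message in enumerate(box):
            message_id = message["Message-ID"]
            if message_id is not None and message_id not in index:
                index[message_id] = position   # keep the first occurrence
    finally:
        box.close()
    return index
```

Building such an index once for the old mbox and once for the cleaned one would show which messages changed position, and that difference is the input for the redirects.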

> On a second level, I've been searching for volunteers to fix this wart in
> Pipermail for at least a decade.  No one's stepped forward so far.  If you're
> interested in helping, the Mailman project would love to have you.

It will be expensive to get me for the whole project. The only thing I
can promise is to put some effort into this specific data
transformation scenario.

-- 
anatoly t.

