[pydotorg-www] Archives corruption (was: [PythonInfo Wiki] Update of "tftp" by 184.108.40.206)
steve at holdenweb.com
Tue Jul 6 19:51:25 CEST 2010
anatoly techtonik wrote:
> On Tue, Jul 6, 2010 at 5:39 PM, Barry Warsaw <barry at python.org> wrote:
>> Here's the issue.
>> Pipermail has never maintained a database between message-ids and the urls.
>> This is true even before Pipermail was bolted into Mailman and that's never
>> changed, despite being high on my wish list for a decade. In any case, the
>> problem occurs because Pipermail messages are numbered sequentially, and there
>> is a difference between generating the archive on the fly (i.e. as messages
>> arrive) and as a regenerated whole. This is complicated by the fact that
>> there was a bug in Mailman years ago that broke the mbox separator so that
>> regens couldn't be done reproducibly. This is why Mailman has a cleanarch
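To make the numbering problem concrete, here is a minimal sketch of one way such a drift could arise, assuming a body line that happens to start with "From " is not escaped in the mbox. The sample mbox, the six-digit file names, and the helper names `split_mbox` and `numbering` are all made up for illustration; this is not Pipermail's or cleanarch's actual code.

```python
import re

# Hypothetical three-message mbox; the second body contains an unescaped "From " line.
MBOX = """\
From alice@example.org Tue Jul  6 10:00:00 2010
Message-ID: <a@example.org>

First message.
From bob@example.org Tue Jul  6 11:00:00 2010
Message-ID: <b@example.org>

Second message.
From what I can tell, this line is plain body text.
From carol@example.org Tue Jul  6 12:00:00 2010
Message-ID: <c@example.org>

Third message.
"""

# Stricter separator test (an assumption): "From <sender> <weekday> <month> <day> <time> <year>".
STRICT = re.compile(r"^From \S+ +\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")


def split_mbox(text, is_separator):
    """Split mbox text into chunks, starting a new chunk at every separator line."""
    chunks, current = [], []
    for line in text.splitlines():
        if is_separator(line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def numbering(chunks):
    """Map each chunk's Message-ID (or a fragment label) to a sequential file name."""
    table = {}
    for index, chunk in enumerate(chunks):
        match = re.search(r"^Message-ID: (\S+)$", chunk, re.MULTILINE)
        key = match.group(1) if match else "(fragment %d)" % index
        table[key] = "%06d.html" % index
    return table


strict = numbering(split_mbox(MBOX, STRICT.match))
naive = numbering(split_mbox(MBOX, lambda line: line.startswith("From ")))

for message_id in sorted(strict):
    print(message_id, strict[message_id], naive[message_id])
```

With the naive splitter the stray body line becomes a chunk of its own, so every later message is numbered one higher than it was, and any link that pointed at the old number now lands on a different message.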
> So, the bug is fixed, but archives still need to be repaired with the
> `cleanarch` script.
> 1. Is that right?
> 2. If the bug is fixed, how did the Python archives become corrupted?
> 3. If they were not corrupted, why were they regenerated?
>> The best way to regenerate a clean archive is to take the mbox file, run
>> cleanarch over it, then run 'arch --wipe'. The urls will probably be broken,
>> so if the original urls can be retrieved then I think the easiest way to "fix"
>> them is to write some alias rules for Apache to do permanent redirects to the
>> new urls. This is not a trivial amount of work, which is probably why it
>> hasn't been done yet. Who wants to - and can - volunteer to see this through
>> to the end?
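As a rough sketch of the redirect step, assuming the old and new URL paths have somehow been recovered for each message-id (which, as noted above, Pipermail does not record itself), the Apache rules could be generated along these lines. The function name `redirect_rules` and all paths are hypothetical; `Redirect permanent` is the standard mod_alias directive.

```python
def redirect_rules(old_paths, new_paths):
    """Build mod_alias rules from two dicts mapping message-id -> URL path.

    Messages missing from new_paths, or whose path did not change, are skipped.
    """
    rules = []
    for message_id, old_path in sorted(old_paths.items()):
        new_path = new_paths.get(message_id)
        if new_path is None or new_path == old_path:
            continue
        rules.append("Redirect permanent %s %s" % (old_path, new_path))
    return rules


# Hypothetical data: the regenerated archive is served under a different prefix.
old_paths = {
    "<b@example.org>": "/pipermail/example-list/2010-July/000001.html",
    "<c@example.org>": "/pipermail/example-list/2010-July/000002.html",
}
new_paths = {
    "<b@example.org>": "/archive/example-list/2010-July/000001.html",
    "<c@example.org>": "/archive/example-list/2010-July/000003.html",
}

rules = redirect_rules(old_paths, new_paths)
print("\n".join(rules))
```

The sketch deliberately puts the regenerated archive under a separate prefix. If old and new numbers shared one path, an old URL could also be some other message's new URL, and the redirects would chain to the wrong message.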
> We need to clearly define the problem first.
> - mbox file - some kind of binary file with messages inside
> - archive site served by Apache with some content
> 1. What is this site?
> 2. How are the .html files generated (statically/dynamically)?
> 3. What are the linking rules?
> 4. What are the name-generating formulas?
> - de-facto information loss - symptoms - broken URL links, broken thread chains
> Need to find out:
> - source of information loss
> - if the recovery is possible
> - recovery scenarios
> - implement recovery scenario
> - run implementation
>> On a second level, I've been searching for volunteers to fix this wart in
>> Pipermail for at least a decade. No one's stepped forward so far. If you're
>> interested in helping, the Mailman project would love to have you.
> It will be expensive to get me for the whole project. The only thing I
> can promise is to put some effort into this specific data
> transformation scenario.
Perhaps you misunderstand the meaning of the term "volunteer".
Steve Holden +1 571 484 6266 +1 800 494 3119
DjangoCon US September 7-9, 2010 http://djangocon.us/
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/