[pydotorg-www] Archives corruption (was: [PythonInfo Wiki] Update of "tftp" by

Barry Warsaw barry at python.org
Tue Jul 6 21:06:28 CEST 2010

On Jul 06, 2010, at 08:20 PM, anatoly techtonik wrote:

>So, the bug is fixed, but archives still need to be repaired with
>`cleanarch` script.
>1. Is that right?


>2. If the bug is fixed - how come that Python archives become
>corrupted? 3. If they were not corrupted - why they were regenerated?

Pipermail processes messages one at a time for on-the-fly archive updates.
Those updates were never affected.  Mailman concatenates the messages into an
mbox file, and the since-fixed bug broke the de facto standard mbox message
delimiter.  Although this was fixed in Mailman, old mbox files can still contain
messages separated by the bogus delimiter.  Wiping the HTML archive and
regenerating it from those mbox files could therefore produce incorrect
archives.  cleanarch uses heuristics to find those broken delimiters and writes
a new mbox file with the delimiters fixed.
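
To make the idea concrete, here is a rough sketch of the kind of heuristic
involved.  It is not cleanarch's actual code.  It assumes that a genuine mbox
delimiter looks like "From <address> <weekday> <month> <day> <time> <year>",
and that any other line starting with "From " is message body text that should
be escaped with a leading ">".  The STRICT_FROM pattern and the
escape_bogus_delimiters name are made up for illustration.

```python
import re

# Assumed shape of a genuine delimiter line (illustrative, not cleanarch's regex):
#   From barry@example.org Tue Jul  6 21:06:28 2010
STRICT_FROM = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) +"
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +"
    r"\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}\s*$"
)


def escape_bogus_delimiters(lines):
    """Return a copy of `lines` with suspicious "From " lines escaped.

    A line that starts with "From " but fails the strict pattern is
    treated as body text and prefixed with ">", so an mbox reader no
    longer mistakes it for the start of a new message.
    """
    fixed = []
    for line in lines:
        if line.startswith("From ") and not STRICT_FROM.match(line):
            fixed.append(">" + line)
        else:
            fixed.append(line)
    return fixed


if __name__ == "__main__":
    sample = [
        "From barry@example.org Tue Jul  6 21:06:28 2010\n",
        "Subject: test\n",
        "\n",
        "From my point of view this is fine.\n",
    ]
    for line in escape_bogus_delimiters(sample):
        print(line, end="")
```

A real tool has to cope with more delimiter variants than this one pattern
accepts, which is why the result is a heuristic rather than an exact repair.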

>- mbox file - some kind of binary file with messages inside
>- archive site served by Apache with some content
>    1. What is this site?
>    2. How .html files are generated (statically/dynamically)?
>    3. What are linking rules?
>    4. What are name generating formulas?
>- de-facto information loss - symptoms - broken URL links, broken
>thread chains
>Need to find out:
>- source of information loss
>- if the recovery is possible
>- recovery scenarios
>- implement recovery scenario
>- run implementation

I don't really have time to dive into all these details.  Pipermail is free
software and folks on mailman-users or mailman-developers can provide lots of
good help.
