[pydotorg-www] Archives corruption (was: [PythonInfo Wiki] Update of "tftp" by 79.132.252.94)

Barry Warsaw barry at python.org
Tue Jul 6 16:39:36 CEST 2010


CC'ing Postmasters because really, this is something they at least need to be
aware of, and they are in the best position to help with it.  Remember, they are
dedicated and busy volunteers who keep critical python.org infrastructure
humming away, mostly trouble free.

On Jul 05, 2010, at 10:07 AM, anatoly techtonik wrote:

>That's unacceptable. Why is there no "critical issue" about that?
>What is the PSF looking into? We broke all web links from manually
>collected Python knowledge about 6 months ago and nobody cares.
>
>I think the PSF is able to organize a dedicated sprint, if it's not too late
>to recover the links, provided that somebody is able to describe the problem
>with Pipermail in sufficient detail, so that people with no
>background can pick it up and see what can be done.
>
>There is also closed `python-dev` archive on Google. Perhaps if PSF
>can figure out who is the owner of this archive (and if there is any
>content at all)  - some linking information still can be recovered.

Here's the issue.

Pipermail has never maintained a mapping between message-ids and urls.
This was true even before Pipermail was bolted into Mailman, and it has never
changed, despite being high on my wish list for a decade.  In any case, the
problem occurs because Pipermail messages are numbered sequentially, and the
numbers can differ between generating the archive on the fly (i.e. as messages
arrive) and regenerating it as a whole.  This is complicated by the fact that
there was a bug in Mailman years ago that broke the mbox separator so that
regens couldn't be done reproducibly.  This is why Mailman has a cleanarch
script.
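
To make the numbering problem concrete, here is a toy sketch in Python.  It is
not Pipermail's code; number_messages and the sample mbox text are made up for
illustration, and it assumes the separator breakage shows up as a body line
starting with "From " that was not escaped as ">From ".

```python
def number_messages(mbox_text):
    """Map each Message-ID to a sequential number, the way a naive
    regeneration pass would (toy model, not Pipermail's real logic)."""
    numbers = {}
    index = -1
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            # Every line that looks like a separator starts a new message.
            index += 1
        elif line.startswith("Message-ID: "):
            numbers[line[len("Message-ID: "):]] = index
    return numbers


# A well-formed mbox: the body line is escaped as ">From ".
clean = "\n".join([
    "From alice Mon Jan  4 00:00:00 2010",
    "Message-ID: <a@example>",
    "",
    ">From my point of view this is fine.",
    "",
    "From bob Mon Jan  4 00:01:00 2010",
    "Message-ID: <b@example>",
    "",
    "Reply.",
])

# The same mbox with the escaping lost: the body line now looks like a
# separator, so a phantom message is counted and later numbers shift.
broken = clean.replace(">From my", "From my")

clean_numbers = number_messages(clean)
broken_numbers = number_messages(broken)
```

In this toy model the second message gets number 1 from the clean file and
number 2 from the broken one, so any url built from that number changes
between the two runs.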

The best way to regenerate a clean archive is to take the mbox file, run
cleanarch over it, then run 'arch --wipe'.  The regenerated urls will probably
not match the old ones, so if the original urls can be retrieved then I think
the easiest way to "fix" them is to write some alias rules for Apache to do
permanent redirects to the new urls.  This is not a trivial amount of work,
which is probably why it hasn't been done yet.  Who wants to - and can -
volunteer to see this through to the end?
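
As a rough sketch of the redirect step, assuming someone has already recovered
a message-id to old url path table and built the matching table for the
regenerated archive: the function name, the table layout, the list name and
the host below are all hypothetical.  The only real piece is Apache's
mod_alias "Redirect permanent" directive.

```python
def redirect_rules(old_paths, new_paths, base):
    """Build Apache 'Redirect permanent' lines for every message whose
    archive path changed.  Both tables map message-id -> url path."""
    rules = []
    for message_id, old_path in sorted(old_paths.items()):
        new_path = new_paths.get(message_id)
        if new_path is None or new_path == old_path:
            # Message missing from the new archive, or url unchanged.
            continue
        rules.append("Redirect permanent %s %s%s" % (old_path, base, new_path))
    return rules


# Hypothetical example data.
old_paths = {
    "<a@example>": "/pipermail/example-list/2010-January/000000.html",
    "<b@example>": "/pipermail/example-list/2010-January/000002.html",
}
new_paths = {
    "<a@example>": "/pipermail/example-list/2010-January/000000.html",
    "<b@example>": "/pipermail/example-list/2010-January/000001.html",
}

rules = redirect_rules(old_paths, new_paths, "http://mail.example.org")
for rule in rules:
    print(rule)
```

One caveat with this approach: if an old path is reused by a different message
in the regenerated archive, a plain redirect on that path would hide the new
message, so such collisions would need to be handled separately.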

On a second level, I've been searching for volunteers to fix this wart in
Pipermail for at least a decade.  No one's stepped forward so far.  If you're
interested in helping, the Mailman project would love to have you.

-Barry