[pydotorg-www] Archives corruption

Paul Boddie paul at boddie.org.uk
Thu Jul 8 01:07:15 CEST 2010


On Wednesday 07 July 2010 15:36:20 Barry Warsaw wrote:
> On Jul 07, 2010, at 01:12 AM, Paul Boddie wrote:
> >I've been looking at the Mailman code and the Mailman.Archiver code in
> >particular, although I'm still not sure whether it makes sense to take
> >the gzipped archives from mail.python.org and try and process them in
> >some way.
>
> Probably not by itself, since the message-ids are not embedded in the html.

I was thinking of the gzipped archives linked to from the "list archives"
page, which provide plain-text mailbox files (the "Downloadable version"):

http://mail.python.org/pipermail/python-list/

But I think you're ahead of me here...

>  I think you'll want a tar of the private archives directory, so that you
> can unpack the various pickles to try to work out which message-ids are
> assigned to which sequence numbers.  The problem with that of course is
> that with a regenerated archive, those mappings won't be correct any more.

I was hoping that simply fetching the mailbox archives and running
pipermail (in some form) over them would give HTML archives with correct
sequence numbers, given a suitable starting value, but I guess the
guarantees needed to make this feasible are just absent. For example, the
ordering of the messages in the mailbox files could be different from the
original processing order, and some older messages may have been archived
as HTML after newer ones, and so on.
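
A minimal sketch of the naive renumbering idea, assuming a downloaded
archive has already been gunzipped to a plain mbox file. It only uses
Python's standard mailbox module, not Mailman's own archiver, and the
function name, the sample messages and the starting value of 1000 are all
hypothetical. It shows why the approach is fragile: numbers simply follow
the order of messages in the file, whatever their dates say.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message mailbox; the later-dated message comes first,
# as might happen if the file order differs from the processing order.
SAMPLE = (
    "From paul at example.org  Thu Jul  8 01:07:15 2010\n"
    "Message-ID: <b@example.org>\n"
    "Subject: later date, first in file\n"
    "\n"
    "body\n"
    "\n"
    "From barry at example.org  Wed Jul  7 15:36:20 2010\n"
    "Message-ID: <a@example.org>\n"
    "Subject: earlier date, second in file\n"
    "\n"
    "body\n"
)


def number_messages(mbox_path, start=0):
    """Return (sequence_number, message_id) pairs in mailbox file order."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [
            (start + offset, (msg.get("Message-ID") or "").strip())
            for offset, msg in enumerate(box)
        ]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "2010-July.txt")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE)
    # 1000 is an arbitrary stand-in for the "suitable starting value".
    pairs = number_messages(path, start=1000)

for number, message_id in pairs:
    print(number, message_id)
```

Any difference between this file order and the order in which the
original archiver saw the messages would shift every later number.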

So, yes, it may be necessary to reverse engineer the correspondence between
Message-Id (or some other identifying header) and sequence number, as you
say...

> Maybe if we knew when the regen occurred, we could get some backups and try
> to reverse engineer those mappings.

The problem was first noticed in January 2010, I think.

Paul

