[pydotorg-www] Archives corruption

Wed Jul 7 15:44:07 CEST 2010

On Jul 07, 2010, at 12:16 PM, anatoly techtonik wrote:

>Before anything else: Is Pipermail a separate project from Mailman?

It used to be, but it was pulled into Mailman and bolted on sometime before
the 1.0 release.  It ceased being a separate project at that time.

>Where can I read about it? Search turns up nothing.

UTSL (use the source, Luke).

>If I understand correctly, the messages in mbox are stored in the
>order they were received.

Correct.

http://en.wikipedia.org/wiki/Mbox
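
As a rough illustration of that ordering: Python's standard mailbox module
iterates over an mbox file in storage order, so numbering messages
sequentially amounts to an enumerate. This is only a minimal sketch, not
Pipermail's actual code; the helper name number_messages and the choice to
start counting at 0 are assumptions made for the example.

```python
import mailbox


def number_messages(mbox_path):
    """Return [(sequence_number, message_id)] in mbox storage order.

    Hypothetical helper for illustration; not part of Mailman or Pipermail.
    """
    # create=False makes a missing file an error instead of creating it.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        # Iterating a mailbox.mbox yields messages in the order they are stored.
        return [(seq, msg.get("Message-ID", "")) for seq, msg in enumerate(box)]
    finally:
        box.close()
```

If two copies of the same mbox give different lists here, any URLs derived
from the sequence numbers would be expected to differ as well.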

>What about URL generation? Logically, I would write a site generator
>that reads one message at a time and assigns message numbers
>sequentially according to message order. It should then analyze the
>timestamp and thread-linking attributes to work out where to put each
>message. As it probably cannot generate HTML incrementally (for
>example, by inserting a message that arrived later into the middle of
>a thread's HTML page), it needs to build some indexes.

These are stored on disk as pickles.
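
For anyone unfamiliar with pickles, the general mechanism looks like the
sketch below. The index layout (a dict mapping sequence number to Message-ID)
and the function names are assumptions made for illustration; the structures
Pipermail actually writes are more involved, so read the source for the real
format.

```python
import pickle


def save_index(index, path):
    """Serialize an index object to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Load an index object back from disk (hypothetical helper).

    Only unpickle files you trust: loading a pickle can execute code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

Loading an existing pickle and printing its keys is usually the quickest way
to see what an index really contains.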

>So some possible cases to test:
>1. mbox somehow got sorted in different order
>  [ ] get some mboxes from backups and compare them
>2. message counter overflow happened while building indexes
>  [ ] check serialization/deserialization logic for message counter
>  [ ] grep places where it is used
>3. index limit overflow
>  [ ] check limits for max messages per
>        month/year/thread/mbox/ ... / anything else?
>
>
>We need to research how the site generator sorts messages before
>processing them and how it constructs its indexes. But first there
>must be a sanity check that the mbox files are intact.

I don't think we modified the mbox files, perhaps other than to cleanarch
them.  At least I don't remember doing anything like that.  Theoretically, if
the message sequences in the mbox file were identical to the on-the-fly
generation of the html, then the sequence numbers should be the same too.  The
problem is that cleanarch relies on heuristics which can sometimes be
incorrect.  I'm also not sure whether cleanarch was run on the mbox file
before the regen occurred.
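
One way to check whether two copies of an mbox (say, a backup and the current
file) hold the same message sequence is to compare their Message-ID lists and
report the first position where they differ. This is a minimal sketch under
the assumption that Message-ID is a good enough identity for a message; the
function names are hypothetical and none of this is cleanarch or Pipermail
code.

```python
import mailbox
from itertools import zip_longest


def message_ids(mbox_path):
    """Return the Message-ID of every message, in mbox storage order."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg.get("Message-ID", "") for msg in box]
    finally:
        box.close()


def first_divergence(ids_a, ids_b):
    """Return the first index where the two sequences differ, else None.

    A length mismatch counts as a difference at the shorter list's end.
    """
    for i, (a, b) in enumerate(zip_longest(ids_a, ids_b)):
        if a != b:
            return i
    return None
```

Calling first_divergence(message_ids(backup), message_ids(current)) and
getting None back would suggest the two files number their messages
identically; an integer points at the first message worth inspecting by hand.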

>How can I get started quickly with the toolchain for converting the
>archive? Can anybody send some initial data (an mbox and a pointer to
>the generated site), give the exact versions of the installed
>toolchain, and assure me that a 'diff' against the actual downloaded
>versions of this toolchain is empty?

We could make a tar of the entire private archive directory, which probably
includes all the raw data you need.  If anybody objects to making this
available to anatoly, please let me know.

-Barry