[pydotorg-www] Archives corruption

anatoly techtonik techtonik at gmail.com
Wed Jul 7 11:16:21 CEST 2010

On Wed, Jul 7, 2010 at 2:12 AM, Paul Boddie <paul at boddie.org.uk> wrote:
> (T-shirt suggestion: Pwned by Python?)


> I've noticed that the archive numbering for the problematic python-list
> archives does start at 000000 and 000001 but then skips around to 619310 and
> 627807:
> http://mail.python.org/pipermail/python-list/1999-February/date.html
> There's a good mixture of various ranges in subsequent months.
> I've been looking at the Mailman code and the Mailman.Archiver code in
> particular, although I'm still not sure whether it makes sense to take the
> gzipped archives from mail.python.org and try and process them in some way.
> Any suggestions?

Before anything else: is Pipermail a separate project from Mailman?
Where can I read about it? Searching turns up nothing.

If I understand correctly, the messages in an mbox are stored in the
order they were received. What about URL generation? Logically, I would
write a site generator that reads one message at a time and assigns
message numbers sequentially according to message order. It should then
analyze the timestamp and thread-linking attributes to decide where to
put each message. As it probably cannot generate HTML incrementally
(for example, inserting a message that arrived later into the middle of
a thread's HTML page), it needs to build some indexes.
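
A minimal sketch of the generator logic described above, using only the
Python standard library. It is not how Pipermail actually works; the names
`build_indexes` and `header` are made up for illustration, and the sketch
assumes that In-Reply-To holds a bare Message-ID and that a parent always
appears in the mbox before its replies.

```python
import mailbox
from collections import defaultdict
from email.utils import parsedate_to_datetime


def header(msg, name):
    """Return a header as a stripped string, or None if it is absent."""
    value = msg[name]
    return None if value is None else str(value).strip()


def build_indexes(mbox_path):
    """Number messages in mbox order, then index them by month and by parent."""
    by_month = defaultdict(list)  # "1999-02" -> message numbers
    replies = defaultdict(list)   # parent number -> numbers of direct replies
    number_of = {}                # Message-ID -> message number

    box = mailbox.mbox(mbox_path, create=False)
    try:
        # Keys of an mbox follow file order, so enumerate() yields
        # sequential numbers in the order the messages were stored.
        for number, msg in enumerate(box):
            msgid = header(msg, "Message-ID")
            if msgid is not None:
                number_of.setdefault(msgid, number)

            try:
                date = parsedate_to_datetime(header(msg, "Date"))
                month = date.strftime("%Y-%m")
            except (TypeError, ValueError):
                month = "unknown"  # missing or unparseable Date header
            by_month[month].append(number)

            # Assumption: the parent was already seen earlier in the mbox.
            parent = number_of.get(header(msg, "In-Reply-To"))
            if parent is not None and parent != number:
                replies[parent].append(number)
    finally:
        box.close()
    return dict(by_month), dict(replies)
```

With numbering done this way, the numbers inside one month can only grow
in steps, so jumps like 000001 to 619310 would have to come from somewhere
else (a shared counter, a different ordering, or damaged input).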

So some possible cases to test:
1. mbox somehow got sorted in a different order
  [ ] get some mboxes from backups and compare them
2. message counter overflow happened while building indexes
  [ ] check serialization/deserialization logic for message counter
  [ ] grep places where it is used
3. index limit overflow
  [ ] check limits for max messages per
        month/year/thread/mbox/ ... / anything else?
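
For case 1, a rough sketch of how two copies of the same mbox (say, a
backup and the live file) could be compared by message order. The function
names `message_ids` and `first_divergence` are hypothetical, and the
comparison assumes Message-ID is a good enough identity for a message.

```python
import mailbox


def message_ids(mbox_path):
    """Return the Message-ID of every message, in the order stored in the file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        ids = []
        for msg in box:
            value = msg["Message-ID"]
            ids.append("" if value is None else str(value).strip())
        return ids
    finally:
        box.close()


def first_divergence(ids_a, ids_b):
    """Index of the first position where the lists differ, or None if identical."""
    for index, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return index
    if len(ids_a) != len(ids_b):
        # One list is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None
```

If `first_divergence(message_ids(backup), message_ids(live))` returns None,
the two files hold the same messages in the same order and case 1 can
probably be ruled out for that month.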

We need to research how the site generator sorts messages before
processing and how it builds its indexes. But first there must be a
sanity check that the mbox files are intact.
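
One possible shape for such a sanity check, as a sketch only: count the
messages in an mbox and flag missing or duplicated headers. The name
`sanity_report` and the choice of what counts as suspicious are my
assumptions, not anything taken from Mailman.

```python
import mailbox
from collections import Counter


def sanity_report(mbox_path):
    """Count messages and flag missing Date/Message-ID and duplicate IDs."""
    total = missing_id = missing_date = 0
    seen = Counter()

    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            total += 1
            msgid = msg["Message-ID"]
            if msgid is None:
                missing_id += 1
            else:
                seen[str(msgid).strip()] += 1
            if msg["Date"] is None:
                missing_date += 1
    finally:
        box.close()

    return {
        "total": total,
        "missing_id": missing_id,
        "missing_date": missing_date,
        "duplicate_ids": sorted(m for m, n in seen.items() if n > 1),
    }
```

A clean report does not prove that the file is intact, but a sudden drop in
`total` between two backups, or many duplicate IDs, would be a strong hint
that the input is already damaged before any index is built.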

How can I get started quickly with the toolchain for converting the
archive? Can anybody send some initial data (an mbox and a pointer to the
generated site), give the exact versions of the installed toolchain, and
assure me that a 'diff' against the actual downloaded versions of this
toolchain is empty?

anatoly t.
