[Mailman-Users] Integration with external search engine

Mark Sapiro mark at msapiro.net
Wed Dec 22 18:34:57 CET 2010

On 12/22/2010 5:26 AM, Lukáš Vlček wrote:
> What is the Mailman algorithm to number individual HTML representations
> of mails?

Sequential in order of arrival.

> My understanding was that once the new mail is received by Mailman then
> it is processed, appended to mbox accumulated file and put into
> private/public archive folder (i.e. HTML representation is rendered and
> stored on the disk). If the flow is that smooth then the numbering would
> really match the order of individual messages in accumulated mbox file.

This is correct. Further, the list is locked during this process so even
with "simultaneous" arrival of two messages to be archived, the order in
the .mbox should match the sequence in the pipermail archive.
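
As a rough illustration of that sequencing, here is a minimal Python
sketch that walks an .mbox and gives each From_ line the number its
pipermail page would be expected to carry. The function name
index_mbox, the zero-based start and the six-digit NNNNNN.html file
name are assumptions made for the example, so check them against your
own archive. Folded Message-ID headers are not handled.

```python
def index_mbox(lines, first=0):
    """Return [(sequence_number, message_id, html_name), ...] in mbox order.

    `first` (the number given to the first message) and the six-digit
    file name are assumptions; verify both against a real archive.
    """
    entries = []
    in_headers = False
    for line in lines:
        if line.startswith("From "):
            # Every From_ line is counted as the start of a new message.
            number = first + len(entries)
            entries.append([number, None, "%06d.html" % number])
            in_headers = True
        elif in_headers:
            if line.strip() == "":
                in_headers = False  # a blank line ends the header block
            elif line.lower().startswith("message-id:") and entries[-1][1] is None:
                entries[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(entry) for entry in entries]
```

It accepts any iterable of text lines, so an open file object for the
list's .mbox can be passed in directly.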

> Maybe if the new message has to undergo admin moderation then this can
> influence the resulting numbering (producing numbering gaps?), but I am
> just speculating here...

No. It is not archived until after moderator approval.

> Do you think you could shed more light on the numbering process?
> To me it seems unfortunate that there is really no simple way to
> determine a valid URL for individual mails in the mbox file.

The number in the archive *should* match the sequence in the .mbox.
Reasons why it might not include manual editing of the .mbox file,
running bin/arch to add messages to the archive without adding them in
the same sequence to the .mbox file, and messages with embedded,
unescaped "^From " lines in the body.

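The last of those causes can be hunted for with a small script. This
is only a heuristic sketch, not what Mailman itself does: it assumes a
genuine From_ separator is always preceded by a blank line (or is the
first line of the file) and flags any other line starting with
"From ". The function name suspicious_from_lines is made up for the
example.

```python
def suspicious_from_lines(lines):
    """Return 1-based line numbers of "From " lines not preceded by a blank line."""
    hits = []
    previous_blank = True  # the start of the file counts as a boundary
    for number, line in enumerate(lines, start=1):
        if line.startswith("From ") and not previous_blank:
            hits.append(number)
        previous_blank = line.strip() == ""
    return hits
```

A body line that starts with "From " right after an empty line will
not be caught this way, so treat an empty result as a hint rather than
proof that the .mbox is clean.
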
>     If you don't want to rebuild the pipermail archive and possibly renumber
>     messages, you will need to develop some script to go through the .mbox
>     and parse the archive period (year/month or whatever the period is in
>     your case) from the messages and search the nnnn.html files in that
>     directory for a match.
> Search for the match using the Message-ID value?
> The Message-ID is not always present in the HTML version, is it? All I
> can see is that the Message-ID value is encoded into the mailto: link
> as an In-Reply-To value. Other than that, some advanced heuristics
> would have to be used...

In Mailman 2.1.10 and later, the mailto: always contains the message-id
of this message in the In-Reply-To fragment. Prior to 2.1.10 there was
not always a message-id in the mailto: and if there was, it was not the
message-id of this message but rather the in-reply-to of this message.
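
For archives built by 2.1.10 or later, pulling that value back out of
an nnnn.html file could look roughly like the sketch below. It assumes
the link appears as an href="mailto:...?...&In-Reply-To=..." attribute
with the Message-ID percent-encoded; the exact markup may differ in
your archive, and message_id_from_html is a name invented for the
example.

```python
import re
from html import unescape
from urllib.parse import unquote, urlsplit

# Assumed markup: href="mailto:...". Matching is case-insensitive.
HREF_RE = re.compile(r'href="(mailto:[^"]+)"', re.IGNORECASE)

def message_id_from_html(html_text):
    """Return the In-Reply-To value of the first mailto: link that has one, else None."""
    for match in HREF_RE.finditer(html_text):
        href = unescape(match.group(1))  # turn &amp; back into &
        query = urlsplit(href).query
        for part in query.split("&"):
            name, _, value = part.partition("=")
            if name.lower() == "in-reply-to":
                # unquote() decodes %XX but leaves a literal "+" alone.
                return unquote(value).strip()
    return None
```

For pages written by versions before 2.1.10 the value returned, if
any, would be the in-reply-to of the message rather than its own
message-id, so it cannot be used for matching there.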

I suggest you simply test your .mbox file to see if the sequence numbers
you generate from the From_ lines match those in the archive. As long as
you have not manually manipulated the .mbox or merged separate .mbox
files, there's a good chance this will be OK. You don't have to check
every single message. If the numbering is off, there will be places
where the numbering jumps from being correct to "off by one" and then to
"off by two", etc. I.e., I don't think you have to worry about things
like an mbox sequence of n, n+1, n+2, n+3, ... corresponding to an
archive sequence of n, n+2, n+1, n+3, ... See the FAQ at
<http://wiki.list.org/x/RIA9> for a description of what happened to this
list when the archive was rebuilt in 2006.
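
The spot check described above can be sketched as follows. The inputs
are assumptions for the example: mbox_ids is the list of message-ids in
.mbox order (None where a message has none), and archive_ids maps an
archive number to the message-id found in that nnnn.html file for
whichever pages you chose to sample. The name numbering_drift is
hypothetical.

```python
def numbering_drift(mbox_ids, archive_ids):
    """Return [(archive_number, offset), ...] at each point the offset changes.

    offset = position in the .mbox minus the archive number, so a clean
    archive yields a single entry with offset 0.
    """
    position = {mid: i for i, mid in enumerate(mbox_ids) if mid is not None}
    drift = []
    last = None
    for number in sorted(archive_ids):
        index = position.get(archive_ids[number])
        if index is None:
            continue  # sampled page has no counterpart in the .mbox
        offset = index - number
        if offset != last:
            drift.append((number, offset))
            last = offset
    return drift
```

A result with more than one entry shows where the numbering goes "off
by one", "off by two" and so on. Duplicate message-ids in the .mbox
are not handled; the last occurrence wins.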

Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
