[Mailman-Users] Integration with external search engine

Tue Dec 21 23:33:56 CET 2010

On 12/18/2010 5:45 AM, Lukáš Vlček wrote:
> 
> On Sat, Dec 18, 2010 at 4:31 AM, Mark Sapiro <mark at msapiro.net> wrote:
> 
>     find /path/to/archives/private/LISTNAME \
>      | egrep '/[0-9]{6}\.html$' \
>      | sed "s;.*archives/private;http://www.example.com/pipermail;"
> 
>     with the obvious modification will get the URLs. Will that be enough?
> 
> 
> Not exactly. I need to index the mailing list content with an external
> search server, and for each indexed mail I need to know the working
> public Mailman URL of that mail.

The above shell command will get you a list of the URLs. If you are
saying you need to know the message content together with the URL, you
could still do this easily from the existing pipermail archive. The
point is that each individual message in the archive is in a file of the
form archives/private/LISTNAME/yyyy-Month/nnnnnn.html and the
LISTNAME/yyyy-Month/nnnnnn.html portion of that path is also the
variable part of the URL used to access the message.
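
As a rough illustration of that mapping, here is a minimal Python sketch. The
function name, the private archive root and the base URL are placeholders of
mine, not anything Mailman provides; substitute your own paths.

```python
import re
from pathlib import PurePosixPath

# Individual messages are stored as six-digit file names such as 000123.html.
MESSAGE_FILE = re.compile(r"[0-9]{6}\.html")


def archive_path_to_url(path, private_root, base_url):
    """Map an archived message file to its pipermail URL.

    Returns None for files that are not individual messages
    (index pages such as thread.html, attachments, and so on).
    """
    path = PurePosixPath(path)
    if not MESSAGE_FILE.fullmatch(path.name):
        return None
    # LISTNAME/yyyy-Month/nnnnnn.html is shared by the file path and the URL.
    relative = path.relative_to(private_root)
    return f"{base_url.rstrip('/')}/{relative}"


print(archive_path_to_url(
    "/path/to/archives/private/LISTNAME/2010-December/000123.html",
    "/path/to/archives/private",
    "http://www.example.com/pipermail",
))
```

This does the same job as the find/egrep/sed pipeline, one path at a time, so
you can read the file content and compute its URL in the same loop.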

> My question is: if I take the <list-name>.mbox file, is there any way I
> can deduce the working URL of individual emails?
> Say I split the mbox file into individual emails using:
> csplit -s -b %06d.mbox -z <list-name>.mbox '/^From /' {*}
> Would the numbering be the same as the one produced by Mailman in this
> case? (Provided Mailman numbering starts from zero.)

It will be the same as the numbering produced by running bin/arch
--wipe. As you note below, this is not guaranteed to be the same as that
in the existing archive.

> I learned that if I use this csplit technique with public archives then
> the numbering is not guaranteed to match (the order in which the mails
> are stored in public archives does not match the numbering order of
> the HTML files produced by Mailman). Moreover, public archive files do
> not contain all the email headers (charset, encoding, content-type, ...)
> and I don't want to index generated HTML files for now.

If you really need information from the cumulative .mbox which is not
available in the existing pipermail html files, I see two choices.

If you don't want to rebuild the pipermail archive and possibly renumber
messages, you will need to develop a script to go through the .mbox,
parse the archive period (year/month or whatever the period is in
your case) from each message, and search the nnnnnn.html files in that
period's directory for a match.
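
A minimal Python sketch of the two pieces such a script needs follows. It
assumes monthly archive volumes named yyyy-Month, takes the period from the
Date: header (your installation may instead file messages by arrival time),
and matches on the Subject: text, which is only a heuristic of mine since
several messages in a thread share a subject. Both function names are
hypothetical.

```python
import html
import re
from email import message_from_string
from email.utils import parsedate_to_datetime
from pathlib import Path

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MESSAGE_FILE = re.compile(r"[0-9]{6}\.html")


def archive_period(msg):
    """Return the 'yyyy-Month' directory name, based on the Date: header.

    Messages with a missing or unparsable Date: raise an exception and
    would need separate handling.
    """
    when = parsedate_to_datetime(msg["Date"])
    return f"{when.year}-{MONTHS[when.month - 1]}"


def candidate_files(period_dir, subject):
    """List the nnnnnn.html files in period_dir that contain the subject."""
    needle = html.escape(subject, quote=False)  # the HTML escapes <, > and &
    hits = []
    for page in sorted(Path(period_dir).glob("*.html")):
        if not MESSAGE_FILE.fullmatch(page.name):
            continue  # skip index pages such as thread.html
        if needle in page.read_text(encoding="utf-8", errors="replace"):
            hits.append(page.name)
    return hits


msg = message_from_string(
    "Date: Tue, 21 Dec 2010 23:33:56 +0100\nSubject: test\n\nbody\n")
print(archive_period(msg))
```

When candidate_files returns more than one name, you would have to narrow the
match further, for example by comparing the sender or the date shown in the
page.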

If you don't mind possibly renumbering messages, you could first check
the .mbox with bin/cleanarch and then rebuild the archive from the .mbox
with bin/arch --wipe, and then your csplit above will give the correct
new numbers.
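
If you would rather stay in Python than use csplit, a sketch along the
following lines walks the .mbox in the same order and gives you the full
headers with each sequence number. It relies on the standard mailbox module,
which splits on lines beginning with "From " much as the csplit pattern does.
The function name is mine, and the numbers only correspond to the archive if
the archive was built from this same .mbox, numbered from zero.

```python
import mailbox


def numbered_messages(mbox_path):
    """Yield (html_file_name, message) pairs in mbox order, from 000000."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, msg in enumerate(box):
            yield f"{index:06d}.html", msg
    finally:
        box.close()
```

Looping over numbered_messages("LISTNAME.mbox") then gives you, for each
message, the file name to look for plus headers such as
msg.get_content_type() and msg.get_content_charset(), which the HTML pages do
not carry.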

Before rebuilding the archive however, you might check if the numbering
in the mbox really doesn't match. While it is not guaranteed to match,
it often does, particularly if the archive is not too old - i.e., if the
oldest messages were archived by Mailman 2.1.x and not 2.0.x or older.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan