[Mailman-Developers] Improving the archives

Fri Jul 20 16:07:48 CEST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> Second, things can happen to a list
>> that might cause this sequence number to get corrupted.
>
> Add an X-Mailman-Sequence-Number header if not already present.
>
> That doesn't deal with your other comments, but as I point out
> elsewhere, if you don't use *any* Mailman-specific information in the
> global ID, you have no sane way to handle collisions except throw them
> away (or make the global ID refer to a collection resource, but that's
> kinda unintuitive).

I'd probably call it X-List-Sequence-Number, and I'd have to ensure  
that the archive copy had that header in it.  OTOH, if I'm going to go to  
the trouble of adding this sequence number, why not just calculate a  
(more likely) gid for the message myself?  If I did that, I could use  
a tinyurl scheme and get much shorter urls.  The archiver would then  
be obliged to use my X-List-GID header verbatim.

I've been pushing for calculating this using non-Mailman headers  
because I'd /like/ for a client receiving the non-list copy to be  
able to make the same calculation.  OTOH, maybe we can have it both  
ways.

So, we calculate the sequence number and generate the following headers:

X-List-Sequence-Number: 801
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

The latter is composed of purely author generated data, the former is  
supplied by Mailman.
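
A minimal sketch of how the two headers might be generated.  The
message does not say which author-generated fields feed the GID; the
sketch assumes Message-ID plus Date, hashed with SHA-1 and base32
encoded (which happens to give a 32-character string like the example
above).  The function names compute_gid and add_list_headers are
hypothetical, not part of Mailman.

```python
import base64
import hashlib
from email.message import Message


def compute_gid(msg: Message) -> str:
    """Hypothetical GID: base32(SHA-1(Message-ID + Date)).

    Assumption: only headers set by the author's mail client are hashed,
    so a recipient of the non-list copy could repeat the calculation.
    """
    message_id = str(msg.get("Message-ID") or "").strip()
    date = str(msg.get("Date") or "").strip()
    digest = hashlib.sha1((message_id + date).encode("utf-8")).digest()
    # A 20-byte digest encodes to exactly 32 base32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


def add_list_headers(msg: Message, sequence_number: int) -> None:
    """Stamp the list copy with both headers unless already present."""
    if "X-List-Sequence-Number" not in msg:
        msg["X-List-Sequence-Number"] = str(sequence_number)
    if "X-List-Message-GID" not in msg:
        msg["X-List-Message-GID"] = compute_gid(msg)
```

Because the GID depends only on the assumed author-supplied headers,
two list copies of the same message get the same GID, while the
sequence number comes from whatever counter the list keeps.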

Assuming we also had this header:

List-Archive: http://archive.example.com/gid/

then the following urls would both point to the exact same resource:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801

If however we subsequently got a collision, then these two urls would  
address different resources.  E.g.:

X-List-Sequence-Number: 2112
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

Now the two messages would still be addressable by their respective  
urls:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/2112

but

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be a disambiguation page.  For a web u/i it would be an HTML  
list containing relative links to '801' and '2112'.  A RESTful XML  
document would contain the set of links to the subordinate pages.  A  
client of the archive.example.com service would have to be prepared  
to handle disambiguation pages if it used only the author generated  
GID, but it would be guaranteed that the full url would lead directly  
to one and only one email message.
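
The lookup rules above can be sketched as a small resolver.  This is
only an illustration under stated assumptions: ArchiveIndex and its
in-memory dictionary are hypothetical, messages are stood in for by
plain strings, and the disambiguation page is represented as a list of
full urls rather than an HTML or XML document.

```python
from collections import defaultdict
from typing import Dict, List, Union

BASE_URL = "http://archive.example.com/gid/"


class ArchiveIndex:
    """Hypothetical in-memory index: GID -> {sequence number -> message}."""

    def __init__(self) -> None:
        self._by_gid: Dict[str, Dict[int, str]] = defaultdict(dict)

    def add(self, gid: str, seq: int, message: str) -> None:
        self._by_gid[gid][seq] = message

    def resolve(self, path: str) -> Union[str, List[str]]:
        """Resolve 'GID' or 'GID/SEQ' (the part of the url after BASE_URL)."""
        gid, _, seq = path.strip("/").partition("/")
        bucket = self._by_gid.get(gid)
        if not bucket:
            raise KeyError(path)
        if seq:
            # The full url always names exactly one message.
            return bucket[int(seq)]
        if len(bucket) == 1:
            # No collision: the bare GID url is the message itself.
            return next(iter(bucket.values()))
        # Collision: return the links a disambiguation page would list.
        return [f"{BASE_URL}{gid}/{n}" for n in sorted(bucket)]
```

A client that only knows the author-generated GID has to check which
of the two result types it got back; a client that also has the
sequence number never sees the list.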

Archivers would have to recognize the X-List-Sequence-Number header and  
honor it whenever they regenerated their archives, so that the urls  
would remain stable.

Thinking about this more (and I've been up since about 3:30am so I'm  
a little foggy right now ;), we may want to optimize for fewer dupes  
rather than fewer collisions, or maybe it doesn't matter.  It would  
be interesting to see how big the message-id buckets are when only  
using the Message-ID header.
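
One way that measurement might be taken, as a rough sketch: count how
many archived messages share each Message-ID, then tally how many
buckets there are of each size.  The helper names are hypothetical,
and reading from an mbox file is an assumption about where the
archive data lives.

```python
import mailbox
from collections import Counter
from typing import Iterable, Iterator


def message_ids_from_mbox(path: str) -> Iterator[str]:
    """Yield the Message-ID of every message in an mbox file."""
    for msg in mailbox.mbox(path):
        yield str(msg.get("Message-ID", ""))


def bucket_size_histogram(message_ids: Iterable[str]) -> Counter:
    """Map bucket size -> number of distinct Message-IDs with that size."""
    per_id = Counter(mid.strip() for mid in message_ids)
    return Counter(per_id.values())
```

Calling bucket_size_histogram(message_ids_from_mbox(path)) on a list's
mbox would show how often a bare Message-ID lookup lands on more than
one message; any key above 1 is a bucket that would need a
disambiguation page.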

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDBtHEjvBPtnXfVAQLOggQAhIjxlU2jPDb5K8Lfe3NThjgwKiPblqtm
UurUj+AZCffS1ewGDlV6y3GGRnHEzdVSIVvAiATEGTRVG8Zzbbev3GXs0EKYiEyL
FZreNcPqDAPL0KSGw73RdAiwZuszfQcMTsSwOx98zS9Kz0NtbntYQTuqQZwo7wAW
3KeGe2PkpaI=
=yhaZ
-----END PGP SIGNATURE-----