[Mailman-Users] Gmail "features"
brad at shub-internet.org
Thu Aug 9 19:05:02 CEST 2012
On Aug 8, 2012, at 11:11 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Well, unfortunately Gmail is closed-source and I don't know what the
> full algorithm is. Surely Message-Id is part of it, but evidently
> there are other aspects to it, or the behavior you and Brad
> R. describe wouldn't happen.
In the large-scale mail system designs that I've done in the past, the tuple (sender, recipient, message-id) was considered to be a pretty good index key for the mail database, albeit not a guaranteed unique one. Most greylisting implementations use the tuple (sender, recipient, sending-IP) to decide whether a particular message should be delayed.
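To make the two tuples concrete, here is a minimal sketch in Python. The class names, the 300-second delay, and the in-memory dict are all assumptions made for illustration; this is not taken from any particular mail system or greylisting package.

```python
from typing import Dict, NamedTuple


class MessageKey(NamedTuple):
    """Index key for a stored message: a good key, but not guaranteed unique."""
    sender: str
    recipient: str
    message_id: str


class GreylistKey(NamedTuple):
    """The triplet a typical greylisting filter tracks."""
    sender: str
    recipient: str
    sending_ip: str


class Greylist:
    """Hypothetical in-memory greylist: delay a triplet until it has aged."""

    def __init__(self, delay_seconds: float = 300.0) -> None:
        self.delay_seconds = delay_seconds
        self.first_seen: Dict[GreylistKey, float] = {}

    def should_delay(self, key: GreylistKey, now: float) -> bool:
        # Record the first time this triplet was seen, then delay every
        # delivery attempt until the configured interval has passed.
        first = self.first_seen.setdefault(key, now)
        return (now - first) < self.delay_seconds
```

Note that the message-id plays no part in the greylisting key, which is why a retry of the same message from a different sending IP usually gets delayed again.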
I even did a single-instance-store message database design that took an SHA-1 hash of the message body to see whether the contents really were unique. If they were not, you could store the headers separately from the body, and for the body just keep a pointer to the copy you already have. I believe that some implementations of Microsoft Exchange use a similar algorithm.
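A rough sketch of that single-instance-store idea, using Python's hashlib. The class and method names are hypothetical, and the store lives in memory only; a real design would have to deal with persistence, reference counting on delete, and so on.

```python
import hashlib
from typing import Dict, List, Tuple


class SingleInstanceStore:
    """Hypothetical store: headers kept per message, bodies kept once."""

    def __init__(self) -> None:
        self.bodies: Dict[str, bytes] = {}            # SHA-1 hex digest -> body
        self.messages: List[Tuple[bytes, str]] = []   # (headers, body digest)

    def add(self, headers: bytes, body: bytes) -> int:
        digest = hashlib.sha1(body).hexdigest()
        # Store the body only if this digest has not been seen before.
        self.bodies.setdefault(digest, body)
        # The message record holds the headers plus a pointer to the body.
        self.messages.append((headers, digest))
        return len(self.messages) - 1

    def get(self, msg_no: int) -> Tuple[bytes, bytes]:
        headers, digest = self.messages[msg_no]
        return headers, self.bodies[digest]
```

Two messages with different headers but byte-identical bodies end up sharing one entry in bodies.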
If you wanted to go to the extreme, you could decompose each message into its individual MIME bodyparts, and then do an SHA-1 hash on each of those. So, no matter how many copies of the latest Dilbert cartoon get mailed out, and no matter what text or other material might surround it, you'd still be able to reduce that to storing just one copy of the cartoon with multiple inbound links.
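The per-bodypart variant could look something like the following, using the standard library email package. The function names are made up for this sketch, and build_message exists only to produce test input; hashing the decoded payload (rather than the encoded bytes on the wire) is an assumption about what you would want to treat as "the same" attachment.

```python
import hashlib
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import List


def leaf_part_digests(raw_message: bytes) -> List[str]:
    """Return one SHA-1 hex digest per non-multipart MIME bodypart."""
    msg = message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        digests.append(hashlib.sha1(payload).hexdigest())
    return digests


def build_message(text: str, attachment: bytes) -> bytes:
    """Helper for the example: a text part plus one binary attachment."""
    msg = MIMEMultipart()
    msg.attach(MIMEText(text))
    msg.attach(MIMEApplication(attachment))
    return msg.as_bytes()
```

Feeding it two messages with different covering text but the same attachment gives different digests for the text parts and the same digest for the attachment, so only one copy of the attachment would need to be stored.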
On the other hand, Nick Christenson (author of "Sendmail Performance Tuning", ISBN-13: 978-0321115706) and I discovered that you would be trading more disk I/O operations in order to try to save a relatively trivial amount of disk space, and that's the exact opposite of the trade-off you want to make, given the way disk storage capacities have rapidly grown while I/O capacities have been relatively stagnant. We discussed all these issues in the invited talk "Design and Implementation of Highly Scalable E-mail Systems", see <http://www.shub-internet.org/brad/papers/dihses/>.
I happen to know the former SRE for gmail, but I don't think he'd be able to tell me anything useful on this subject.
I really don't think that this is a disk storage issue. I think it is much more likely to be a wrong-headed idea that this kind of thing will be beneficial to the users -- after all, they know that they sent the message and that copy is sitting in the outbox, so they don't need to have another copy sitting in the inbox.
And maybe for the majority of users, that decision might actually be helpful. But they need to give people a way to turn that option off, so that they don't break the ability to do debugging when testing the sending of messages to remote systems.
Of course, if people are on Google Groups, then this probably isn't an issue for them. And maybe that's the other part of the problem -- maybe Google sees this "feature" as a competitive advantage, because it makes Google Groups and gmail work better together, and they don't see the benefit of making gmail play better with the rest of the world.
If you think it's worthwhile, you could always try turning on personalization for the list, and then add a footer with unique information per recipient. That would cause the message-id to be unique as well as the message body, and wouldn't require any new code to be developed.
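To illustrate why a per-recipient footer makes the body unique, here is a small Python sketch. It is not Mailman code: the personalize function and the footer wording are invented for the example, and whether Gmail's duplicate detection looks at the body at all is unknown.

```python
import hashlib


def personalize(body: str, recipient: str) -> str:
    """Append a hypothetical footer that differs for every recipient."""
    footer = f"\n-- \nThis copy was sent to {recipient}\n"
    return body + footer


def body_digest(body: str) -> str:
    """SHA-1 hex digest of the body, as a body-based dedup scheme might use."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()
```

Without the footer, every recipient's copy hashes to the same value; with it, each copy hashes differently, so anything that compares message bodies would no longer see the copies as duplicates.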
Brad Knowles <brad at shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>