Meeow miaou*
We spoke on IRC about the archiver the other day and I said that I
should present my thoughts about it here. So here they are (beware,
this might be long).
First I think we should think about the structure/architecture of
things. We have a number of components which need to be archive-aware;
without being exhaustive, I'm thinking about:
- the archiver itself (which presents the archive, ie: mails and threads)
- the NNTP bits which should be able to return emails and/or threads
- the stats module which wants to give information to the user about the
health of the list itself (emails/month, last threads, biggest
threads...)
- archives retrieval (we probably want to give the user a way to
download the archives since the creation of the list, or for the last
year/month)
All of these components need to be aware of the archives. We agreed
that the core does not want to know about them.
So we have several solutions:
- each module becomes an "archiver" with respect to the core, meaning
each module has its own way of storing the archives (and possibly its
own system for doing so)
- we create an archive-core module which manages the archives and
provides an API to access, modify and extend them.
Of course, we prefer the second solution :)
So we would have the following architecture:
mm-core (handles the lists themselves) --sends emails to archivers-->
archive-core (stores the emails and exposes them through an API) -->
archivers/stats/NNTP
The questions are then:
- how do we store the emails?
- how do we expose the API?
- how do we make it easy to extend? (ie: the stats module wants to read
the db, but probably also to store information in it; see the rough API
sketch just below)
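To make the discussion more concrete, here is what such an archive-core
API could look like. Every name and signature in this sketch is invented
purely for the sake of discussion, nothing here is decided:

# Purely illustrative sketch of an archive-core API.
class ArchiveCore(object):
    """Stores the emails and exposes them to archivers/stats/NNTP."""

    def add_message(self, list_name, message):
        """Store a raw email handed over by mm-core."""
        raise NotImplementedError

    def get_message(self, list_name, message_id):
        """Return a single email."""
        raise NotImplementedError

    def get_thread(self, list_name, thread_id):
        """Return all the emails of one thread."""
        raise NotImplementedError

    def search(self, list_name, sender=None, subject=None, content=None):
        """Search the archives (what HK already does today)."""
        raise NotImplementedError

    def annotate(self, list_name, message_id, key, value):
        """Let other modules (stats, tags...) attach extra data."""
        raise NotImplementedError

The idea being that the archivers, the stats module and the NNTP bits
all go through this one interface instead of each rolling its own
storage.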
Having played with mongodb (HK relies on it atm), I quite like the
possibilities it gives us. We can easily store the emails in it and
query them, and since it is a NoSQL database system, extending it is
also easy.
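For example, with a recent pymongo the basic store/query/extend needs
could look like the sketch below. The collection and field names are
invented for the example, this is not a schema proposal:

# Sketch only -- invented collection and field names.
from pymongo import MongoClient

client = MongoClient()                      # local mongod
emails = client['mailman_archives']['emails']

# Store a post handed over by mm-core.
emails.insert_one({
    'list_name': 'list@example.com',
    'message_id': '<some-message-id@example.com>',
    'sender': 'Anne Person <aperson@example.com>',
    'subject': 'An example subject',
    'content': 'An example body...',
})

# The stats module can query the very same collection...
per_sender = emails.count_documents({'sender': {'$regex': 'aperson'}})

# ...and, the store being schema-less, attach its own data to a message.
emails.update_one({'message_id': '<some-message-id@example.com>'},
                  {'$set': {'tags': ['example']}})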
On the other hand, having the archive-core rely on the same system
as the core itself would be nicer from a sysadmin pov. I have not tried
to upload archives to an RDBMS and test its speed, but for mongodb the
results of the tests are presented at [1].
The challenge will be speed, and designing an API which allows each
component to do its work.
I think it would be nice if we could reach some kind of agreement
before the GSoC starts (even if we change our mind later on), to be
sure that if we get students their work doesn't overlap too much.
The second point I want to present is with respect to the archiver
itself.
At the moment we have HyperKitty (HK); the current version:
- exposes single emails
- exposes single threads
- presents the archives for one month or day
- allows searching the archives by sender, subject, content, or
subject and content
- presents a summary of the recent activities on the list (including
the evolution of the number of posts sent over the last month)
I think this is the basic functionality that we would like to see in
an archiver.
But HK aims at much more. The ultimate goal of HK is to provide a
"forum-like" interface to the mailing-lists; with it, HK would provide
a number of social-web-like options allowing users to "like" or
"dislike" a post or a thread, to "+1" someone, and to tag the mails or
assign them to categories.
These are all nice features but, imho, they go beyond what one would
want from a basic archiver.
So what I would like to propose is to split out of HK a sub-project
(MiniKitty?) which would provide this basic functionality.
We would keep HyperKitty as the more extensive archiver and try to
bring HK to its ultimate goal. This will need some more work and time,
as we will have to make HK speak with the core for authentication,
find a way to send emails to the core/the lists and of course add all
the other features (tags, categories...).
Comments welcome :)
Thanks,
Pierre
[1]
http://blog.pingoured.fr/index.php?post/2012/03/16/Mailman-archives-and-mon…
* Hi everyone
The announcement from Google went out on Monday, so I'm a bit slow, but
I'd like to congratulate the three students who will be working with GNU
Mailman as part of Google Summer of Code 2012:
Aamir Khan will be working on improving Hyperkitty.
Alexander Sulfrian will be working on NNTP access to the archives.
George Chatzisofroniou will be working on metrics, integrating his
MailmanStats package with Mailman 3.0.
Congratulations to all of you, and thank you to all our other talented
applicants this year. It was a very tough field and we were very
lucky to get three slots allocated to us by our umbrella organization,
the Python Software Foundation. I believe this is the largest number of
students we've ever had!
We're now in what Google calls the "community bonding period" where we
get to know each other and get things set up. Students: this would be
a great time to post a short email to mailman-developers reminding
everyone about what you're planning to do this summer (Google only gives
the short abstract in their public list). I know the mentors are pretty
excited about the projects, and this gives everyone else a chance to get
excited too!
Coding starts on May 21st. So be prepared: this is going to be one
amazing year for Mailman!
Terri
I apologize, the simulation code had a flaw. I'm embarrassed that I
didn't immediately recognize this from intuition. We could get even
more accurate results by computing actual SHA-1 hashes of actual
message-ids, but I'm not sure it is worth the effort. Here is a
revised program and the results from one run. For the record, I'm
interested in billion+ message counts, for which we'd need a few more
hash characters. But not that many.
message count 10000000, hash length 4, collisions 99.992930%
message count 10000000, hash length 5, collisions 25.755240%
message count 10000000, hash length 6, collisions 0.928050%
message count 10000000, hash length 7, collisions 0.029860%
message count 10000000, hash length 8, collisions 0.000800%
message count 10000000, hash length 9, collisions 0.000060%
message count 10000000, hash length 10, collisions 0.000000%
===
#!/usr/bin/python
import random

def compute(message_count, hashlength):
    # Throw message_count random "hashes" into roughly 2**(5 * hashlength)
    # possible values (base32 gives 5 bits per character).
    database = {}
    for i in range(message_count):
        n = random.randint(0, pow(2, 5 * hashlength))
        if n in database:
            database[n] += 1
        else:
            database[n] = 1
    # Count every message that ended up sharing a value with another one.
    collisions = 0
    for i in database:
        if database[i] > 1:
            collisions += database[i]
    p1 = (100.0 * collisions) / float(message_count)
    print("message count %d, hash length %d, collisions %f%%" %
          (message_count, hashlength, p1))

for i in range(4, 11):
    compute(10000000, i)
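For what it's worth, these figures line up with the birthday
approximation: with N = 2**(5 * hashlength) possible values, the
expected fraction of messages that end up sharing a value is roughly
1 - exp(-message_count / N). A quick cross-check:

# Analytic cross-check of the simulation above.
import math

message_count = 10000000
for hashlength in range(4, 11):
    n_values = 2 ** (5 * hashlength)
    expected = 100.0 * (1.0 - math.exp(-float(message_count) / n_values))
    print("hash length %d, expected collisions %f%%" % (hashlength, expected))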
Thanks!
On Wed, Apr 25, 2012 at 2:50 AM, Jeff Breidenbach <jeff(a)jab.org> wrote:
>> 0. Assume a 10 million message archive.
>> 1. What percentage of permalinks need another click?
>> 2. What percentage of permalinks will result in a list of more than 10 matches?
>
> Ignoring cross posts, for a 4 character hash:
>
> 1. Approximately 90%
> 2. Approximately 50%
Too high.
> Ignoring cross posts, for a 13 character hash:
>
> 1. Effectively 0%
> 2. Effectively 0%
Way too low because of "7 plus/minus 2". As a strawman suggestion,
I'd say that once you get down to 5% and 0.1% respectively, you're
done because in many contexts cross-posts will bring you up by a
similar order of magnitude anyway.
For a five-character hash, I get 13% and 0%, and for six, 0.4% and 0%.
I think 1 in 8 messages is probably more than I want to ask users for
an extra click, especially if cross-posts are kicking in substantially
too. So six looks to be the sweet spot.
So I'm convinced that a shorter hash makes a lot of sense.
Steve
> 0. Assume a 10 million message archive.
> 1. What percentage of permalinks need another click?
> 2. What percentage of permalinks will result in a list of more than 10 matches?
Ignoring cross posts, for a 4 character hash:
1. Approximately 90%
2. Approximately 50%
Ignoring cross posts, for a 13 character hash:
1. Effectively 0%
2. Effectively 0%
Pick message count and collision tolerance, and hash size will follow.
-Jeff
========== simulation code
#!/usr/bin/python
import random

hashlength = 4
message_count = 10000000

# Throw message_count random "hashes" into roughly 2**(5 * hashlength)
# possible values (base32 gives 5 bits per character).
database = {}
collisions = 0
for i in range(message_count):
    n = random.randint(0, pow(2, 5 * hashlength))
    if n in database:
        collisions += 1
        database[n] += 1
    else:
        database[n] = 1

over_ten_collisions = 0
for i in database:
    if database[i] > 10:
        over_ten_collisions += database[i]

p1 = (100.0 * collisions) / float(message_count)
p2 = (100.0 * over_ten_collisions) / float(message_count)
print("Percent collisions %f" % p1)
print("Percent over ten collisions %f" % p2)
On Tue, Apr 24, 2012 at 1:40 PM, Jeff Breidenbach <jeff(a)jab.org> wrote:
>>Is 4 bytes too short?
>
> Four characters is only about a million combinations. First collision
> is 50% likely at 1200 messages, and multi-million message databases
> are completely screwed.
If we're willing to impose disambiguation on the user (and the ability
to find and report all matching messages in the UI), then the questions
to me would be:
0. Assume a 10 million message archive.
1. What percentage of permalinks need another click?
2. What percentage of permalinks will result in a list of more than 10 matches?
Rationale for 0: 10 related lists X 20 years X 365 days X 100
messages/day. I can imagine people wanting to index into such a
corpus.
Rationale for 1: Obvious, I hope.
Rationale for 2: Maybe I'm just getting old, but that's the number of
lines I can comfortably scan in a glance. FVO of "10" that suit you,
I guess.
Note that, like Barry, I'm assuming disambiguation will be needed for
x-posts in any case. WDOT?
>
> Bottom line: how big a database do we expect to have, and amongst
> those messages, how many collisions are considered acceptable?
>
> -Jeff
>
> PS. These numbers assume a well balanced hash. This paper suggests
> SHA-1 is pretty good in non-adversarial situations, but I'm not an
> expert. http://cseweb.ucsd.edu/~mihir/papers/balance.html
I want to step back for a moment and look at some fundamentals. I'll reply to
other messages in this thread later, but in the context of my own thoughts
expressed here.
TL;DR: I'm going to propose we keep the hash algorithm to include only the
Message-ID as input.
0) Mailman core doesn't care what the algorithm is for calculating a permalink
to the message in the archiver. All it cares about is that this can be
calculated using local information only, with no round-trip to the
archiver. Local information includes system configuration values, mailing
list settings, and information available in the message being posted.
1) Multiple archivers can be enabled in any running Mailman core system, and
these do not have to agree on the permalink calculation. Every archiver is
free to use whatever algorithm it wants.
2) RFC 2369 and RFC 5064 rule us. This means that Mailman core will add a
List-Archive header, which RFC 2369 defines as the "field describ[ing] how
to access archives for the list". This header does not point to a specific
message in the archive, but instead to the list's archive as a whole.
RFC 5064 defines the Archived-At header which "refer[s] to the archived
form of a single message." If you get a message from Mailman which
contains an Archived-At header, you should be able to click on that to view
the message in the archive. It's this that I'm calling the 'permalink' to
the message, and which must be calculated without round-tripping to the
archiver.
So what is this hash thing and why do we need it? Well, strictly speaking, we
don't. If UltraKitty wants to define the permalink as the URL-encoded
Message-ID concatenated with the List-ID and Date, that's fine by Mailman. As
long as item 0 above is satisfied, the core is happy.
Where I believe the hash is useful is by providing a more human-friendly
string for *a* permalink to the message. It doesn't have to be the only one;
it's just a convenience that you could imagine me reading to you over the
phone or typing into an SMS with my stubby old bass player fingers.
If you think of an archiver in REST terms, the RFC 5064 header value is just
one location for the resource (i.e. the message you care about) in the
archiver. That same resource can have many different addresses; maybe you can
look it up by raw Message-ID, URL-encoded Message-ID, permalink hash, or
whatever. All roads lead to the same resource in the archiver. The permalink
hash isn't required for any of this to work, it's purely a convenience.
It doesn't even matter if the permalink points to multiple resources.
Let's say a message gets cross-posted so that multiple copies of it show up at
the archiver with the same Message-ID. The archiver can certainly treat these
as separate resources, living at different canonical locations in its resource
tree. But it could *also* honor a permalink that is identical for both
messages. If you think of this as a tiny url for a search query, that could
return multiple hits, each of which would be the different versions of the
cross-posted message.
I could imagine a better UI though. Let's say this message got cross-posted:
From: Anne Person <aperson(a)example.com>
To: ant(a)example.org, bee(a)example.org
Subject: Ants and Bees are best friends!
Message-ID: <alpha>
Why should we fight? The mosquitoes are our common enemy!
Now, if all we use is the Message-ID to calculate the permalink, you might see
both messages delivered to both mailing lists with the following RFC 5064
header:
Archived-At: http://lists.example.org/XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP
You click on that url and you're taken to a page which contains the archived
message for one of the mailing lists (it doesn't matter which one), but you
see a little extra link on the page:
View cross-posted thread in [ant(a)example.org] or [bee(a)example.org]
and those two links take you to the separate messages, at their canonical
locations, in the thread appropriate for one or the other mailing list.
I'll note that in another message, Jeff advocates for using something shorter
than 32 bytes for the hash, and letting collisions just work themselves out.
Which frankly would be fine by me, but I see that as a very similar problem to
the cross-posting problem. If the header said this instead:
Archived-At: http://lists.example.org/4DNQHTCP
clicking on it might bring you to a page like this:
Did you mean:
* [Ants and Bees are best friends] in [ant(a)example.org]
* [Ants and Bees are best friends] in [bee(a)example.org]
* [Mosquitoes unite!] in [bloodsuckers(a)example.org]
* [What about us dung beetles?] in [pests(a)example.org]
Notice I haven't advocated for a particular hash algorithm, because for my
purposes right here, it doesn't matter. What I do feel strongly about is that
the input to that hash should only include information directly available in
the originally posted message, e.g. Message-ID. That way, if I get a copy
from the mailing list, but you get a copy directly, we both have (almost[*])
all the information we need to calculate the same RFC 5064 URL.
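Just to illustrate (this is not me picking the algorithm; the function
name and the bracket-stripping are choices made only for this sketch),
a Message-ID-only calculation can be as small as:

# Illustration only: one possible Message-ID-only permalink calculation.
import hashlib
from base64 import b32encode

def permalink(archive_base, message_id):
    msgid = message_id.strip().strip('<>')
    # A 160-bit SHA-1 digest base32-encodes to exactly 32 characters.
    digest = hashlib.sha1(msgid.encode('utf-8')).digest()
    return '%s/%s' % (archive_base, b32encode(digest).decode('ascii'))

# Both the list copy and the direct copy of a message carry the same
# Message-ID, so both sides compute the same URL with no round-trip to
# the archiver.
print(permalink('http://lists.example.org', '<alpha>'))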
Cheers,
-Barry
[*] The one missing piece for the off-list copy is the value of the
List-Archive header. If you can't find that out any other way, you're
screwed, but my guess is that in practice, it will be easy to find that out.
Or you can just Google the permalink hash or Message-ID to find the message.
On Apr 23, 2012, at 02:33 PM, Stephen J. Turnbull wrote:
>What are you doing hanging out in e-mail circles, then? ;-) It has to
>be the most prominent example of a computing field where smtplicity
>has led to a disaster, and it has never been easy to understand!
For some reason, I have Ren and Smtpy on the brain today. :)
>I dunno. It seems to me that a lot of the lists where the users and
>admins would not care about flexibility, machine-friendliness, and
>continuity are also good candidates for moving to web forums in any
>case. (Yeah, I know that Barry wants to kill web forums; but if so,
>those users are going to have to coexist with mine!)
Well, Paul Graham made me want to kill email, forums, archives, newsgroups,
and IMAP. I don't know how to do it, but be on notice you old <wink> media,
we're gunnin' for ya.
>I don't find hashes (or most message IDs) to be human-friendly[1] at
>all, and (if it's not going to be a tinyurl) I really only want the
>line containing the URL to be less than 78 characters so I can be
>pretty sure nothing is going to try to insert a linebreak. I guess
>we're just going to have to agree to disagree on a lot of things. ;-)
FWIW, a tinyurl would be very cool, but I don't know how to calculate those
without state, and state probably requires communication between the archiver
and Mailman.
>Bottom line: I don't have a problem with you having your preferences,
>and if you "win" I can work around it, but I do have multiple use
>cases for List-Id != List-Post that I raise for consideration by the
>group.
>
>[1] FVO "human" that includes a larger population than "geeks like
>me"! My MUA is set to display Message-Id and References by default!
Mine too. :)
-Barry
On Apr 22, 2012, at 03:31 PM, Jeff Breidenbach wrote:
>I find List-Id annoying because I like the world to be simple and easy to
>understand. People who know nothing about RFCs naturally consider the
>posting address to be the canonical name of a mailing list. We should be
>embracing that.
In theory, I agree, which is why you'll see the posting address used as the
primary key into the mailinglist table in the database. You'll almost always
see that in the To: header, although some situations might cause mismatches,
e.g. acceptable aliases. According to the language in RFC 2369, this also
(only!) may show up in the List-Post header for the from-list copy of the
message, but you might find the List-Post has nothing useful, e.g. "NO" (with
some random goo of trailing comment). List-ID will always be included in the
from-list copy.
>Instead, RFC2369 introduces this entire alternate namespace with List-Id,
>competing for attention, with its own weird rules like the domain-control one
>quoted earlier in this thread. All this confusion, and the main problem it
>tried to address isn't very important. Is it really such a disaster for a
>list to be considered different if it hops to a new domain? I don't think so,
>or there would be a lot more clamoring for editable List-Id in
>mailman. Archival services certainly don't need it. It smells like design by
>committee where everyone's pet feature for a rare use case gets added in,
>without appreciating the benefits of small and simple and
>less-stuff-is-better.
>
>Regarding hashes, the whole point of an archival hash is to make a shorter,
>human-friendly URL. This is not very hard to implement; one can take the SHA-1
>and truncate it. If we aren't worried about length then Message-Id is a
>perfectly usable identifier.
Well, Message-ID also fails the "human friendly" part, e.g. from your message:
Message-ID:
<CAHjiUboZzYAE1dgApzfFNMqv_X+X6xA9E1bOc3F+mQ3qcTeeeQ(a)mail.gmail.com>
My fingers already hurt typing that into my spell-correcting smartphone. :)
FWIW, Base 32 was deliberately chosen because it's case insensitive, and
allows for optional mapping of commonly mistaken substitutions (e.g. zero for
oh, and one for eye or el). IOW, it's much more forgiving of the human part
of the process.
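To illustrate, the stdlib's base64.b32decode() can already do exactly
that kind of forgiving decoding. The bytes below are made up just so
the encoded form contains an 'O' and an 'I':

# Made-up bytes whose base32 form happens to be all O's and I's.
from base64 import b32encode, b32decode

chunk = b'\x72\x1c\x87\x21\xc8'
assert b32encode(chunk) == b'OIOIOIOI'

# Someone reads that over the phone and you type zeros, ones and lower
# case; a forgiving decoder still recovers the original bytes.
assert b32decode('0101oioi', casefold=True, map01='I') == chunk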
>Certainly no need for a triumvirate of short hash, long hash, and
>message-id. Less is better.
Agreed. Is 32 bytes too long? Is 4 bytes too short? What's an acceptable
trade-off between collision likelihood and human convenience?
-Barry
On Apr 20, 2012, at 01:19 PM, Jeff Breidenbach wrote:
>1) Terri is exactly right. The reason for including list identity as
>part of the hash calculation is for cross-posted messages. An
>archiving service shows context. Here's the message AND the thread it
>fits into, AND information about the list it travelled over AND the
>ability to search that list further. Archives need to know the list to
>provide context.
Agreed, but I think you'll get all that information anyway, without it being
expressed in the hash. You'll get a full copy of the posted message, so
you'll get the Message-ID, To header (i.e. the posting address), List-Post (if
there is one), List-ID, etc.
>2) The reason mail-archive.com uses List-Post and not List-Id in the
>calculation is because every list, RFC2369 compliant or not, has a
>concept of a posting address. It is a natural idea, easy to think of and
>understand. Hence all mail-archive.com archives are keyed off of
>posting address. It would be technically possible (but an architectural
>pain) for mail-archive.com to calculate using List-Id. We'd probably
>not bother and instead store whatever was calculated by mailman and
>placed in the Archived-At: header. Okay, I'll admit my prejudice. I've
>always found List-Id annoying, and wish that it didn't exist.
Note that the message you receive may not have a useful List-Post header at
all! From RFC 2369:
3.4. List-Post
The List-Post field describes the method for posting to the list.
This is typically the address of the list, but MAY be a moderator, or
potentially some other form of submission. For the special case of a
list that does not allow posting (e.g., an announcements list), the
List-Post field may contain the special value "NO".
(I think neither mm2 nor mm3 does this right. See LP: #987563)
>3) As long as things are changing, I want to mention that these URLs
>feel too long. SHA-1 is a 160 bit hash consuming 32 URL characters. I
>think trimming to a 64 bit (13 character) hash is plenty. According to
>wikipedia collision tables, with the shorter hash we'd expect to get
>our first collision after archiving 5 billion messages. That's 50X the
>current corpus size of public archival services like GMane. And it
>isn't like an occasional hash collision is a big deal or a security
>problem. http://en.wikipedia.org/wiki/Birthday_attack
Let's say we take the lower 80 bits of the SHA1. After base32 encoding, that
leaves us with 16 characters. Of course, we could also use the full 160 bit
SHA1 hash, and take only the lower X number of characters after the base32
encoding.
I'm all in favor of a shorter URL, but someone with better Maths-Fu will have
to propose a specific algorithm that adequately trades off collisions for
human-friendliness. Also, note the implications of increased collisions on
the whole argument, which I brought up in my previous message.
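Just to illustrate the two options (not a proposal for the final
algorithm; the message-id here is made up):

# Illustration only: two ways of shortening the full hash.
import hashlib
from base64 import b32encode

digest = hashlib.sha1(b'some-message-id').digest()     # 20 bytes / 160 bits

# Option 1: keep the lower 80 bits, then base32-encode -> 16 characters.
short_from_bits = b32encode(digest[-10:]).decode('ascii')

# Option 2: base32-encode the full digest, then keep the last 13
# characters (Jeff's 64-bit-ish suggestion).
short_from_chars = b32encode(digest).decode('ascii')[-13:]

print(short_from_bits)
print(short_from_chars)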
>3b) For that matter, a sequence number would also do the trick, but I
>can understand that this is much more dangerous; it is easy for a
>sequence number to get reset and cause all hell to break loose.
It would also be nearly impossible to preserve the zeroth principle, that
Mailman and the archiver can agree on the permalink for a message with no
communication between them.
>4) I'm really not that picky. Our archival service could deal with all
>sorts of URLs, including the ones Terri was trying to avoid, such as
>http://example.com/archiver/listname.example.com/$hash
>In fact, we've found that lots of small, per-list databases have speed
>and reliability advantages over big global databases. But I also like
>short URLs. Bottom line, please don't let these comments delay or
>derail forward progress.
No worries! We'll hash (pun intended ;) this out in plenty of time before 3.0
final. With Richard's suggestion of a version number, we could even roll out
updates in future versions, although it would probably be more of a PITA for
you by then, than us. :)
Cheers,
-Barry