Re: [Mailman-Developers] Python 3
Aurelien Bompard writes:
Barry writes:
One of the advantages of accessing the core through the REST API is that it doesn't matter what clients like HyperKitty and Postorius are written in.
I'm not exclusively using the REST API though. I'm importing a couple interfaces, mostly the archiver interface.
This was a design mistake, I think. An IArchiver really needs to be in core for two reasons: first, it needs to generate a permalink and attach it to the message as distributed. Second, it needs to associate that permalink with the message so the "real" archiver process will set that link correctly. For a production-quality archiver, that's *all* it should do. The archiver itself should be a separate process, receiving the message and permalink data by IPC.
@Barry: maybe rename IArchiver to IPermalinker? ;-)
Remember, Mailman core is going to be distributing Internet mail. Except in the case where 100.00% of the users are on one host, that's going to be the bottleneck on message processing. Archiving simply does not need to be fast. The archiver can implement LMTP even, although that would be overkill if we didn't already have an LMTP server. The simplest approach would be to simply put the Archive-Permalink header in the message and stream it to the archiver which would parse it out.
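As a rough illustration of the "put the header in and stream it" approach, here is a minimal sketch using only the standard library's email package. The function names are made up for this example, and how the bytes travel between the two processes is deliberately left out.

```python
from email import message_from_bytes
from email.message import Message

# Header name as used in the discussion above.
PERMALINK_HEADER = "Archive-Permalink"


def attach_permalink(msg: Message, permalink: str) -> bytes:
    """Core side (hypothetical helper): set the header, serialize for streaming."""
    del msg[PERMALINK_HEADER]  # drop any stale value; a no-op if absent
    msg[PERMALINK_HEADER] = permalink
    return msg.as_bytes()


def extract_permalink(raw: bytes):
    """Archiver side (hypothetical helper): parse the stream, read the header."""
    msg = message_from_bytes(raw)
    return msg, msg[PERMALINK_HEADER]
```

The archiver never needs to import anything from Mailman in this arrangement; it only has to understand the one header.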
I'm also using the custom Message class a lot in the tests.
Can't avoid that with Python 2, I guess, but using Message will be *so* much less painful with Python 3.
But I think the main problem is the import of mailman's config object in the class that implements the IArchiver interface. I don't believe there's another way to get the configuration.
If you need that configuration (which, come to think of it, you probably do, at least parts of it), you could have a private protocol for communicating it as metadata (a message header or as metadata in the stream).
And now that I think of it, the archiver interface will be imported by Mailman core, and will thus run with a Python3 interpreter. As a result, all of KittyStore must at least be Python3 compatible.
In the long run (ie, when nobody who's anybody uses Python 2 at all) I think everybody would be happier if you refactor to keep KittyStore at arm's length from Mailman core.
On Dec 27, 2014, at 12:42 PM, Stephen J. Turnbull wrote:
Remember, Mailman core is going to be distributing Internet mail. Except in the case where 100.00% of the users are on one host, that's going to be the bottleneck on message processing. Archiving simply does not need to be fast. The archiver can implement LMTP even, although that would be overkill if we didn't already have an LMTP server. The simplest approach would be to simply put the Archive-Permalink header in the message and stream it to the archiver which would parse it out.
Remember too that in MM3, messages only get fed to the registered IArchiver interfaces by a separate archive runner. So they aren't a bottleneck for delivery to the user, but on heavily trafficked sites, they can potentially consume a lot of resources if the archiver is local and relatively inefficient.
I tend to agree that a good design for any archiver is to be able to accept messages over an IPC channel. A site may for example want to run the core on one system and HK on another system (e.g. separate VMs perhaps). This would only really be possible if the core can feed HK messages over a configurable IPC. As I mentioned, I think LMTP *could* work, but REST (inside HK) could work too. Aurelien, what do you think?
I'm also using the custom Message class a lot in the tests.
Can't avoid that with Python 2, I guess, but using Message will be *so* much less painful with Python 3.
See my other follow up. I don't think there's much in the py3 branch's Message class that is going to be super helpful to HK. Even the core *could* use the standard email.message.Message class with a couple of utility functions, but it's just a little easier to add a few properties and methods in the subclasses.
In the long run (ie, when nobody who's anybody uses Python 2 at all) I think everybody would be happier if you refactor to keep KittyStore at arm's length from Mailman core.
Agreed, with of course the caveat that we'll need a thin HK IArchiver implementation in the core to generate the permalink and communicate with HK over IPC. Generally we want the permalink to be able to be generated without direct communication with the archiver (see the motivation for X-Message-ID-Hash), but if the core *has* to talk to HK to generate the permalink, then I don't think an LMTP channel will work. In that case REST or some homegrown protocol may be the answer.
Cheers, -Barry
Barry Warsaw writes:
Remember too that in MM3, messages only get fed to the registered IArchiver interfaces by a separate archive runner. So they aren't a bottleneck for delivery to the user, but on heavily trafficked sites, they can potentially consume a lot of resources if the archiver is local and relatively inefficient.
I'm talking about total load on the server host, not load on the Mailman subsystem. So I don't think the Mailman-to-archive function will consume many resources compared to delivery to subscribers if there are any remote users at all. A local archiver communicates at CPU-to-disk speed basically once or maybe twice as I understand it. The MTA resources for queuing alone will exceed and probably overwhelm this. Then there are the multiple Mailman queues, etc, etc.
Of course the *other* side of the archiver (the client access to the message store) can be extremely resource consuming. I'm just saying that in the grand scheme of message distribution (including to the archiver), the efficiency of a local archiver is not going to be a bottleneck.
In the long run (ie, when nobody who's anybody uses Python 2 at all) I think everybody would be happier if you refactor to keep KittyStore at arm's length from Mailman core.
Agreed, with of course the caveat that we'll need a thin HK IArchiver implementation in the core to generate the permalink and communicate with HK over IPC. Generally we want the permalink to be able to be generated without direct communication with the archiver (see the motivation for X-Message-ID-Hash),
By the way, I would say to adopt modern IETF practice here and drop the "X-" (in practice collisions are rare while the annoyance of fixing platforms to use the standardized name is frequent), and include the algorithm in the name. Eg, Message-ID-MD5 or Hashed-Message-ID-MD5. Or we could use the List-* namespace.
We should do this while we still can. :-) If you want I can try to write an RFC to make it official.
but if the core *has* to talk to HK to generate the permalink,
I personally don't think that is a good idea, but see below.
then I don't think an LMTP channel will work.
The only reason I can think of is that you want to check that the permalink isn't already occupied (that's the only thing HyperKitty proper knows that can't be computed the same way in the IArchiver as in HyperKitty proper AFAICS), and that can be implemented as follows
    Mailman> LHLO mailman-host
    HyperKitty> 250 OK
    Mailman> MAIL FROM Mailman@mailman-host
    HyperKitty> 250 OK
    Mailman> RCPT TO <permalink-variable-part>@archiver-host
    HyperKitty> 553 Permalink already occupied
    Mailman> RCPT TO <new-permalink-variable-part>@archiver-host
    HyperKitty> 250 OK
    Mailman> DATA
    HyperKitty> 354 Go for it!
and so on. I don't think this even violates the spirit of the LMTP protocol, but it certainly conforms to the letter as long as permalink variable parts are valid email localparts. (One could quibble about which 5xx response to give. AFAICS only "551 user not local" is out.)
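The decision logic behind that exchange can be sketched without any networking. Both function names below are hypothetical, and the "occupied" set stands in for whatever lookup the archiver would really perform; in a deployment the replies would arrive over an LMTP connection (the standard library offers smtplib.LMTP on the client side) rather than from a local function call.

```python
def rcpt_reply(localpart: str, occupied: set) -> str:
    """Archiver side: answer RCPT TO for a proposed permalink localpart."""
    if localpart in occupied:
        return "553 Permalink already occupied"
    return "250 OK"


def negotiate_permalink(candidates, occupied: set):
    """Core side: offer candidate localparts in order until one is accepted."""
    for localpart in candidates:
        if rcpt_reply(localpart, occupied).startswith("250"):
            return localpart
    # Every candidate was refused; the caller decides what happens next.
    return None
```

How the core generates the second and later candidates is an open design question that this sketch does not try to answer.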
My own preference is for a permalink that can be computed from the originator header data (author, recipients, date, message ID, subject) by anyone with access to the message, and that means you need the archive server to be able to deal gracefully with collisions. (In practice message IDs are not perfect UUIDs, although they're very close, and some messages don't have them or have different ones assigned by mediating hosts at arrival at multiple recipients.)
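One way to read this preference is sketched below. The choice of SHA-1 with Base32, the exact field list, the canonical form fed to the hash, and the "-1", "-2" suffix scheme for collisions are all assumptions made for illustration; nothing in this thread fixes them.

```python
import base64
import hashlib
from email.message import Message

# Assumed mapping of "author, recipients, date, message ID, subject" to fields.
ORIGINATOR_FIELDS = ("From", "To", "Cc", "Date", "Message-ID", "Subject")


def originator_permalink_id(msg: Message) -> str:
    """Hash the originator fields into an identifier anyone can recompute."""
    h = hashlib.sha1()
    for name in ORIGINATOR_FIELDS:
        for value in msg.get_all(name, []):
            line = name.lower() + ":" + str(value).strip() + "\n"
            h.update(line.encode("utf-8", errors="replace"))
    return base64.b32encode(h.digest()).decode("ascii")


def store_message(archive: dict, msg: Message) -> str:
    """Archive side: keep both messages when two different ones share an ID."""
    base = originator_permalink_id(msg)
    key, n = base, 1
    # An identical message already stored under this key is a re-delivery, not
    # a collision; a different message gets the next free suffixed key.
    while key in archive and archive[key].as_bytes() != msg.as_bytes():
        key = f"{base}-{n}"
        n += 1
    archive[key] = msg
    return key
```

Note that a suffixed key can no longer be computed from the headers alone, which is exactly the trade-off being argued about here.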
Steve
On Dec 27, 2014, at 03:57 PM, Stephen J. Turnbull wrote:
By the way, I would say to adopt modern IETF practice here and drop the "X-" (in practice collisions are rare while the annoyance of fixing platforms to use the standardized name is frequent), and include the algorithm in the name. Eg, Message-ID-MD5 or Hashed-Message-ID-MD5. Or we could use the List-* namespace.
We should do this while we still can. :-) If you want I can try to write an RFC to make it official.
I like the idea of putting this information in a List-* header, and I'll take you up on the RFC offer. Are you thinking about trying to push this through the IETF to make it official?
The spec currently lives on the wiki:
http://wiki.list.org/display/DEV/Stable+URLs
MM3 and HK should both be implementing this now, and I think mail-archive.com does too.
If we change the header name, I'd want to keep X-Message-ID-Hash for the MM3 final release, but deprecate it. I.e. MM3 would write both headers.
As for what the List-* header would be, well, if you wanted to include the algorithm name, to be completely accurate it would have to be something like List-Base32-Encoded-SHA1-Hash-Of-The-Message-ID. Yuck ;)
The value of this header both serves to uniquely identify the message in a more regular format, and to serve as the final path component in the Archived-At (RFC 5064) header. So the following names come to mind:
List-Message-ID List-Archive-ID List-Archived-At-ID
suggestions welcome.
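For readers who have not looked at the wiki spec, the calculation described above (a Base32-encoded SHA-1 hash of the Message-ID, reused as the last path component of the Archived-At URL) might look roughly like this. Stripping the angle brackets before hashing and the base URL layout are assumptions of this sketch; the wiki page is the authority on the details.

```python
import base64
import hashlib


def message_id_hash(message_id: str) -> str:
    """Base32-encoded SHA-1 digest of a Message-ID value."""
    # Assumption: surrounding whitespace and angle brackets are not hashed.
    bare = message_id.strip()
    if bare.startswith("<") and bare.endswith(">"):
        bare = bare[1:-1]
    digest = hashlib.sha1(bare.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")


def archived_at_url(base_url: str, message_id: str) -> str:
    """Use the hash as the final path component (base_url is hypothetical)."""
    return base_url.rstrip("/") + "/" + message_id_hash(message_id)
```

A SHA-1 digest is 20 bytes, so the Base32 form is always 32 characters with no padding, which is what makes the value regular enough to embed in a URL.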
The only reason I can think of is that you want to check that the permalink isn't already occupied (that's the only thing HyperKitty proper knows that can't be computed the same way in the IArchiver as in HyperKitty proper AFAICS)
Right. However, when this was discussed several years ago, the mail-archive.com guys did some extensive data analysis on their vast collection of email. You'd have to go spelunking in the -developers archives for details, but I recall that the collision rate was so small as to be effectively negligible, even more so if you ignore spam. And if the X-Message-ID-Hash collides, then the Message-ID will collide, and it's likely that any archiver would drop the message anyway.
My own preference is for a permalink that can be computed from the originator header data (author, recipients, date, message ID, subject) by anyone with access to the message, and that means you need the archive server to be able to deal gracefully with collisions. (In practice message IDs are not perfect UUIDs, although they're very close, and some messages don't have them or have different ones assigned by mediating hosts at arrival at multiple recipients.)
Right, we hashed (pun intended :) all this out years ago. We can ignore collisions, and we can do the entire calculation on the server side, using Message-ID as the sole input. I think the only issue that's worth reopening is the name of the header.
Cheers, -Barry
Barry Warsaw writes:
I like the idea of putting this information in a List-* header, and I'll take you up on the RFC offer.
OK.
Are you thinking about trying to push this through the IETF to make it official?
Yes. It will depend on how much resistance I get, but having it already implemented and used in Mailman will certainly help. On the other hand, there may be resistance on the basis that RFC 5064 already does everything that is "really" needed.
The spec currently lives on the wiki:
Yes, I'm a little bit familiar with that spec. :-)
If we change the header name, I'd want to keep X-Message-ID-Hash for the MM3 final release, but deprecate it. I.e. MM3 would write both headers.
I'll ask some of the IETF guys what they think about that. But if you put it in a public release, you're screwing the same kind of people Tanstaafl was talking about. Beta testers (and I mean beta testers, ie, people who have put the code in production even though it's not considered a public release) have signed up for this kind of annoyance. Random ancient Debian sysadmins haven't.
Of course we don't want to abuse our beta testers if we can avoid it, but I think if we don't want to maintain dual headers indefinitely, the public release is the time to get rid of the X- version.
As for what the List-* header would be, well, if you wanted to include the algorithm name, to be completely accurate it would have to be something like List-Base32-Encoded-SHA1-Hash-Of-The-Message-ID. Yuck ;)
We'd have to think somewhat carefully about how strong a hash we want to use if we don't specify algorithm in the field name. I'm not particularly concerned with how many bytes the header takes up. Future users can just deal with the implied BASE32 vs. BASE85 or whatever. However, if somebody thinks they need a stronger hash than we chose, we'll have interoperability problems for people who receive the message off-list.
The value of this header both serves to uniquely identify the message in a more regular format, and to serve as the final path component in the Archived-At (RFC 5064) header. So the following names come to mind:
List-Message-ID List-Archive-ID List-Archived-At-ID
suggestions welcome.
The last two are too easily confused with Archived-At.
Right. However, when this was discussed several years ago, the mail-archive.com guys did some extensive data analysis on their vast collection of email. You'd have to go spelunking in the -developers archives for details, but I recall that the collision rate was so small as to be effectively negligible,
Yes. The problem is that there are people out there with MUAs that provide bogus Message-IDs (Kyle Jones's VM used to do that), and for those people all messages after the first get dropped.
Note that if the server does indeed ignore the possibility of collisions on Message-ID, then there is no need (AFAICS) for the "thin" IArchiver to communicate with the archiver proper. I don't see how it hurts to provide for the possibility of an archiver that does check content.
Right, we hashed (pun intended :) all this out years ago. We can ignore collisions, and we can do the entire calculation on the server side, using Message-ID as the sole input. I think the only issue that's worth reopening is the name of the header.
Well, that's true for *us*. The folks at the IETF don't have a habit of leaving well enough alone, though. ;-)
On Dec 29, 2014, at 10:13 AM, Stephen J. Turnbull wrote:
If we change the header name, I'd want to keep X-Message-ID-Hash for the MM3 final release, but deprecate it. I.e. MM3 would write both headers.
I'll ask some of the IETF guys what they think about that. But if you put it in a public release, you're screwing the same kind of people Tanstaafl was talking about. Beta testers (and I mean beta testers, ie, people who have put the code in production even though it's not considered a public release) have signed up for this kind of annoyance. Random ancient Debian sysadmins haven't.
Of course we don't want to abuse our beta testers if we can avoid it, but I think if we don't want to maintain dual headers indefinitely, the public release is the time to get rid of the X- version.
I'd be willing to drop it if we can get agreement on the new header, and get buy-in from at least HK (abompard) and the mail-archive.com folks. AFAIK, they are the only two "clients" of the header atm. I'm not sure if the Jeffs are still reading this list, so I've CC'd them directly.
Jeffs: we are considering changing the X-Message-ID-Hash header name, at least dropping the X- prefix and possibly renaming the header.
As for what the List-* header would be, well, if you wanted to include the algorithm name, to be completely accurate it would have to be something like List-Base32-Encoded-SHA1-Hash-Of-The-Message-ID. Yuck ;)
We'd have to think somewhat carefully about how strong a hash we want to use if we don't specify algorithm in the field name. I'm not particularly concerned with how many bytes the header takes up. Future users can just deal with the implied BASE32 vs. BASE85 or whatever. However, if somebody thinks they need a stronger hash than we chose, we'll have interoperability problems for people who receive the message off-list.
Base 32 is a good trade-off between compactness and readability.
The value of this header both serves to uniquely identify the message in a more regular format, and to serve as the final path component in the Archived-At (RFC 5064) header. So the following names come to mind:
List-Message-ID List-Archive-ID List-Archived-At-ID
suggestions welcome.
The last two are too easily confused with Archived-At.
Suggestions welcome. :)
Right. However, when this was discussed several years ago, the mail-archive.com guys did some extensive data analysis on their vast collection of email. You'd have to go spelunking in the -developers archives for details, but I recall that the collision rate was so small as to be effectively negligible,
Yes. The problem is that there are people out there with MUAs that provide bogus Message-IDs (Kyle Jones's VM used to do that), and for those people all messages after the first get dropped.
As you know, I have limited tolerance for broken MUAs. Gosh, do people still use VM? :)
Note that if the server does indeed ignore the possibility of collisions on Message-ID, then there is no need (AFAICS) for the "thin" IArchiver to communicate with the archiver proper.
Right. MM3 does not currently reject messages with duplicate Message-IDs, but I think it should. I had a branch in flight that implemented this, but it caused some failures I wasn't able to resolve, and the branch bitrotted.
I don't see how it hurts to provide for the possibility of an archiver that does check content.
Right, we hashed (pun intended :) all this out years ago. We can ignore collisions, and we can do the entire calculation on the server side, using Message-ID as the sole input. I think the only issue that's worth reopening is the name of the header.
Well, that's true for *us*. The folks at the IETF don't have a habit of leaving well enough alone, though. ;-)
Right, so let's do what *we* think is right, right now, and let the committee take 10 years to define a standard. ;)
Cheers, -Barry
No Jeff-relevant discussion here, I think, so I'm not going to spam.
Barry Warsaw writes:
The last two are too easily confused with Archived-At.
Suggestions welcome. :)
When I have one, of course. But it's worth ruling out non-starters quickly, if possible.
Yes. The problem is that there are people out there with MUAs that provide bogus Message-IDs (Kyle Jones's VM used to do that), and for those people all messages after the first get dropped.
As you know, I have limited tolerance for broken MUAs.
As I also know, you don't usually impose your opinions on third parties with a different point of view. And the first mission of mail-related applications is to make sure mail gets to where it's supposed to go. Or do you want your name cursed in the same breath with AOL and "Yahoo!"? :-)
Gosh, do people still use VM? :)
Sure. The main difference between VM "virtual folders" and Gmail "labels" is that virtual folders actually do what labels are advertised to do. :-)
Right. MM3 does not currently reject messages with duplicate Message-IDs, but I think it should.
That sounds like a mess to me. For one thing, do you mean "reject" (and the sender gets a bounce) or "discard" (silently)? Neither of those sounds like a good thing to me.
I'd much rather that it reject messages with duplicate content and different Message-IDs. ;-)
Well, that's true for *us*. The folks at the IETF don't have a habit of leaving well enough alone, though. ;-)
Right, so let's do what *we* think is right, right now, and let the committee take 10 years to define a standard. ;)
Hey, that's *my* ten years you're offering there!
Thank you for the CC, it is appreciated.
We have no problem with changing, or even completely dropping X-Message-ID-Hash. Mail-archive.com doesn't look at it. Instead, we parse Message-Id and do calculations from there.
By the way, direct use of message-id seems to be growing in popularity. For example, I constantly see links to Debian's msgid-search. I think that trend is going to continue, boosted by the fact that it is easy to embed a link to a long URL in HTML mail.
Jeff
Am 27.12.2014 um 05:18 schrieb Barry Warsaw:
I tend to agree that a good design for any archiver is to be able to accept messages over an IPC channel. A site may for example want to run the core on one system and HK on another system (e.g. separate VMs perhaps). This would only really be possible if the core can feed HK messages over a configurable IPC. As I mentioned, I think LMTP *could* work, but REST (inside HK) could work too. Aurelien, what do you think?
How about a third option, a generic "pub/sub" or "event" archiver, implemented in py3? It would meet all the above criteria (archivers on other systems, no dependency on py3 -- or Python for that matter).
We could start supporting one backend (zeromq for instance) and maybe add more later.
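To make the idea concrete, here is a minimal sketch of what an "event" archiver could look like. The in-process bus below is a stand-in written for this example and is not ZeroMQ; the class names, the use of the list ID as the topic, and the method signature are all hypothetical rather than Mailman's actual IArchiver interface.

```python
from collections import defaultdict
from email.message import Message


class InProcessBus:
    """Stand-in for a real pub/sub backend such as ZeroMQ (not used here)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: bytes):
        for callback in self._subscribers[topic]:
            callback(payload)


class EventArchiver:
    """Hypothetical archiver that only announces messages and stores nothing."""

    name = "event"

    def __init__(self, bus):
        self._bus = bus

    def archive_message(self, list_id: str, msg: Message):
        # Subscribers receive raw RFC 5322 bytes, so they need not be Python.
        self._bus.publish(list_id, msg.as_bytes())
```

Swapping the stand-in bus for a networked backend would be what lets archivers run on other hosts or in other languages.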
Flo
On Dec 27, 2014, at 08:41 AM, Florian Fuchs wrote:
How about a third option, a generic "pub/sub" or "event" archiver, implemented in py3? It would meet all the above criteria (archivers on other systems, no dependency on py3 -- or Python for that matter).
We could start supporting one backend (zeromq for instance) and maybe add more later.
This would be pretty cool. Would you be interested in writing an IArchiver implementation for that?
Cheers, -Barry
Am 29.12.2014 um 00:27 schrieb Barry Warsaw:
On Dec 27, 2014, at 08:41 AM, Florian Fuchs wrote:
How about a third option, a generic "pub/sub" or "event" archiver, implemented in py3? It would meet all the above criteria (archivers on other systems, no dependency on py3 -- or Python for that matter).
We could start supporting one backend (zeromq for instance) and maybe add more later.
This would be pretty cool. Would you be interested in writing an IArchiver implementation for that?
Sure. I don't know how much interest there is out there in something like this. But I think I like the idea enough to just do it anyway. :-)
Florian
This was a design mistake, I think.
Yeah, I see that now. But at that time, if I had known that I had two more years (and counting) before MM3 was released, I'd probably have made different choices. I hope I'm not making the same mistake again.
As I mentioned, I think LMTP *could* work, but REST (inside HK) could work too. Aurelien, what do you think?
I'd go with REST, it seems more flexible and we already have nice libraries for it.
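A sketch of what the core-to-archiver hand-off over REST might look like, using only the standard library. The URL layout and the function name are invented for this example and are not HyperKitty's actual API; the request is only built here, not sent.

```python
import urllib.request
from email.message import Message


def build_archive_request(base_url: str, list_id: str, msg: Message) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the raw message."""
    # Hypothetical endpoint layout: <base>/lists/<list-id>/messages
    url = f"{base_url.rstrip('/')}/lists/{list_id}/messages"
    return urllib.request.Request(
        url,
        data=msg.as_bytes(),
        headers={"Content-Type": "message/rfc822"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would perform the actual request; authentication and retry handling are left out of the sketch.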
A.
participants (6)
- Aurelien Bompard
- Barry Warsaw
- Barry Warsaw
- Florian Fuchs
- Jeff Breidenbach
- Stephen J. Turnbull