[Twisted-Python] clustering or process group replication
Has any work been done to enable process group communication similar to what Spread (spread.org), Ensemble, Horus, Isis, Eternal etc. offer? - Laran
On Thu, 2005-05-26 at 12:10 -0400, Laran Evans wrote:
Has any work been done to enable process group communication similar to what Spread (spread.org), Ensemble, Horus, Isis, Eternal etc. offer?
I've built a reliable multicast (1->N ordered reliable message delivery, NACK-based, with congestion control) system for work using Twisted and the Fusion C++ Twisted bindings (http://itamarst.org/software/), but it is not open source. -- Itamar Shtull-Trauring <itamar@itamarst.org> http://itamarst.org
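Itamar's system itself is closed source, but the receiver side of a NACK-based, ordered reliable multicast scheme like the one he describes can be sketched in a few lines. Everything below, class and method names included, is illustrative and not taken from his implementation:

```python
# Sketch of the receiver side of a NACK-based reliable multicast
# protocol: packets carry sequence numbers, gaps trigger NACKs, and
# delivery to the application is strictly in order.

class OrderedReceiver:
    def __init__(self):
        self.next_seq = 0          # next sequence number to deliver
        self.pending = {}          # out-of-order packets held back
        self.delivered = []        # in-order application delivery
        self.nacks = set()         # sequence numbers already re-requested

    def on_packet(self, seq, payload):
        """Handle one multicast packet; return new NACKs to send, if any."""
        if seq < self.next_seq or seq in self.pending:
            return []              # duplicate, ignore
        self.pending[seq] = payload
        # Any gap between next_seq and this packet needs retransmission.
        new_nacks = [s for s in range(self.next_seq, seq)
                     if s not in self.pending and s not in self.nacks]
        self.nacks.update(new_nacks)
        self._drain()
        return new_nacks

    def _drain(self):
        # Deliver consecutive packets to the application, in order.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.nacks.discard(self.next_seq)
            self.next_seq += 1
```

A real implementation would also batch and suppress redundant NACKs and run congestion control; this shows only the sequencing and gap-detection core.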
On 27/05/2005, at 2:34 AM, Laran Evans wrote:
Useful? Wow, this is *exactly* what we are after for use within the Access Grid (www.accessgrid.org). This community has invested a lot of time in getting multicast connectivity between the 200+ sites currently involved, i.e., we've got the network backbone. At the moment we are only using multicast for the RTP media transport (the old mbone tools vic and rat, and the newer VideoPresence tool VP). It would be great, though, to have a framework for reliable message delivery over multicast for some of the lighter-weight collaborative tools we are building. At the moment, AccessGrid uses a centralized VenueServer and TCP for coordinating sessions. Cheers, darran. Darran Edmundson (darran.edmundson@anu.edu.au) ANU Supercomputer Facility Vizlab, Australian National University, Canberra, ACT 2600, tel: +61 2 6125-0517, fax: +61 2 6125-5088
Hmmm. Well, it sounds like there's a bit of interest here. I'm game for starting a discussion about what such a toolkit might look like, if anyone else is interested. There are plenty of toolkits that do this sort of thing, but I'm interested in building one from scratch, and it seems like there's already a lot in Twisted to plug into. If any of the Twisted admins have suggestions, objections, or otherwise, they would be appreciated. If you're not interested in building something like this into Twisted, it can always be done in parallel. - Laran Evans
On 27/05/2005, at 11:52 AM, Laran Evans wrote:
Our guys here at the ANU Supercomputer Facility are using "MCLv3: an Open Source Implementation of the ALC and NORM Reliable Multicast Protocols": http://www.inrialpes.fr/planete/people/roca/mcl/ldpc_infos.html in their software for replication of kernel images out to the 1600 processors on our new SGI Altix system. They are really happy with the library. On my personal to-do list is creating python bindings for MCLv3 which I then planned to use with twisted for some Access Grid work ... Cheers, darran. Darran Edmundson (darran.edmundson@anu.edu.au) ANU Supercomputer Facility Vizlab Australian National University, Canberra, ACT 2600 tel: +61 2 6125-0517 fax: +61 2 6125-5088
On 27/05/2005, at 12:46 PM, Laran Evans wrote:
Is MCLv3 also being used for the accessgrid work?
No, that's what I want to implement. In the current model, the VenueServer acts as a common data store. Fine for little things but obviously doesn't allow impromptu "ah, here's a 100Mb dataset for everyone to preview".
I'm curious. Why was MCLv3 chosen?
By virtue of being the first library they found that fulfilled their needs ;-) Our guys have a 400Mb compressed node image that they want to distribute. It is transmitted in 1k packets (the FEC algorithm takes a list of pointers, so there is no need to actually break the data up), with each 32Mb chunk being verified against a SHA1 hash. Note, there is *no* feedback from the clients; we just set a duplication factor of 1.5, a number empirically determined to give reliable transfer on the local network. Cheers, Darran.
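The verification scheme Darran describes (fixed-size chunks, each checked against a SHA1 digest computed by the sender, with no client feedback) is easy to sketch. Chunk sizes below are tiny for illustration; the real deployment uses 1k packets and 32Mb chunks:

```python
# Sketch of per-chunk SHA1 verification for a one-way (feedback-free)
# bulk transfer. Sizes are deliberately tiny for demonstration.

import hashlib

CHUNK_SIZE = 32  # bytes here; 32Mb in the real deployment

def chunk_hashes(data):
    """Sender side: SHA1 digest for each fixed-size chunk."""
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def verify_chunks(data, expected):
    """Receiver side: return the chunk indices that failed verification."""
    actual = chunk_hashes(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A failed index tells the receiver which chunk to recover from the redundant (duplication-factor 1.5) transmission rather than asking the sender to retransmit.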
What would be the steps involved in integrating a few of the various protocols involved here? Is there a chance of having a common API, or do the "large blocks on the Internet" variants need something different from something optimized for "small blocks on the LAN"? I don't have a really good notion of all these things, but I'm considering trying to integrate the spread.org stuff into Twisted to achieve clustering of servers into a logical virtual server, for redundancy. There might be a simpler way to do the same thing; I'm still exploring. Does anyone know of an example API for integrating several of the protocols into the same framework? How do these types of protocols break down into categories? --Ian
Well, there are a couple of things to consider. Here are a few.

1. How many processes are in the group. Most protocols work well in groups of up to a few hundred. When you get over a thousand or so, things start to get complicated, and above that there are protocols which use epidemic techniques, so-called "gossip"- or "rumor"-based protocols. Bimodal multicast is one such protocol.

2. Network environment. Protocols that run exclusively on a LAN can make certain assumptions that WAN-inclusive protocols can't. So some protocols shift things around a bit, intentionally slowing things down to accommodate the WAN nodes; the whole system runs some factor slower, but reliability is maintained.

Beyond that there are all sorts of techniques, each of which makes certain assumptions. So which protocol to utilize depends on what type of environment you want to create for applications. Spread's definitely a solid choice. JGroups, Ensemble, Horus, Isis, Totem, Transis and Eternal are all top-notch as well, each providing slightly different properties. There has been some talk lately on the JGroups mailing list about adding Python extensions to JGroups. JGroups is what's used for clustering in Tomcat and JBoss, I believe, though I could be wrong about that. There's one design in which I'm quite interested, initially proposed for an international stock exchange; you can read about that protocol here: http://portal.acm.org/citation.cfm?doid=380749.380771. This particular design is built specifically to work on the Internet and is based on TRMP. I hope this helps. - Laran Evans
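The epidemic idea is simple enough to simulate: each round, every node that already has the message forwards it to a few randomly chosen peers, so coverage grows roughly exponentially, which is why gossip protocols scale past the thousand-node mark. This toy sketch is not bimodal multicast or any specific protocol, just the core dissemination step:

```python
# Toy simulation of epidemic ("gossip") message dissemination.
# Not a real protocol: no failure model, no anti-entropy phase.

import random

def gossip_rounds(n_nodes, fanout, seed_node=0, rng=None):
    """Simulate rounds until every node has the message; return the count."""
    rng = rng or random.Random(42)     # fixed seed for reproducibility
    infected = {seed_node}             # nodes that have the message
    rounds = 0
    while len(infected) < n_nodes:
        rounds += 1
        for node in list(infected):
            # Each informed node gossips to `fanout` random peers.
            for peer in rng.sample(range(n_nodes), fanout):
                infected.add(peer)
    return rounds
```

With a fanout of 3, a thousand nodes are typically covered in well under a dozen rounds, each node sending only a constant number of messages per round.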
Well, there are a couple of things to consider. Here are a few.
Thanks, this is great.
I just took a peek at JGroups. It looks like their API includes a lot of switches you can tweak. Would that be a good starting point? It looks like it's used in some significant stuff: http://www.jgroups.org/javagroupsnew/docs/success.html. Their overview page (http://www.jgroups.org/javagroupsnew/docs/overview.html) makes a decent case for their approach. I wonder if a mixin/layered approach to building protocols adds much overhead?

- Transport protocols: UDP (IP multicast), TCP, JMS
- Fragmentation of large messages
- Reliable unicast and multicast message transmission; lost messages are retransmitted
- Failure detection: crashed members are excluded from the membership
- Ordering protocols: atomic (all-or-none message delivery), FIFO, causal, total order (sequencer- or token-based)
- Membership
- Encryption

As for the others, I would need to read more about them. Do you know of any sort of comparison matrix that includes these projects? Ensemble is work that happened after Horus, right? Does that mean it's more complete/featureful and should be considered over Horus? I.e., has Horus added new things since they split? --Ian
Ian Duggan wrote:
There is a paper out there that claims each layer you add to a system adds 10% overhead. By that rationale, stacking protocols is scalable only to a certain point. But what that paper is really saying is that adding a programming language on top of the underlying hardware adds 10%, which is pretty silly because, yes, you have to add overhead to provide a whole new set of capabilities; without the capabilities provided by the programming language, the underlying hardware wouldn't be worth much. So I don't think layering protocols adds much overhead at all. But there are better and not-so-good ways to do it, and I'm not a huge fan of the way JGroups, and in fact Horus and Ensemble, do it: basically the Lego-block approach, where a given layer can only be stacked on top of or below certain other layers. You can't just stack anything on top of anything else, so there's a certain learning curve involved in understanding how the layers interact. I think a better approach is a more functional one: utility methods for things like splitting and joining buffers, sequencing, sending via various types of sockets, etc. There is actually another paper which talks about this, and I found it very helpful: http://portal.acm.org/citation.cfm?doid=323647.323645. JGroups has numerous configurations out of the box, all of which can be found in the conf directory of the JGroups source from CVS. One of the difficult things about JGroups, in my opinion, is that there are so many different configuration possibilities; it's a bit difficult to know exactly what to use, though for some I'm sure that flexibility is a nice "feature". All personal taste, I suppose.
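The "functional" style argued for here, small composable utilities rather than a fixed layer stack, might look something like this sketch for buffer fragmentation and reassembly. The function names and fragment format are illustrative:

```python
# Composable protocol utilities a protocol implementation can call
# directly, instead of inheriting a position in a fixed layer stack.

def fragment(payload, max_size):
    """Split a buffer into (index, total, chunk) fragments."""
    chunks = [payload[i:i + max_size]
              for i in range(0, len(payload), max_size)] or [b""]
    return [(i, len(chunks), c) for i, c in enumerate(chunks)]

def reassemble(fragments):
    """Join fragments back into the original buffer (any arrival order)."""
    fragments = sorted(fragments)          # order by fragment index
    total = fragments[0][1]
    assert len(fragments) == total, "missing fragments"
    return b"".join(chunk for _, _, chunk in fragments)
```

A sequencing utility or a socket-sending utility would sit alongside these as peers, and a protocol would compose exactly the ones it needs, in whatever order makes sense for it.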
To specifically understand Horus vs. Ensemble, read this: http://dsl.cs.technion.ac.il/projects/Ensemble/overview.html There is a good paper which compares a number of the major group communication toolkits available. Here's a link: http://portal.acm.org/citation.cfm?doid=503112.503113 There's another paper which I believe is similar. Though I haven't yet had a chance to read it. Here's a link to that one: http://portal.acm.org/citation.cfm?doid=1041680.1041682 Hope this all helps. - Laran
On Fri, 2005-05-27 at 13:53 -0400, Laran Evans wrote:
The IETF reliable multicast working group has an overview of the different dimensions and building blocks, specs for some of the different types, etc.: http://www.ietf.org/html.charters/rmt-charter.html
On 5/26/05, Laran Evans <lc278@cornell.edu> wrote:
We are working on something similar for a project at work also. We are using ZeroConf (formerly Apple Rendezvous, now Apple Bonjour) for cluster discovery and a Perspective Broker interface for exchanging information. We considered reliable multicast but decided against it because our clusters are not going to be very big, and we are going to use a kind of inversion-of-control pattern instead of distributing all the data everywhere. But I am interested in a *common idiom* for clustering Twisted servers. I would definitely say ZeroConf needs to be the discovery mechanism; no need to reinvent that wheel, and a Twisted implementation is easy. -- If you don't know what you want, you probably need a nap.
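The discovery half of this design can be sketched as a registry driven by ZeroConf-style service add/remove events. The event interface below is hypothetical (real mDNS libraries differ in detail), and the resulting addresses are what a Perspective Broker client factory would then connect to:

```python
# Sketch of cluster-membership bookkeeping fed by ZeroConf-style
# browse events. The callback names and service-naming convention
# ("_cluster._tcp") are illustrative assumptions, not a real API.

class ClusterRegistry:
    def __init__(self):
        self.peers = {}            # service name -> (host, port)

    def service_added(self, name, host, port):
        # Called when the mDNS browser sees a new cluster announcement.
        self.peers[name] = (host, port)

    def service_removed(self, name):
        # Called when a peer's announcement expires or is withdrawn.
        self.peers.pop(name, None)

    def addresses(self):
        """Endpoints a PB client factory could connect to, sorted."""
        return sorted(self.peers.values())
```

The point of keeping this piece separate is that the transport (PB here) never has to know how peers were found, so the discovery mechanism can be swapped without touching the messaging code.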
jarrod roberson wrote:
From what I've read about ZeroConf, it seems like a LAN-only discovery mechanism. For clustering a web server, LAN-only is fine, but there are certainly cases where it would be valuable to integrate WAN nodes. If I were to build a toolkit, I would want it to run optimally on a LAN while maintaining the ability to achieve reliability and high performance on a WAN as well. The folks at the AccessGrid, at least, would I'm sure appreciate that ability.
The keyword in reliable multicast is "reliable". As soon as you have two processes that you need to keep in sync, some form of reliable communication is necessary, and by virtue of the fact that there is more than one process, sending messages falls under the title of multicast. So I'm very interested to know how you integrated the IoC pattern to achieve reliability. This sounds a bit like a push vs. pull paradigm; if that's the case, I can point you to a paper which discusses the achievable properties of push, pull and push-pull scenarios.
ZeroConf does definitely look solid, and I'm 100% with you on not reinventing the wheel. So I'll finish reading some of the ZeroConf specs to see what it can do. - Laran
participants (5)
- Darran Edmundson
- Ian Duggan
- Itamar Shtull-Trauring
- jarrod roberson
- Laran Evans