
Chaz. wrote:
I started out using Spread some time ago (more than two years ago). The implementation was limited to a hundred or so nodes (that is noted in the Spread implementation documentation). Secondly, it isn't quite as lightweight as you think (I've measured the performance).
It is a very nice system, but once you get to thousands of machines, very little work has been done on solving many of the problems. My research in this area goes back almost a decade, starting out with Horus.
I must admit to not having attempted to scale it that far, but I was under the impression that only the more expensive delivery modes were that costly. But by the sounds of it, you don't need me to tell you that.
Actually, I am part of the IRTF groups on P2P, E2E, and SAM. I know the approaches that are being tossed about; I have tried to implement some of them. I just am not of the opinion that smart people can't find solutions to tough problems.
Ok, in which case my apologies. My reading of your posts had led me to believe, incorrectly, that you might not be familiar with the various issues. In that case, you can (and should) disregard most of it.
Is multicast or broadcast the right way? I don't know, but I do know that without trying we will never know.
It's clearly right for some things - I'm just not sure how much bi-directional distribution would be helped by it, since at some point you've got to get the replies back.
Having been part of the IETF community for a lot of years (I was part of the group that worked on SNMP v1 and the WinSock standard), I know that when the "pedal meets the metal" you sometimes discover interesting things.
I didn't realise WinSock went near the IETF. You learn something new every day.
Knuth and his comments on premature optimisation apply here. Have you tried it? You might be surprised.
I am sorry to say I don't know the paper or research you are referring to. Can you point me to some references?
Sorry, it's a phrase from Donald Knuth's (excellent) three-volume programming book, "The Art of Computer Programming". Highly recommended.
Thanks for the information. This is what makes me think that I want something based on UDP and not TCP! And if I can do RMT (reliable multicast transport, or some variant of it) I might be able to get better performance. But, as I said, that is the nice thing about not having someone telling me I need to get a product out the door tomorrow! I have time to experiment and learn.
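To make the experiment concrete: the UDP multicast send path itself is only a few lines; the real work is the reliability layer on top. A minimal sketch in Python (the group address and port are made up for illustration):

    import socket

    GROUP = "239.1.2.3"   # hypothetical administratively-scoped multicast group
    PORT = 5007

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps datagrams on the local subnet; raise it for routed multicast.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(b"hello, group", (GROUP, PORT))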
When I wrote my reply I hadn't seen your comment on the app being distributed storage.
How many calls per second are you doing, and approximately what volume of data will each call exchange?
This is information I can't provide, since the system I am designing has no equivalent in the marketplace today (either commercial or open source). All I know is that the first version of the system I built - using C/C++ and a traditional architecture (a few dozen machines) - was able to handle 200 transactions/minute (using SOAP). While there were some "short messages" (less than a normal MTU), I had quite a few that topped out at 50 Kbytes and some up to 100 Mbytes.
Oh. In which case more or less everything I wrote is useless!
Successful transmission is really the easy bit for multicast. There are IGMP snooping problems, IGMP querier misbehaviour, loss of forwarding on an upstream IGP flap, flooding due to global MSDP issues, and so forth.
I agree about the successful transmission. You've lost me on the IGMP part. Can you elaborate on your thoughts?
Well, my experience of large multicast IPv4 networks is that short interruptions in multicast connectivity are not uncommon. There are a number of reasons for this, which can be broadly broken down into first-hop and subsequent-hop issues.

On the first hop, in a routed-multicast environment, I've seen the subnet IGMP querier (normally the gateway) get pre-empted by badly configured or plain broken OS stacks (e.g. someone running Linux with the early IGMPv3 patches). I've also seen confusion on highly-available subnets (e.g. VRRPed networks) where the IGMP querier and the multicast designated forwarder are different; this can cause issues with IGMP snooping on the downstream layer-2 switches when the DF is no longer on the path that the layer-2 snooping has built.

Further upstream, changes in the unicast routing topology can affect PIM. Most of these are only issues with routed multicast. Subnet-local multicast is a lot simpler, though you do still need an IGMP querier and switches with IGMP snooping.
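One cheap way to see those interruptions from the host side is a heartbeat listener: have one node multicast an incrementing counter every second, join the group on the others, and log any gaps. A sketch (group, port, and message format are assumptions for illustration):

    import socket
    import struct
    import time

    GROUP = "239.1.2.3"   # hypothetical group, matching the sender sketch above
    PORT = 5007

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Joining the group is what generates the IGMP membership report.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    last_seq = None
    while True:
        data, addr = sock.recvfrom(1500)
        seq = int(data)   # the sender transmits an incrementing ASCII counter
        if last_seq is not None and seq != last_seq + 1:
            # A jump here usually means a querier/snooping/PIM hiccup upstream.
            print(f"{time.ctime()}: gap, {seq - last_seq - 1} heartbeats lost")
        last_seq = seq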
I do know I need EXACTLY-ONCE semantics, but how and where I implement them is the unknown. When you use TCP, you assume the network provides the bulk of the solution. I have been thinking that if I use a less reliable network - one with low overhead - I can provide the server part that does the EXACTLY-ONCE piece.
As to why I need EXACTLY-ONCE: well, if I have to store something, I know I absolutely need to store it. I can't be in the position of not knowing whether it has been stored - it must be there.
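The usual recipe for that piece - and this is only a sketch of one approach, with every name in it invented for illustration - is at-least-once retransmission plus server-side deduplication: the sender stamps each write with a (sender-id, sequence) pair and retries until acknowledged, and the server applies a write only the first time it sees that pair. Retries give at-least-once delivery; the dedup table turns it into effectively exactly-once application.

    # Sketch: server-side dedup turning at-least-once delivery into
    # effectively-exactly-once application of writes. Names are illustrative.
    applied = set()   # (sender_id, seq) pairs already written

    def handle_write(sender_id, seq, payload, store, send_ack):
        key = (sender_id, seq)
        if key not in applied:
            store.write(payload)   # must be durable *before* the ack goes out
            applied.add(key)
        # Ack unconditionally: the sender may be retrying because the
        # previous ack was lost, not because the write was.
        send_ack(sender_id, seq)

In practice the dedup table has to be durable too (and garbage-collected, e.g. with per-sender sliding windows), or a server crash reintroduces duplicates.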
Thanks for the great remarks....I look forward to reading more.
This makes a lot more sense now I know it's storage related. You're right, this is a tricky and uncommon problem. Let me see if I've got this right: you're building some kind of distributed storage service. Clients will access the storage by a "normal" protocol to one of the nodes. Reads from the store are relatively easy, but writes to the store will need to be distributed to all or a subset of the nodes. Obviously you'll have a mix of lots of small writes and some very large writes.

Hmm. Are you envisioning that you might have more than one storage set on the nodes, and using a different multicast group per storage set to build optimal distribution?

You might be able to perform some tricks depending on whether this service provides block- or filesystem-level semantics. If it's the latter, you could import some techniques from the distributed version control arena: broadly speaking, node+version number each file, "broadcast" (in the application sense) just the file name + new node + new version to the other store nodes, and have them lock the local copy and initiate a pull from the updated node. For block-level storage, that's going to be a lot harder.

For the multicast itself, something like NORM, which as you probably know is basically a forward-error-corrected transmit channel with receiver-triggered retransmits, would probably work. An implementation would likely be non-trivial, but a fascinating project.
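To make the version-broadcast idea concrete, here is a sketch of that announce-and-pull scheme (the message format, group address, and helper names are all assumptions): the write path multicasts only tiny metadata, and the bulk data moves over ordinary unicast pulls.

    # Sketch of the announce-and-pull replication described above.
    # Group address, message format, and helpers are assumptions.
    import json
    import socket

    GROUP, PORT = "239.1.2.4", 5008   # hypothetical per-storage-set group

    def announce_update(path, version, node_id):
        """Multicast just (file, version, owner), not the file contents."""
        msg = json.dumps({"path": path, "version": version, "node": node_id})
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        s.sendto(msg.encode(), (GROUP, PORT))

    def on_announce(msg, local_versions, lock_file, pull_from):
        """On each replica: lock the stale copy, then pull the new version."""
        meta = json.loads(msg)
        if local_versions.get(meta["path"], 0) < meta["version"]:
            lock_file(meta["path"])                 # block reads of stale data
            pull_from(meta["node"], meta["path"])   # unicast fetch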
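And the defining trick in a NORM-style transport is that receivers, not the sender, drive repair: data goes out in numbered segments (plus FEC parity in real NORM), and a receiver that spots a hole multicasts a NACK so the sender repairs just that segment. A toy version of the NACK logic, with the FEC left out and all names invented:

    # Toy sketch of receiver-driven repair, the core of a NORM-like transport.
    # Real NORM adds FEC parity segments and NACK suppression timers.

    def receiver_step(seq, payload, received, send_nack):
        """Track arriving segments and NACK any holes now visible."""
        received[seq] = payload
        for missing in range(min(received), max(received)):
            if missing not in received:
                # NACKs are themselves multicast so that other receivers
                # missing the same segment can suppress duplicate requests.
                send_nack(missing)

    def sender_on_nack(missing_seq, sent_segments, multicast_send):
        """Repair by re-multicasting just the requested segment."""
        multicast_send(missing_seq, sent_segments[missing_seq])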