[Athena] Is ReliableMessageDelivery really necessary?
Hi,

I've hit a problem as my app has got bigger (about 30-40 widgets now, all chattering roughly once every 2 seconds) where the reliable message delivery mechanism is spiralling out of control. It seems that the constant back and forth means that large 'baskets' of messages are resent. The more this happens, the busier everything gets until the browser becomes unresponsive.

There's a fix for it: [Divmod-dev] athena duplicate messages issue but I'm slightly concerned about the potential for lost messages - and also confused about how this could happen. Given that HTTP is a reliable connection-oriented transport, where is the gap that messages can fall through?

I think I can cope with lost messages in most cases, so would it be useful to add a kind of 'sendRemote' that was like 'callRemote' but didn't care about a response? Or maybe this already exists and I've missed it?

Paul.

P.S. this app is likely to get more noisy - is it likely that I'll have to abandon Athena for Orbited or similar? I mean, are there architectural differences that will prevent Athena scaling?
On Wed, 1 Jul 2009 11:15:35 +0100, Paul Thomas <spongelavapaul@googlemail.com> wrote:
Hi,
I've hit a problem as my app has got bigger (about 30-40 widgets now, all chattering roughly once every 2 seconds) where the reliable message delivery mechanism is spiralling out of control. It seems that the constant back and forth means that large 'baskets' of messages are resent. The more this happens, the busier everything gets until the browser becomes unresponsive.
If you can produce a minimal example which demonstrates this behavior, it would probably be very helpful in improving the situation.
There's a fix for it: [Divmod-dev] athena duplicate messages issue but I'm slightly concerned about the potential for lost messages - and also confused about how this could happen. Given that HTTP is a reliable connection-oriented transport, where is the gap that messages can fall through?
Actually, HTTP is not a reliable transport. The most obvious shortcoming it has is that there is no way for a server to know if a client received a response or not, but there are others. So ReliableMessageDelivery is necessary.
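To make the point concrete: because the sender cannot tell whether a response was received, it has to keep every message until the peer explicitly acknowledges it, and resend whatever is still outstanding. The following is a minimal Python sketch of that idea only. It is not Athena's implementation, and the names `ReliableSender`, `queue`, `basket` and `acknowledge` are hypothetical.

```python
class ReliableSender:
    """Hypothetical sender that keeps each message until it is acknowledged."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}  # sequence number -> payload

    def queue(self, payload):
        """Assign the next sequence number and remember the message."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq

    def basket(self):
        """Everything not yet acknowledged, oldest first.

        If a request or its response is lost, the same messages simply
        appear again in the next basket.
        """
        return sorted(self.unacked.items())

    def acknowledge(self, ack):
        """The peer confirmed receipt of every message with seq <= ack."""
        for seq in [s for s in self.unacked if s <= ack]:
            del self.unacked[seq]
```

Note how the basket grows while acknowledgements are delayed, which is the kind of build-up described in the original report.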
I think I can cope with lost messages in most cases, so would it be useful to add a kind of 'sendRemote' that was like 'callRemote' but didn't care about a response? Or maybe this already exists and I've missed it?
This is an interesting idea. I haven't considered having such a feature in Athena before. It may be worth exploring. The first problem that comes to mind is that if any part of a page uses callRemote, sendRemote's advantages are largely lost. This would be because the messages generated by callRemote will still need to be sent, so whatever retransmission logic is present in ReliableMessageDelivery will still be invoked.
Paul.
P.S. this app is likely to get more noisy - is it likely that I'll have to abandon Athena for Orbited or similar? I mean, are there architectural differences that will prevent Athena scaling?
I certainly hope that Athena can handle whatever load you intended to put on it, or that we can work together to fix whatever problems it has which would prevent that. :)

Jean-Paul
On 1 Jul 2009, at 22:45, Jean-Paul Calderone wrote:
On Wed, 1 Jul 2009 11:15:35 +0100, Paul Thomas <spongelavapaul@googlemail.com> wrote:
Hi,
I've hit a problem as my app has got bigger (about 30-40 widgets now, all chattering roughly once every 2 seconds) where the reliable message delivery mechanism is spiralling out of control. It seems that the constant back and forth means that large 'baskets' of messages are resent. The more this happens, the busier everything gets until the browser becomes unresponsive.
If you can produce a minimal example which demonstrates this behavior, it would probably be very helpful in improving the situation.
I've been tasked with doing this anyway to help us evaluate other solutions. I'm sure I can convince the boss to make it available.
There's a fix for it: [Divmod-dev] athena duplicate messages issue but I'm slightly concerned about the potential for lost messages - and also confused about how this could happen. Given that HTTP is a reliable connection-oriented transport, where is the gap that messages can fall through?
Actually, HTTP is not a reliable transport. The most obvious shortcoming it has is that there is no way for a server to know if a client received a response or not, but there are others. So ReliableMessageDelivery is necessary.
Got it.
I think I can cope with lost messages in most cases, so would it be useful to add a kind of 'sendRemote' that was like 'callRemote' but didn't care about a response? Or maybe this already exists and I've missed it?
This is an interesting idea. I haven't considered having such a feature in Athena before. It may be worth exploring. The first problem that comes to mind is that if any part of a page uses callRemote, sendRemote's advantages are largely lost. This would be because the messages generated by callRemote will still need to be sent, so whatever retransmission logic is present in ReliableMessageDelivery will still be invoked.
Right - and I _would_ need both in the same page. Also, as Glyph points out, I wouldn't like out-of-order messages.
Paul.
P.S. this app is likely to get more noisy - is it likely that I'll have to abandon Athena for Orbited or similar? I mean, are there architectural differences that will prevent Athena scaling?
I certainly hope that Athena can handle whatever load you intended to put on it, or that we can work together to fix whatever problems it has which would prevent that. :)
We'll be doing an evaluation soon. Performance will play a part but we also have to consider integration with UI toolkits (jQuery UI etc.). If we do stick with Athena, we'll be providing patches and tests.

Thanks to you both,
paul.
Hi,

as you can see here: http://www.divmod.org/trac/browser/trunk/Nevow/nevow/athena.py#L800 , if incomingMessages is a very, very long list, chances are that it will block for a small amount of time. Also, if you happen to run the web server interactively, calling a lot of log.msg(…) functions will print out a lot of stuff on the console, which will also take some CPU cycles and GUI resources[1]. I think the best approach would be first to remove lines 803 - 806 ("else: log.msg(…)") and test whether the problem of browser lagging still exists - of course, only if you run server and browser on the same machine.

There are some cases which may lead to accumulation of messages on the client, which then leads to lags on the client side (and a lot of log messages on the server). Those may be bugs in the application code, but I think some of those cases are browser-dependent too (for some code I tested, Google Chrome sent a lot of duplicate messages, while Firefox sent none). Some people say that the duplicate messages phenomenon may even be extension-dependent, as in the case of Firefox with FireBug enabled.

Unfortunately, it seems that none of us is able to provide a minimal code sample of the described misbehavior. Maybe it would be better just to disable those log messages in the release builds or make them optional?

-- M

Ad. 1 - … and, even if you log web server output to a file, in this case, duplicate message logs may take some space.
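The suggestion to make the duplicate-message logging optional can be sketched as follows. This is a standalone Python illustration, not the code at the athena.py link above; `process_basket`, its parameters and the logger name are hypothetical, and sequence numbers are assumed to start at 0.

```python
import logging

log = logging.getLogger("basket-demo")  # hypothetical logger name


def process_basket(last_seen, basket, handler, log_duplicates=False):
    """Deliver messages newer than last_seen and return the new last_seen.

    basket is a list of (seq, payload) pairs in ascending seq order.
    """
    for seq, payload in basket:
        if seq == last_seen + 1:
            # The next message we were waiting for: hand it to the app.
            handler(payload)
            last_seen = seq
        elif seq <= last_seen:
            # Already delivered earlier; only log when explicitly asked to,
            # so a long basket of duplicates costs almost nothing.
            if log_duplicates:
                log.debug("dropping duplicate message %d", seq)
        else:
            # A gap: stop and wait for the sender to retransmit.
            break
    return last_seen
```

With `log_duplicates` left at its default, a resent basket is skipped silently, which separates the cost of the logging from the cost of the retransmission itself.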
On 4 Jul 2009, at 08:34, Michał Pasternak wrote:
Hi,
as you can see here: http://www.divmod.org/trac/browser/trunk/Nevow/nevow/athena.py#L800 , if incomingMessages is a very, very long list, chances are that it will block for a small amount of time. Also, if you happen to run the web server interactively, calling a lot of log.msg(…) functions will print out a lot of stuff on the console, which will also take some CPU cycles and GUI resources[1]. I think the best approach would be first to remove lines 803 - 806 ("else: log.msg(…)") and test whether the problem of browser lagging still exists - of course, only if you run server and browser on the same machine.
There are some cases which may lead to accumulation of messages on the client, which then leads to lags on the client side (and a lot of log messages on the server). Those may be bugs in the application code, but I think some of those cases are browser-dependent too (for some code I tested, Google Chrome sent a lot of duplicate messages, while Firefox sent none). Some people say that the duplicate messages phenomenon may even be extension-dependent, as in the case of Firefox with FireBug enabled.
Unfortunately, it seems that none of us is able to provide a minimal code sample of the described misbehavior. Maybe it would be better just to disable those log messages in the release builds or make them optional?
I'll certainly look into this, but I believe that the system causing concern was run without a controlling terminal - and I have a suspicion that the offending log was even removed. The system is IO intensive though (hence using twisted in the first place) and we have had instances in the past of accidentally using blocking calls. I'll report back in a few weeks if we get some real tests going.
On 1 Jul 2009, at 22:45, Jean-Paul Calderone wrote:
On Wed, 1 Jul 2009 11:15:35 +0100, Paul Thomas <spongelavapaul@googlemail.com> wrote:
Hi,
I've hit a problem as my app has got bigger (about 30-40 widgets now, all chattering roughly once every 2 seconds) where the reliable message delivery mechanism is spiralling out of control. It seems that the constant back and forth means that large 'baskets' of messages are resent. The more this happens, the busier everything gets until the browser becomes unresponsive.
If you can produce a minimal example which demonstrates this behavior, it would probably be very helpful in improving the situation.
It's been quite some time, but I may have time to look into this soon (meaning in the next few months). I have a minimal example code, but it's a bit big - does this need to be in a ticket somewhere?

Paul.

8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----
// resources/css/test1.css
#workspace > div {
    display: inline-block;
    margin: 10px;
    padding: 5px;
    border: 1px solid black;
}
#workspace > div p {
    margin: 0;
    padding: 0;
    font-size: 10pt;
}
h3 {
    font-size: 12pt;
    margin: 0;
    padding: 0;
}
8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----
// resources/js/widgets/Status.js
// import Divmod
// import Nevow.Athena

Nevow.Athena.Widget.subclass(Status, 'SimpleStatus').methods(
    function __init__(self, node) {
        Status.SimpleStatus.upcall(self, '__init__', node);
        self.contentP = node.getElementsByTagName("p")[0];
        self.count = 0
    },

    function update(self, count) {
        self.contentP.innerHTML = "T: " + count;
        if( count != self.count + 1 ){
            self.node.style.background = "red";
        }
        self.count = count
    }
);
8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----
// test1.py
from os import path
from logging import getLogger, basicConfig, INFO, DEBUG, WARN
from random import random

from twisted.internet import reactor
from twisted.python.log import ILogObserver, PythonLoggingObserver
from nevow.athena import LivePage, LiveElement, AutoJSPackage, expose, renderer
from nevow import loaders, tags as T, static

#from guppy import hpy
#h = hpy()

basicConfig(level=DEBUG)
#heap_log = getLogger('heap')
#getLogger('twisted').setLevel(WARN)

def heap_poll():
    heap_log.debug("==== HEAP =====")
    heap_log.debug(h.heap())
    reactor.callLater(5, heap_poll)

#reactor.callWhenRunning(heap_poll)

_RESOURCE_DIR = path.join(path.dirname(path.abspath(__file__)), 'resources')

log = getLogger('comet-test1')

class SimpleStatus(LiveElement):
    jsClass = u'Status.SimpleStatus'
    docFactory = loaders.stan(T.div(render=T.directive('liveElement'), class_='widget')[
        T.div(class_='widget-header', render=T.directive('title')),
        T.div(class_='widget-editbox'),
        T.div(class_='widget-content')[
            T.p['None']]])

    def __init__(self, id):
        super(SimpleStatus, self).__init__()
        self.num = id
        self.count = 0
        reactor.callLater(int(random() * 2 + 1), self._poll)

    def _poll(self):
        self.count += 1
        self.callRemote('update', self.count)
        reactor.callLater(random() * 2 + 1, self._poll)

    @renderer
    def title(self, req, tag):
        return tag[T.h3['Widget %d' % self.num]]

    @expose
    def getSomething(self, name):
        log.debug('getSomething(%r)' % name)
        return 'Not sure'

class Root(LivePage):
    docFactory = loaders.stan(
        T.html[
            T.head(render=T.directive('liveglue'))[
                T.title['Comet Test 1'],
                T.link(rel='stylesheet', type='text/css', href='css/test1.css')],
            T.body[
                T.div(id='workspace', class_='widget-place')[
                    [T.div(render=T.directive('simpleStatus'))] * 50
                ]]])

    addSlash = True

    children = {
        'css': static.File(path.join(_RESOURCE_DIR,'css')),
        'js': static.File(path.join(_RESOURCE_DIR,'js'))}

    def __init__(self):
        super(Root, self).__init__()
        self.jsModules.mapping.update(AutoJSPackage(path.join(
            _RESOURCE_DIR,'js','widgets')).mapping)
        log.info(str(self.jsModules.mapping))
        self.next_id = 1

    def child_(self, ctx):
        return Root()

    def render_simpleStatus(self,ctx,data):
        f = SimpleStatus(self.next_id)
        self.next_id += 1
        f.setFragmentParent(self)
        return ctx.tag[f]

from nevow import appserver
from twisted.application import service, internet

site = appserver.NevowSite(Root())
application = service.Application("Test 1")
webService = internet.TCPServer(8080, site)
webService.setServiceParent(application)
application.setComponent(ILogObserver, PythonLoggingObserver().emit)
On 7 Dec, 10:16 pm, spongelavapaul@googlemail.com wrote:
On 1 Jul 2009, at 22:45, Jean-Paul Calderone wrote:
On Wed, 1 Jul 2009 11:15:35 +0100, Paul Thomas <spongelavapaul@googlemail.com> wrote:
Hi,
I've hit a problem as my app has got bigger (about 30-40 widgets now, all chattering roughly once every 2 seconds) where the reliable message delivery mechanism is spiralling out of control. It seems that the constant back and forth means that large 'baskets' of messages are resent. The more this happens, the busier everything gets until the browser becomes unresponsive.
If you can produce a minimal example which demonstrates this behavior, it would probably be very helpful in improving the situation.
It's been quite some time, but I may have time to look into this soon (meaning in the next few months).
I have a minimal example code, but it's a bit big - does this need to be in a ticket somewhere?
A ticket would be excellent. Also, don't be afraid of a tar or zip. :)

Jean-Paul
On 10:15 am, spongelavapaul@googlemail.com wrote:
I've hit a problem as my app has got bigger (about 30-40 widgets now, all chattering roughly once every 2 seconds) where the reliable message delivery mechanism is spiralling out of control. It seems that the constant back and forth means that large 'baskets' of messages are resent. The more this happens, the busier everything gets until the browser becomes unresponsive.
This is unfortunate, but I'm sure it's fixable. At least, partially. Client-server communication, especially in JavaScript, isn't free.
There's a fix for it: [Divmod-dev] athena duplicate messages issue but I'm slightly concerned about the potential for lost messages - and also confused about how this could happen. Given that HTTP is a reliable connection-oriented transport, where is the gap that messages can fall through?
HTTP is neither reliable nor connection-oriented :). TCP is reliable and connection-oriented, but HTTP builds on top of it to produce something which is neither.

"Reliable" in this case doesn't mean that the transport is perfect and will deliver everything, but that if you send messages "1, 2, 3", you will get messages "1, 2, 3" in that order or you will get nothing at all. (Of course you may also get just "1", or "1, 2", but you will never get "3, 1, 2".)

Even if HTTP had a way to initiate the delivery of a message over a channel that was already busy receiving the response to another message (it doesn't), we'd have to contend with the browser APIs for issuing HTTP requests, which leave out significant portions of the actual protocol. For example, browser JavaScript may never issue more than two concurrent requests to the same host, since the spec says that's all that you can do.

So, what is happening here is that Nevow attempts to implement a protocol on top of HTTP, treating HTTP messages as individual, unreliable messages which may be eaten by beasts like transparent proxies and browser runtime bugs, and to present to your application a stream of messages which are always in order and never dropped. This is, as it happens, *exactly* what Orbited does, and Nevow could potentially be implemented on top of Orbited.

However, Nevow's implementation has a bug, and overzealously re-delivers messages when, frequently, re-delivery is not required. This is rarely a problem except for the noise that it generates in your log files and the performance problems that it creates, which you've noticed, if your message queue starts to back up.

So, my suggestion to you would be to read through the relevant JavaScript code for delivering "baskets" to the server, try to figure out what exactly is happening, and write a patch to correct this behavior. It's not trivial, but it's not rocket science either.
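The ordering guarantee described above (you may see "1", "1, 2" or "1, 2, 3", but never "3, 1, 2") can be illustrated with a small reorder buffer. This is a hedged Python sketch of the general technique, not Nevow's or Orbited's code; `InOrderReceiver` and its methods are hypothetical names, and sequence numbers are assumed to start at 1 to match the example in the text.

```python
class InOrderReceiver:
    """Hypothetical receiver that only ever delivers a prefix of the stream."""

    def __init__(self, deliver):
        self.deliver = deliver  # callback invoked with each payload, in order
        self.expected = 1       # next sequence number the app should see
        self.pending = {}       # messages that arrived early: seq -> payload

    def receive(self, seq, payload):
        if seq < self.expected or seq in self.pending:
            return  # duplicate of something already delivered or buffered
        self.pending[seq] = payload
        # Release every message that is now contiguous with what was delivered.
        while self.expected in self.pending:
            self.deliver(self.pending.pop(self.expected))
            self.expected += 1
```

Feeding it "3", then "1", then a duplicate "1", then "2" delivers "1, 2, 3" to the application: message 3 is held back until the gap before it has been filled.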
If I recall correctly, the problem is that the client will overzealously interrupt its own connection to the server where it is sending a basket of collected messages, in order to free up the HTTP connection to send a *new* message which it has generated. It would be better if the client would allow for a brief (and actually "brief" probably needs to be pretty long, in the wild) grace period to allow the HTTP request to be fully received and responded to before piling on more work.

Part of the problem here, of course, is that the crappy JavaScript browser HTTP API won't let us tell how much of our request has been uploaded or process the response as it arrives. So we have to guess what a reasonable timeout would be, rather than have the algorithm operate on actual data. In other words, you're right: the messages are not actually disappearing into a black hole :).

As far as what you should do: I think you should try to write a patch. It's not trivial, but it's not rocket science either: it's just computer science. Hopefully my description of the problem is accurate enough to get you started; I'm sure that if you ask for help on this list or on IRC as you're working on it, you will find no shortage of it.

Lots of people have reported this problem over the years but nobody has (as far as I can tell from searching right now) thought to even report the bug as a ticket on divmod.org, let alone contribute a fix for it.
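The grace-period idea can be sketched like this. It is a Python simulation under stated assumptions rather than a patch to Athena's JavaScript: `BasketSender`, the 5-second default and the injected `clock` are all hypothetical, and the right timeout would have to be found by experiment, as noted above.

```python
import time


class BasketSender:
    """Hypothetical client that leaves an in-flight request alone for a while."""

    def __init__(self, send, grace=5.0, clock=time.monotonic):
        self.send = send          # callable that transmits a list of messages
        self.grace = grace        # seconds to wait before giving up on a request
        self.clock = clock        # injectable so the logic can be tested
        self.queue = []           # messages not yet acknowledged
        self.in_flight_since = None

    def add(self, message):
        """Queue a new message and send only if the connection is free."""
        self.queue.append(message)
        self.flush()

    def flush(self):
        now = self.clock()
        busy = (self.in_flight_since is not None
                and now - self.in_flight_since < self.grace)
        if busy or not self.queue:
            return False  # do not interrupt the request already under way
        self.in_flight_since = now
        self.send(list(self.queue))
        return True

    def response_received(self, acked_count):
        """The server answered: drop acknowledged messages, send the rest."""
        del self.queue[:acked_count]
        self.in_flight_since = None
        self.flush()
```

New messages generated while a request is outstanding wait in the queue instead of aborting it; the whole backlog is resent only once the grace period has passed without a response.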
I think I can cope with lost messages in most cases, so would it be useful to add a kind of 'sendRemote' that was like 'callRemote' but didn't care about a response? Or maybe this already exists and I've missed it?
Could you cope with these messages arriving arbitrarily out of order? I am willing to bet not; it would just make your application extremely difficult to test, and it would start spewing exceptions when it started to get more heavily loaded, rather than making the browser unresponsive.
P.S. this app is likely to get more noisy - is it likely that I'll have to abandon Athena for Orbited or similar? I mean, are there architectural differences that will prevent Athena scaling?
participants (5)
- exarkun@twistedmatrix.com
- glyph@divmod.com
- Jean-Paul Calderone
- Michał Pasternak
- Paul Thomas