[Python-Dev] Pythonic concurrency

Michael Sparks ms at cerenity.org
Fri Oct 7 23:02:38 CEST 2005

[ Possibly overlengthy reply. However, given multiple sets of cans of
  worms... ]
On Friday 07 October 2005 07:25, Josiah Carlson wrote:
> One thing I notice is absent from the Kamaelia page is benchmarks.

That's largely for one simple reason: we haven't done any yet. 

At least not anything I'd call a benchmark. "There's lies, damn lies,
statistics and then there's benchmarks."

//Theoretically// I suspect that the system /could/ perform as well as
traditional approaches to dealing with concurrent problems, whether
single-threaded or multi-threaded/multi-process. This is based on the
recognition of two things:

    * Event systems (often, though not always, implementing state machine
      type behaviour) often have intermediate buffers between states &
      operations. Some systems divide a problem into multiple reactors and
      stages and have communication between them, though this can sometimes
      be hidden. All we've done is make this much more explicit.

    * Event systems (and state machine based approaches) can often be used
      to effectively say "I want to stop and wait here, come back to me later"
      or simply "I'm doing something processor intensive, but I'm being nice
      and letting something else have a go". The use of generators here simply
      makes that particular behaviour more explicit. This is a nice bonus of
      the generator-based approach.
[neither is a negative really, just different. The first bullet has implicit
 buffers in the system, the latter has a more implicit state machine
 in the system. ICBVW here of course.]
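To illustrate the second bullet, here's a minimal sketch (hypothetical, not Kamaelia's real API) of generators as cooperatively-scheduled components: each `yield` means "I'm pausing, let something else have a go".

```python
# Each component is a generator; "yield" hands control back to a
# simple round-robin scheduler until the generator is exhausted.

def counter(name, limit, trace):
    for i in range(limit):
        trace.append((name, i))
        yield  # "come back to me later"

def scheduler(components):
    active = list(components)
    while active:
        for comp in list(active):
            try:
                next(comp)  # run one component one step
            except StopIteration:
                active.remove(comp)  # component finished

trace = []
scheduler([counter("A", 2, trace), counter("B", 3, trace)])
print(trace)  # interleaved: [('A', 0), ('B', 0), ('A', 1), ('B', 1), ('B', 2)]
```

The point is only that the "stop and wait here" behaviour is written in straight-line code rather than as explicit state machine states.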

However, COULD is not is, and whilst I say "in theory", I am painfully aware 
that theory and practice often have a big gulf between them.

Also, I'm certain that at present our performance is nowhere near optimal.
We've focussed on trying to find what works from a few perspectives rather
than performance (one possible definition of correctness here, but certainly
not the only one). Along the way we've made compromises in favour of clarity
as to what's going on, rather than performance.

For example, one area we know we can optimise is the handling of
message delivery. The mini-axon tutorial represents delivery between
active components as being performed by an independent party - a
postman. This is precisely what happens in the current system.

That can be optimised, for example, by collapsing outboxes into inboxes
(ie removing one of the lists when a linkage is made and changing the
reference), and at that point you have a single intermediate buffer (much
like an event/state system communicating between subsystems). We haven't
done this yet. Whilst it would partly simplify things, it makes other
areas more complex, and seems like premature optimisation.
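As a sketch of that optimisation (hypothetical names, not the real Kamaelia code): instead of a postman copying messages from the sender's outbox list into the receiver's inbox list, a linkage can make both ends share a single list.

```python
# Collapsing outboxes into inboxes: when a linkage is made, replace the
# sender's outbox with a reference to the receiver's inbox, so a send
# appends directly into the receiver's buffer and the intermediate
# "postman" delivery step disappears.

class Component:
    def __init__(self):
        self.inbox = []
        self.outbox = []

def link(sender, receiver):
    # Remove one of the two lists by sharing the reference.
    sender.outbox = receiver.inbox

a, b = Component(), Component()
link(a, b)
a.outbox.append("hello")
print(b.inbox)  # ['hello'] -- delivered with no copying step
```

The trade-off mentioned above is visible even here: once the lists are shared, unlinking or relinking components needs more care.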

However I have performed an //informal comparison// between the use of a 
Kamaelia type approach and a traditional approach not using any framework at 
all for implementing a trivial game. (Cats bouncing around the screen 
scaling, rotating, etc, controlled by a user) The reason I say Kamaelia-type 
approach is because it was a mini-axon based experiment using collapsed 
outboxes to inboxes (as above).

The measure I used was simply framerate. This is a fair, real-world measure -
if it drops too low, the system is simply unusable. I measured the
framerate before transforming the simplistic game to work well in the 
framework, and after transforming it. The differences were:
   * 5% drop in performance/framerate
   * The ability to reuse much of the code in other systems and environments.

From that perspective it seems acceptable (for now). This *isn't*, as you
would probably say, a rigorous or trustworthy benchmark, but it was a useful
"smoke test", if you like, of the approach.

From a *pragmatic* perspective, currently the system is fast enough for simple 
games (say a hundred, two hundred, maybe more, sprites active at once),
for interactive applications, video players, realtime audio mixing and a
variety of other things, so currently we're leaving that aside.

Also, from an even more pragmatic perspective, if you're after performance
and throughput then I'd say use Twisted, since it's a proven framework.

**If** our stuff turns out to be useful, we'd like to find a way of making our
stuff available inside Twisted -- if they'd like it (*) -- since we're not
the least bit interested in competing with anyone :-) So far *we're* finding
it useful, which is all I'd personally claim, and hope that it's useful to
others.
   (*) The all too brief conversation I had with Tommi Virtanen at Europython
       suggested that he at least thought the pipeline/graphline idea was
       worth taking - so I'd like to do that at some point, even if it 
       sidelines our work to date.

Once we've validated the model though (which I expect to take some time,
you only learn if it's validated by building things IMO), then we'll look at
optimisation.  (if the model is validated :-)

All that said, I'm open to suggestion as to what sort of benchmark you'd like 
to see. I'm more interested in benchmarks that actually mean something rather 
than say X is better than Y though.

Summarising then: no benchmarks, yet. If you're after speed, I'm certain
you can find that elsewhere. If you're after an easy way of dealing with a
concurrent problem, that's where we're starting from, and then optimising.
We're very open to suggestions for improvement, both on usability/learnability
and on keeping doors open to performance. I'd hate to have to rewrite
everything in another language later simply due to poor design decisions.

[ Network controlled Networked Audio Mixing Matrix ]
> Very neat.  How much data?  What kind of throughput?  What kinds of
> latencies?

For the test system we tested with 3 raw PCM audio data streams. That's
3 x 44.1kHz, 16 bit stereo - which is around 4.2Mbit/s of data from the
network being processed realtime and output back to the network at
1.4Mbit/s. So, not huge numbers, but not insignificant amounts of data
either. I suppose one thing I can take more time with now is to look at
the specific latency of the mixer. It didn't *appear* to be large, however.
(There appeared to be similar latency in the system with or without the
mixer in the path.)
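For reference, those bit rates fall straight out of the PCM parameters:

```python
# Check the quoted figures: 3 raw PCM streams, 44.1kHz, 16-bit, stereo.
sample_rate = 44_100       # samples per second, per channel
bits_per_sample = 16
channels = 2
streams = 3

per_stream = sample_rate * bits_per_sample * channels  # bits/s for one stream
inbound = streams * per_stream                         # three streams coming in
outbound = per_stream                                  # one mixed stream going out

print(inbound / 1e6, outbound / 1e6)  # ~4.23 and ~1.41 Mbit/s
```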

[[The aim of the rapid prototyping session was to see what could be done
  rather than to measure the results. The total time taken for coding the
  mixing matrix was 2.5 days. About 1/2 day spent on finding an issue we had
  with network resends regarding non-blocking sockets. A day with me totally
  misunderstanding how mixing raw audio byte streams works. The backplane
  was written during that 3 day time period. The control protocol for
  switching on/off mixes and querying the system though was ~1.5 hours
  from start to finish, including testing. To experiment with what dataflow
  architecture might work, I knocked up a command line controlled dynamic
  graph viewer (add nodes, link nodes, delete nodes) in about 5 minutes and
  then experimented with what the system would look like if done naively. The
  backplane idea became clear as useful here because we wanted to allow
  multiple mixers. ]]

A more interesting effect we found was dealing with mouse movement in pygame
where we found that *huge* numbers of messages being sent one at a time and
processed one at a time (with yields after each) became a huge bottleneck.

It made more sense to batch the events and pass them to client surfaces.
(If that makes no sense we allow pygame components to act as if they have
control of the display by giving them a surface from a pygame display service.
This acts essentially as a simplistic window manager. That means pygame events 
need to be passed through quickly and cleanly.)
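A much simplified sketch (hypothetical, not the real Kamaelia/pygame code) of the batching change: one message per event with a yield after each, versus one message carrying the whole batch.

```python
# Contrast: per-event delivery (a scheduler round-trip after every
# event, the bottleneck described above) versus batched delivery.

def one_at_a_time(events, deliver):
    for ev in events:
        deliver([ev])   # one message per event...
        yield           # ...and a yield after each one

def batched(events, deliver):
    deliver(list(events))  # a single message carrying the whole batch
    yield

singles, batch = [], []
list(one_at_a_time(range(3), singles.append))  # drive the generators to completion
list(batched(range(3), batch.append))
print(singles)  # [[0], [1], [2]] -- three deliveries, three yields
print(batch)    # [[0, 1, 2]]     -- one delivery, one yield
```

With hundreds of mouse-motion events per second, cutting the per-message scheduling overhead this way is what removed the bottleneck.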

The reason I like using pygame for these things is because a) it's relatively
raw and fast, and b) games are another often /naturally/ concurrent system. Also
it normally allows other senses beyond reading numbers/graphs to kick in when
evaluating changes: "that looks better/worse", "there's something wrong
here".

> I have two recent posts about the performance and features of a (hacked
> together) tuple space system 

Great :-) I'll have a dig around.

> The only thing that it is missing is a prioritization mechanism (fifo,
> numeric priority, etc.), which would get us a job scheduling kernel. Not
> bad  for a "message passing"/"tuple space"/"IPC" library.   

Sounds interesting. I'll try and find some time to have a look and have a 
play. FWIW, we're also missing a prioritisation mechanism right now. Though 
currently I have Simon Wittber's latest release of Nanothreads on my stack of
to look at. I do have a soft spot for Linda type approaches though :-)

Best Regards,

Michael Sparks, Senior R&D Engineer, Digital Media Group
Michael.Sparks at rd.bbc.co.uk, http://kamaelia.sourceforge.net/
British Broadcasting Corporation, Research and Development
Kingswood Warren, Surrey KT20 6NP

This e-mail may contain personal views which are not the views of the BBC.
