Distributed computing using SOAP. What about speed?

Tim Daneliuk tundra at tundraware.com
Fri Jul 27 00:40:01 EDT 2001


A gracious Graham Dumpleton wrote:
<SNIP>
 
> I will look at your paper. You may want to look further at the Python
> manual for OSE and you might see it ain't quite what you first had in
> mind.

I guess I should clarify something.  In order to have "transactional
correctness" you have to have a reliable transport somewhere in the
stack below the transactional elements (or transport reliability has
to be synthesized at the transactional layer - a really bad idea).

So long as the synchronous semantics of the reliable (connection-oriented,
usually, although there have been attempts made at reliable datagrams)
transport are not exposed to the *distributed application*, we can do
what is needed.  For example, if the underlying transport is RPC over
HTTP (as in SOAP), that's fine if:

1) Distributed apps don't see the RPC semantics and don't spin-lock waiting
   for calls to complete.

2) Transport reconnection is automatic and invisible to the application
   which sees the appearance of an apparently "always up, but not always
   fast" network underneath it.

3) All this layering of protocols is not excessively punitive in terms of
   performance.

4) The underlying speeds-n-feeds are up to the task of setting up and
   tearing down sessions for each object invocation.  For seriously big
   apps, they are NOT, so you either pool sessions, or run dedicated
   sessions based on the application's topology and multiplex access
   across those sessions (see the sketch just below this list).
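
To make points 2 and 4 concrete, here is a minimal Python sketch of a
transport wrapper that hides reconnection from the application and
multiplexes callers across a small pool of reused sessions instead of
setting up and tearing down one per invocation.  The class name, host,
port, pool size, and retry delay are all illustrative assumptions, not
part of any particular toolkit:

import http.client
import queue
import time

class PooledTransport:
    def __init__(self, host, port=80, pool_size=4, retry_delay=1.0):
        self.host, self.port = host, port
        self.retry_delay = retry_delay
        # Pre-create a small pool of sessions; callers share them.
        self.pool = queue.Queue()
        for _ in range(pool_size):
            self.pool.put(http.client.HTTPConnection(host, port))

    def post(self, path, body):
        # Send one request over a pooled session; reconnect transparently if
        # the transport fails, so the app sees "always up, not always fast".
        conn = self.pool.get()
        try:
            while True:
                try:
                    conn.request("POST", path, body,
                                 {"Content-Type": "text/xml"})
                    return conn.getresponse().read()
                except (http.client.HTTPException, OSError):
                    conn.close()
                    time.sleep(self.retry_delay)
                    conn = http.client.HTTPConnection(self.host, self.port)
        finally:
            self.pool.put(conn)   # return the (possibly new) session

Under these assumptions a caller would just do something like
transport = PooledTransport("soap.example.com") and
reply = transport.post("/rpc", envelope), never touching the session
lifecycle itself.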

In an ideal world, you'd have a layering scheme something like this:


       Chunk of distributed application.
           |     and/or    |
-------------------------------------------
tx Queuing API | Direct Message Passing API
-------------------------------------------
tx Q Manager  M|           |              M
----------------           |
            Reliable Messaging Layer
--------------------------------------------         Where, "M" is a pluggable
            Reliable Comms Transport      M          management & security
--------------------------------------------         facility for each layer
             Speeds, Feeds, & Wire        M          in question
--------------------------------------------


(You'll notice that my picture is a bit different from the usual
one, in which Q management lives *below* the messaging layer.  This
is a matter of taste because you can accomplish the same things
either way.  In any case, if you need message persistence as a
class of service (almost all enterprises need this on occasion),
you will end up implementing a disk-based Q under the Messaging
Layer, above and beyond what the Transactional Q Manager has to do.)
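
For illustration only (this is my sketch, not part of any product named
here), a disk-based Q giving message persistence as a class of service can
be as small as this; messages survive a crash of the process holding them,
and the filename is an arbitrary assumption:

import sqlite3

class PersistentQueue:
    def __init__(self, path="messages.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS q (id INTEGER PRIMARY KEY, body BLOB)")
        self.db.commit()

    def put(self, body):
        # The message hits the disk before put() returns.
        self.db.execute("INSERT INTO q (body) VALUES (?)", (body,))
        self.db.commit()

    def get(self):
        # Hand back the oldest message; it stays on disk until ack()ed.
        return self.db.execute(
            "SELECT id, body FROM q ORDER BY id LIMIT 1").fetchone()

    def ack(self, msg_id):
        # Only a successfully processed message is removed.
        self.db.execute("DELETE FROM q WHERE id = ?", (msg_id,))
        self.db.commit()

The point is simply that persistence lives underneath the messaging layer,
independent of whatever the transactional Q manager does above it.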

The problem is that everyone is more or less proposing that chunks of
a distributed application talk some simple API (.NET/J2EE) that connects
directly to the Reliable Comms Transport layer, which is The Path To Hell,
as I describe in the paper.

One other point.  The natural unit of distribution (how big a "chunk"
should be distributed) is NOT an object (except for object programmers ;).  
Distributing things at the granularity of an object is
just *begging* for trouble when you need to recover from a network
outage, a machine failure, or some other badness.  Distributed objects are
swell inside a small, contained topology, but sending millions of
objects into orbit around the Net (as both .NET and EJB/J2EE et al.
propose) will make all our applications run with the reliability we've
come to know and love with Windows - i.e., prepare to "reboot" your
distributed application's components regularly.

A far better scheme is to first do a thorough Business Process Flow
analysis and let that suggest what the major units of application
*functionality* might be.  Then distribute at the more coarse-grained
"fat component = several thousand objects" level.  This is essentially how
we divided up what was the largest commercial TX processing
system about 20 years ago (airline) - no objects, but the idea was the same.
We went from one very saturated giant mainframe to 7 mainframes (barely
keeping up ;).  We distributed at the very coarse-grained level of
"This machine does domestic fare quotes.", "This machine does intl fare
quotes.", "This machine does reservation bookings.", "This machine loses
your luggage.", and so on.
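
As a toy illustration of that idea (mine, not the airline system described
above), distributing at the "fat component" level means routing whole units
of work to the machine that owns a business function; the hostnames and
function names below are hypothetical, and `send` stands in for whatever
reliable messaging primitive sits underneath (e.g. the pooled transport
sketched earlier):

ROUTES = {
    "domestic_fare_quote": "fares-dom.example.net",
    "intl_fare_quote":     "fares-intl.example.net",
    "reservation_booking": "bookings.example.net",
}

def dispatch(function, payload, send):
    # Route a unit of work to the machine that owns that business function.
    # Recovery then means failing over one coarse-grained host, not chasing
    # millions of stranded remote objects.
    host = ROUTES[function]
    return send(host, payload)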

While this approach is heresy to the "everything is an object and distributed
objects are the wave of the future" religion being taught to the youngsters
in school, it is the only rational way to put up a system that runs
at 20,000+ TPC-A transactions per second, has 1 minute of outage per year, and
serves 100,000+ global customers with guaranteed round-trip response times of
<4 sec for the biggest customers.  (This is only possible on a private network -
you cannot begin to approach this on the Internet because of unpredictable
propagation delays in the fabric.)  IOW, you can distribute at the object
level only if you are certain that the system as a whole can be recovered
from failure in an adequately short (as defined by the business requirements)
period of time.  For the typical web shopping or even simple B2B application,
this might be feasible over the Net.  But, at least in my experience, for
really big mission-critical systems where human life or the economic
survival of an institution is at stake, distributed objects won't cut it.

I will look into OSE further - now I'm intrigued.
------------------------------------------------------------------------------
Tim Daneliuk
tundra at tundraware.com


