[Tutor] Starbucks does not use two-phase commit

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sun Jan 22 09:13:57 CET 2006



On Sat, 21 Jan 2006, Todd Maynard wrote:

> I want to thank you for ruining my plans for a relaxing Saturday
> morning.  As a thread newbie I killed several hours playing around with
> your code.

Hi Todd,

Sorry about that.  I hope you were relaxing in a cafe while playing with
the code.


> One thing I noticed is that sometimes the program would hang, which I
> figured was the Queue code blocking in the Ticket claim function. I used
> exception handling to deal with that situation cleanly.

That's odd.  There shouldn't be anything that blocks the code.  Oh!  Did
you make changes to the test code, or did the hanging occur in the
original code in:

    http://mail.python.org/pipermail/tutor/2006-January/044567.html

I'm curious because nothing there should fundamentally block, assuming
that _doJob() doesn't die badly with an exception.  If _doJob() dies, the
server dies, and that's bad.  *grin*

Do you mind showing what the exception handling looks like in your code?
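
Just guessing at what you did: here's a minimal sketch of the kind of
thing I'd imagine, using the timeout argument to Queue.get() and catching
Queue.Empty.  (Ticket, claim(), and outputQueue are the names I'm assuming
from the earlier code; your version may well differ.)

    from Queue import Queue, Empty

    class Ticket:
        def __init__(self, outputQueue):
            self.outputQueue = outputQueue

        def claim(self, timeout=5.0):
            ## Wait for the result, but give up after `timeout` seconds
            ## instead of blocking forever.
            try:
                return self.outputQueue.get(block=True, timeout=timeout)
            except Empty:
                print "Gave up waiting; no result arrived in time."
                return None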



> I then decided that it wasn't very nice of Starbucks to close after
> accepting my order without giving me my Latte, so I changed that part of
> the code to:

[code cut]

> I am 99.44% sure that this is thread safe, reasoning being:
> 	 setting the acceptNew to False and adding the QUIT_NOW happens in the same
> thread so it is impossible for another job to get scheduled after the
> QUIT_NOW - so no thread will end up hanging...

Bad news: no.  *grin*

There's a "race condition".  Let's go into some detail with this, since
this is not obvious stuff.

First, let's look at the code again --- I'll label three lines with (a),
(b), and (c), to make it a little easier to see the race.


###############################################################
def schedule(self,job):
    if self.acceptNew == True:                           ## (a)
        outputQueue=Queue()
        self.queue.put((job,outputQueue))
        return Ticket(outputQueue)
    else:
        print "Server not accepting any new requests."
        return None

def scheduleShutdown(self):                              ## (b)
    self.queue.put((Server._QUIT_NICELY,None))

def _jobLoop(self):
    while True:
        print "Looping ... "
        (nextJob, outputQueue) = self.queue.get()
        if nextJob is Server._QUIT_NOW:
            return
        if nextJob is Server._QUIT_NICELY:               ## (c)
            self.acceptNew = False
            self.queue.put((Server._QUIT_NOW,None))
        else:
            returnValue=self._doJob(nextJob)
            outputQueue.put(returnValue)
##############################################################


Let's imagine three threads, which I'll name C1, C2, and S.  C1 and C2
will be distinct client threads, and S will be the server thread that runs
through _jobLoop().

Imagine the following scenario.  The server's online, and its work queue
is empty.

    1.  C1 calls schedule(), and reaches the line labeled (a).  At this
        point, server.acceptNew is True, so it goes into the body of the
        if statement.  But wait...

    2.  Now we context switch to C2.  C2 calls scheduleShutdown()
        in its entirety.  There is now a _QUIT_NICELY element in the
        queue.  C2 is done for.

    3.  Now we context switch to the server thread S.  It grabs the
        _QUIT_NICELY, and puts a _QUIT_NOW.  Let's imagine that S
        continues and loops again.  In the next loop through _jobLoop(),
        it sees _QUIT_NOW and exits.   S is done for.  Muhahaha.

    4.  Now we context switch back to C1 and continue with:

        outputQueue = Queue()
        self.queue.put((job,outputQueue))
        return Ticket(outputQueue)

In this scenario, poor C1 is left holding a ticket that will never cash
out.
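
To make that concrete, here's a small sketch of what C1's side looks
like, assuming the Ticket class just wraps a get() on its outputQueue
(the names are my guesses at your code):

    ticket = server.schedule(someJob)   ## step 4: the job is queued, but
                                        ## S has already exited ...
    result = ticket.claim()             ## ... so nothing will ever put() a
                                        ## result, and this blocks forever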

One way to fix this problem is to make calling schedule() and
scheduleShutdown() "atomic" in this sense: if we're calling schedule(), we
shouldn't be able to context switch into a call to scheduleShutdown(), and
vice versa.

Our troubles started at step 2 of the above scenario, where two clients
jostled for attention.  If we prevent that particular situation --- if we
force all our clients to stand in line to get served --- then we'll be
fine.  So we might look into some synchronizing tool, like a Lock object:

    http://www.python.org/doc/lib/lock-objects.html

Concretely, we can add an exclusive Lock object (from the threading
module) to the server's __init__:

    from threading import Lock

    def __init__(self):
        self.clientLock = Lock()

and make sure to acquire-and-release in each of our schedule* functions
(that is, in both schedule() and scheduleShutdown()):

    def schedule*(self, job):
        self.clientLock.acquire()
        try:
            ...
        finally:
            self.clientLock.release()
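
Spelled out, and this is only a sketch reusing the body of your
schedule() above, the two guarded methods might look like:

    def schedule(self, job):
        self.clientLock.acquire()
        try:
            if self.acceptNew == True:
                outputQueue = Queue()
                self.queue.put((job, outputQueue))
                return Ticket(outputQueue)
            else:
                print "Server not accepting any new requests."
                return None
        finally:
            self.clientLock.release()

    def scheduleShutdown(self):
        self.clientLock.acquire()
        try:
            self.queue.put((Server._QUIT_NICELY, None))
        finally:
            self.clientLock.release()

With the lock held, no _QUIT_NICELY can slip into the queue between the
check at (a) and the put() that follows it.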



> However, I would sleep a little better if you could reassure me that I
> am right, and would sleep even better if you could give me a method to
> test this.

I'm sorry; I can't provide either.  That doesn't mean that such things
don't exist, but only that I don't know about them.  (The only formal
training I've received on this, so far, has been a standard intro
Operating Systems CS course.)

So you might want to check with others.  The way I caught the race
condition above was just by trying to be in a very foul mood while
reading the code.



> This kinda stuff looks tricky to test with standard unittest
> methodology....

It might be hard.  It's an active CS research topic to formally check for
race conditions.  There may be commercial tools to build unit tests to
hammer for race conditions, but I don't know much about them.

For example, Associate Professor Cormac Flanagan does research on
statically detecting race conditions:

    http://www.soe.ucsc.edu/~cormac/

And Google Scholar does show a heck of a lot of recent papers on this:

http://scholar.google.com/scholar?hl=en&lr=&q=thread+race+condition&btnG=Search

so this might not be a closed issue yet.


Hmmm... hey, Flanagan's paper looks interesting!  Ok, I'm printing it out
now... now where's my coffee...  Weekend reading for me.

Well, thanks for ruining my weekend too.  *grin*


