[Python-Dev] deja-vu .. python locking

Tue Sep 19 09:51:18 CEST 2006

> Ah, I think I understand now. First the minor critique: I believe
> the locking algorithm isn't thread-safe:
> 
>   while (ob->owner_thread != self_thread()) {
> 	 unlock_mutex(thread_mutex[self_thread()])
> 	// wait for owning thread to go to quiscent state
> 	 lock_mutex(thread_mutex[ob->owner_thread])
> 	 ob->owner_thread = self_thread()
> 	 unlock_mutex(thread_mutex[ob->owner_thread])
> 	 lock_mutex(thread_mutex[self_thread()])
>   }
> 
> If two threads are competing for the same object held by a third
> thread, they may simultaneously enter the while loop, and then
> simultaneously try to lock the owner_thread. Now, one will win,
> and own the object. Later, the other will gain the lock, and
> unconditionally overwrite ownership. This will cause two threads
> to own the objects, which is an error.

oops .. well it seems as very stupid error on my side. Yes you are
absolutely right, I'll have to rethink it. I hope it is possible
to do it in correct way...

> The more fundamental critique is: Why? It seems you do this
> to improve efficiency, (implicitly) claiming that it is
> more efficient to keep holding the lock, instead of releasing
> and re-acquiring it each time.
> 
> I claim that this doesn't really matter: any reasonable
> mutex implementation will be "fast" if there is no lock
> contention. On locking, it will not invoke any system
> call if the lock is currently not held (but just
> atomically test-and-set some field of the lock); on
> unlocking, it will not invoke any system call if
> the wait list is empty. As you also need to test, there
> shouldn't be much of a performance difference.

I measured it. Lock op in futex based linux locking is of the same
speed as windows critical section and it is about 30 cycles on my
P4 1.8GHz in uncontented case.
As explained in already mentioned http://www.linuxjournal.com/article/6993
it seems due to pipeline flush during cmpxchg insn.
And there will be cacheline transfer penalty which is much larger. So
that mutex locking will take time comparable with protected code itself
(assuming fast code like dict/list read).
Single compare will take ten times less.
Am I missing something ?

thanks, Martin