RE: [Thread-SIG] Re: [Python-Dev] baby steps for free-threading

In my experience, allowing/requiring programmers to specify sharedness is a very rich source of hard-to-find bugs.
My experience is the opposite, since most objects aren't shared. :)

You could probably do something like adding an "owning thread" to each object structure, and on a refcount operation throwing an exception if the object isn't shared and the current thread isn't the owner. Not sure if space is a concern, but since the object either is shared or needs its own mutex, you can make them a union:

    bool shared;
    union {
        python_thread_id_type id;
        python_mutex_type m;
    };

(Not saying I have an answer to the performance hit of locking on incref/decref, just saying that the development cost of 'shared' is very high.)

Greg
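[A minimal C sketch of that object header, assuming pthreads; the python_* typedefs come from the message above, everything else here is hypothetical, not CPython's actual layout:]

    #include <stdbool.h>
    #include <stdlib.h>
    #include <pthread.h>

    typedef pthread_t       python_thread_id_type;  /* assumed thread-id type */
    typedef pthread_mutex_t python_mutex_type;      /* assumed mutex type */

    typedef struct {
        long ob_refcnt;
        bool shared;                    /* flipped once the object escapes */
        union {
            python_thread_id_type id;   /* valid while !shared: the owner */
            python_mutex_type m;        /* valid once shared: per-object lock */
        } u;
    } obj_head;

    /* Refcount bump with the ownership check described above. */
    static void obj_incref(obj_head *op)
    {
        if (!op->shared) {
            if (!pthread_equal(op->u.id, pthread_self()))
                abort();                /* stand-in for raising an exception */
            op->ob_refcnt++;
        } else {
            pthread_mutex_lock(&op->u.m);
            op->ob_refcnt++;
            pthread_mutex_unlock(&op->u.m);
        }
    }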

On Wed, 19 Apr 2000, Salz, Rich wrote:
Regardless of complexity or lack thereof, any kind of "specified sharedness" cannot be implemented. Consider the case where a programmer forgets to note the sharedness. He passes the object to another thread. At certain points: BAM! The interpreter dumps core.

Guido has specifically stated that *nothing* should ever allow that (in terms of pure Python code; bad C extension coding is all right). Sharedness has merit, but it cannot be used :-(

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Too bad that we don't have incref/decref as methods. The possible mutables which have to be protected could in fact carry a thread handle of their current "owner" (probably the one who created them), and incref would check whether the owner is still the same. If it is not the same, then the owner field would be wiped, and that turns on the (higher-cost) shared refcounting, and all necessary protection as well. (Maybe some extra care is needed to ensure that this info isn't changed while we are testing it.)

Without inc/dec methods, something similar could be done, but every inc/decref will be a bit more expensive, since we must figure out whether we have a mutable or not.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH   :  Have a break! Take a ride on Python's
Kaunstr. 26               :  *Starship* http://starship.python.net
14163 Berlin              :  PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
where do you want to jump today? http://www.stackless.com

On Wed, 19 Apr 2000, Christian Tismer wrote:
... Too bad that we don't have incref/decref as methods.
This would probably impose more overhead than some of the atomic inc/dec mechanisms.
The possible mutables which have to be protected could
Non-mutable objects must be protected, too. An integer can be shared just as easily as a list.
Ah. Neat. "Automatic marking of shared-ness." Could work. That initial test for the thread id could be expensive, though. What is the overhead of getting the current thread id?

[ ... thinking about the code ... ]

Nope. Won't work at all. There is a race condition when an object "becomes shared":

    DECREF:
        if ( object is not shared )
            /* whoops! it just became shared! */
            --(op)->ob_refcnt;
        else
            atomic_decrement(op)

To prevent the race, you'd need an interlock, which is more expensive than an atomic decrement.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Uhh, right. Everything is mutable, since we mutate the refcount :-( ...
Zero if we cache it in the thread state.
[ ... thinking about the code ... ]
Nope. Won't work at all.
@#$%§!!-| yes-you-are-right - gnnn!
Really, sad but true. Are atomic decrements really so cheap, meaning "are they mapped to the atomic dec opcode"? Then this is all OK, IMHO.

ciao - chris

On Wed, 19 Apr 2000, Christian Tismer wrote:
You don't have the thread state at incref/decref time. And don't say "_PyThreadState_Current" or I'll fly to Germany and personally kick your ass :-)
On some platforms and architectures, they *might* be.

On Win32, we call InterlockedIncrement(). No idea what that does, but I don't think that it is a macro or compiler-detected thingy to insert opcodes. I believe there is a function call involved.

pthreads do not define atomic inc/dec, so we must use a critical section + normal inc/dec operators.

Linux has a kernel macro for atomic inc/dec, but it is only valid if __SMP__ is defined in your compilation context. Etc.

On platforms that do have an API (as Donn stated: BeOS has one; Win32 has one), it will be cheaper than an interlock. Therefore, we want to take advantage of an "atomic inc/dec" semantic when possible (and fall back to slower stuff when not).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
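[A hedged sketch of that per-platform selection. InterlockedIncrement and the pthread calls are real APIs; the ATOMIC_INC macro and the fallback function are made up for illustration:]

    #ifdef _WIN32
    #include <windows.h>
    /* Win32: a true atomic increment. */
    #define ATOMIC_INC(p)  InterlockedIncrement((LONG volatile *)(p))
    #else
    #include <pthread.h>
    /* pthreads: no atomic inc/dec, so guard a normal increment with a
       critical section -- the slower fallback mentioned above. */
    static pthread_mutex_t refcnt_lock = PTHREAD_MUTEX_INITIALIZER;
    static long atomic_inc_fallback(long *p)
    {
        long v;
        pthread_mutex_lock(&refcnt_lock);
        v = ++*p;
        pthread_mutex_unlock(&refcnt_lock);
        return v;
    }
    #define ATOMIC_INC(p)  atomic_inc_fallback(p)
    #endif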

Greg Stein wrote:
A real temptation to see whether I can really get you to Germany :-)) ... Thanks for all the info.
Linux has a kernel macro for atomic inc/dec, but it is only valid if __SMP__ is defined in your compilation context.
Well, and while it looks cheap, it is for sure expensive, since several caches are flushed and the system is stalled until the modified value is written back into the memory bank.

Could it be that we might want another thread design altogether? I'm thinking of running different interpreters in the same process space, but with all objects really disjoint, invisible between the interpreters. This would perhaps need some internal changes, in order to make all the builtin free-lists disjoint as well. Now each such interpreter would be running in its own thread, with no race conditions at all so far.

To make this into threading, and not just a flavor of multitasking, we of course need shared objects -- but only those objects which we really want to share. This could reduce the cost of free threading to nearly zero, except for the (hopefully) few shared objects.

I think, instead of shared globals, it would make more sense to have some explicit shared resource pool, which controls every access via mutexes/semas/whateverweneed. Maybe we would even prefer to copy objects into it rather than share them, in order to minimize collisions. I hope the need for true sharing can be reduced to a few variables. Well, I hope.

"Freethreads" could even coexist with the current locking threads; we would not even need a special build for them, but we would have to rethink threading. Like: "the more free threading is, the more disjoint threads are".

are-you-now-convinced-to-come-and-kick-my-ass-ly y'rs - chris :-)
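[A rough C sketch of such an explicit pool with copy-in/copy-out semantics instead of true sharing; all names and the fixed-size layout are hypothetical, just to make the idea concrete:]

    #include <string.h>
    #include <pthread.h>

    #define SLOT_SIZE 256

    /* One shared slot: every access goes through its mutex. */
    typedef struct {
        pthread_mutex_t lock;
        size_t len;
        char data[SLOT_SIZE];
    } shared_slot;

    static void slot_init(shared_slot *s)
    {
        pthread_mutex_init(&s->lock, NULL);
        s->len = 0;
    }

    /* Copy a value in; the object itself is never shared. */
    static void slot_put(shared_slot *s, const void *src, size_t n)
    {
        pthread_mutex_lock(&s->lock);
        s->len = n < SLOT_SIZE ? n : SLOT_SIZE;
        memcpy(s->data, src, s->len);
        pthread_mutex_unlock(&s->lock);
    }

    /* Copy the current value out; returns the number of bytes copied. */
    static size_t slot_get(shared_slot *s, void *dst, size_t cap)
    {
        size_t n;
        pthread_mutex_lock(&s->lock);
        n = s->len < cap ? s->len : cap;
        memcpy(dst, s->data, n);
        pthread_mutex_unlock(&s->lock);
        return n;
    }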

Chris> I think, instead of shared globals, it would make more sense to
Chris> have some explicit shared resource pool, which controls every
Chris> access via mutexes/semas/whateverweneed.

Tuple space, anyone? Check out http://www.snurgle.org/~pybrenda/ -- it's a Linda implementation for Python.

Linda was developed at Yale by David Gelernter. Unfortunately, he's better known to the general public as being one of the Unabomber's targets. You can find out more about Linda at http://www.cs.yale.edu/Linda/linda.html

Skip

Skip Montanaro wrote:
Very interesting, indeed.
Many broken links. Most activity appears to have stopped around 94/95; the project looks kinda dead. But this doesn't mean that we cannot learn from them. Will think more when the starship problem is over...

ciao - chris

>> http://www.cs.yale.edu/Linda/linda.html

Chris> Many broken links. Most activity appears to have stopped
Chris> around 94/95; the project looks kinda dead. But this doesn't mean
Chris> that we cannot learn from them.

Yes, I think Linda mostly lurks under the covers these days. Their Piranha project, which aims to soak up spare CPU cycles to do parallel computing, uses Linda. I suspect Linda is probably hidden somewhere inside Lifestreams as well.

As a correction to my original note, Nicholas Carriero was the other primary lead on Linda. I no longer recall the details, but he may have been one of Gelernter's grad students in the late 80's.

Skip

On Thu, 20 Apr 2000, Christian Tismer wrote:
*Steps out of the woodwork and bows*

PyBrenda doesn't have a thread implementation, but it could be adapted to do so. It might be prudent to eliminate the use of TCP/IP in that case as well.

In case anyone is interested, I just created a mailing list for PyBrenda at egroups: http://www.egroups.com/group/pybrenda-users

--
Milton L. Hankins                 \\ ><> Ephesians 5:2 ><>
http://www.snurgle.org/~mhankins  // <mlh@swl.msd.ray.com>
These are my opinions, not Raytheon's.  \\ W. W. J. D. ?

Linda is also the inspiration for Sun's JavaSpaces, an easier-to-use layer on top of Jini:

http://java.sun.com/products/javaspaces/
http://cseng.aw.com/bookpage.taf?ISBN=0-201-30955-6

On the plus side:

1. It's much (much) easier to use than mutex, semaphore, or monitor models: students in my parallel programming course could start writing C-Linda programs after (literally) five minutes of instruction.

2. If you're willing/able to do global analysis of access patterns, its simplicity doesn't carry a significant performance penalty.

3. (Bonus points) It integrates very well with persistence schemes.

On the minus side:

1. Some things that "ought" to be simple (e.g. barrier synchronization) are surprisingly difficult to get right, efficiently, in vanilla Linda-like systems. Some VHLL derivatives (based on SETL and Lisp dialects) solved this in interesting ways.

2. It's different enough from hardware-inspired shared-memory + mutex models to inspire the same "Huh, that looks weird" reaction as Scheme's parentheses, or Python's indentation. On the other hand, Bill Joy and company are now backing it...

Personal opinion: I've felt for 15 years that something like Linda could be to threads and mutexes what structured loops and conditionals are to the "goto" statement. Were it not for the "Huh" effect, I'd recommend hanging "Danger!" signs over threads and mutexes, and making tuple spaces the "standard" concurrency mechanism in Python.

I'd also recommend calling the system "Carol", after Monty Python regular Carol Cleveland. The story is that Linda itself was named after the 70s porn star Linda Lovelace, in response to the DoD naming its language "Ada" after the other Lovelace...

Greg

p.s. I talk a bit about Linda, and the limitations of the vanilla approach, in http://mitpress.mit.edu/book-home.tcl?isbn=0262231867

[Greg Wilson, on Linda and JavaSpaces]
There's no question about tuple spaces being easier to learn and to use, but Python slams into a conundrum here akin to the "floating-point versus *anything* sane <wink>" one: Python's major real-life use is as a glue language, and threaded apps (ditto IEEE-754 floating-point apps) are overwhelmingly what it needs to glue *to*. So Python has to have a good thread story. Free-threading would be a fine enhancement of it, and tuple spaces (spelled "PyBrenda" or otherwise) would be a fine alternative to it, but Python can't live without threads too.

And, yes, everyone who goes down Hoare's CSP road gets lost <0.7 wink>.

On Thu, 20 Apr 2000, Christian Tismer wrote:
Yes, Bill mentioned that yesterday. Important fact, but there isn't much you can do -- they must be atomic.
No. Now you're just talking processes with IPC. Yes, they happen to run in threads, but you got none of the advantages of a threaded application. Threading is about sharing an address space.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Greg Stein wrote:
On Thu, 20 Apr 2000, Christian Tismer wrote:
[me, about free threading with less sharing]
No. Now you're just talking processes with IPC. Yes, they happen to run in threads, but you got none of the advantages of a threaded application.
Are you sure that every thread user shares your opinion? I see many people using threads just in order to have multiple tasks in parallel, with no or very few shared variables.
Threading is about sharing an address space.
This is part of the truth. There are a number of other reasons to use threads, too.

Since Python has nothing really private, this in fact implies protecting every single object for free threading, although nobody wanted that to happen in the first place. Other languages have much fewer problems here (I mean C, C++, Delphi...); they are able to do the right thing in the right place. Python is not designed for that. Why do you want to enforce the impossible, making every object pay a high penalty to become completely thread-safe?

Sharing an address space should not mean sharing everything, but sharing something. If Python does not support this, we should think about a redesign of its threading model, instead of losing so much efficiency. You end up in a situation where all your C extensions can run free-threaded at high speed, while Python is busy all the time fighting the threading. That is not Python.

You know that I like to optimize things. For me, an optimization must give an overall gain, not just in one area while others get worse. If free threading cannot be optimized in a way that gives better overall performance, then it is a wrong optimization to me.

Well, this is all speculative until we take some measurements. Maybe I'm just complaining about 1-2 percent of performance loss; then I'd agree to move my complaining into /dev/null :-)

ciao - chris

On Fri, 21 Apr 2000, Christian Tismer wrote:
About the only time I use threads is when 1) I'm doing something asynchronous in an event loop-driven paradigm (such as Tkinter) or 2) I'm trying to emulate fork() under win32
How does Java solve this problem? (Is this analogous to native vs. green threads?)
Hmm, how about declaring only certain builtins as free-thread safe? Or is "the impossible" necessary because of the nature of incref/decref?

--
Milton L. Hankins :: ><> Ephesians 5:2 ><>
Software Engineer, Raytheon Systems Company :: <mlh@swl.msd.ray.com>
http://amasts.msd.ray.com/~mlh :: RayComNet 7-225-4728

http://www.javacats.com/US/articles/MultiThreading.html

I would like:

    sync foo:
        block of code here

Maybe we could merge in some Occam while we're at it. B^) sync would be a most excellent operator in Python.

http://www.cs.bris.ac.uk/~alan/javapp.html

Take a look at the above link. It merges the Occam model with Java and uses 'channel based' interfaces (not sure exactly what this is). But they seem pretty excited.

I vote for using InterlockedInc/Dec, as it is available as an assembly instruction on almost every platform. Could we then derive all other locking semantics from this? And our portability problem is solved if it comes in the box with gcc.

Channel-based programming has been called "the revenge of the goto", as in, "Where the hell does this channel go to?" Programmers must manage conversational continuity manually (i.e. keep track of the origins of messages, so that they can be replied to).

It also doesn't really help with the sharing problem that started this thread: if you want a shared integer, you have to write a little server thread that knows how to act like a semaphore, and then send it read/write requests that are exactly equivalent to P and V operations (and subject to all the same abuses).

Oh, and did I mention the joys of trying to draw a semi-accurate diagram of the plumbing in your program after three months of upgrade work? *shudder*

Greg
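[For reference, the P and V operations mentioned here are just semaphore wait/signal. A minimal pthreads sketch, with hypothetical names -- the channel version buries exactly this logic inside a server thread's message loop:]

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  nonzero;
        int count;
    } sema;

    static void sema_init(sema *s, int initial)
    {
        pthread_mutex_init(&s->lock, NULL);
        pthread_cond_init(&s->nonzero, NULL);
        s->count = initial;
    }

    /* P: wait -- block until the count is positive, then decrement. */
    static void P(sema *s)
    {
        pthread_mutex_lock(&s->lock);
        while (s->count == 0)
            pthread_cond_wait(&s->nonzero, &s->lock);
        s->count--;
        pthread_mutex_unlock(&s->lock);
    }

    /* V: signal -- increment the count and wake one waiter. */
    static void V(sema *s)
    {
        pthread_mutex_lock(&s->lock);
        s->count++;
        pthread_cond_signal(&s->nonzero);
        pthread_mutex_unlock(&s->lock);
    }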

On Fri, 21 Apr 2000, Christian Tismer wrote:
Now you're just being argumentative. I won't respond to this.
Existing Python semantics plus free-threading places us in this scenario. Many people have asked for free-threading, and the number of inquiries that I receive has grown over time. (Nobody asked in 1996 when I first published my patches; I get a query every couple of months now.)
It is more than this. In my last shot at this, pystone ran about half as fast. There are a few things that will be different this time around, but it certainly won't be in the "few percent" range.

Presuming you can keep your lock contention low, your overall performance *goes up* once you have a multiprocessor machine. Sure, each processor runs Python (say) 10% slower, but you have *two* of them going. That is 180% (2 x 90%), compared to a central-lock Python on an MP machine, which can only ever use one processor at a time.

Lock contention: my last patches had really high contention. It didn't scale across processors well. This round will have more fine-grained locks than the previous version. But it will be interesting to measure the contention.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Interesting thought: according to patches recently posted to patches@python.org (but not yet vetted), "turning on" threads on Win32 in regular Python also slows down Pystone considerably. Maybe it's not so bad? Maybe those patches contain a hint of what we could do? --Guido van Rossum (home page: http://www.python.org/~guido/)

On Fri, 21 Apr 2000, Guido van Rossum wrote:
I think that my tests were threaded vs. free-threaded. It was so long ago, though... :-)

Yes, we'll get those patches reviewed and installed. That will at least help the standard threading case.

With more discrete locks (e.g. one per object or one per code section), we will reduce lock contention. Working on improving the lock mechanism itself and the INCREF/DECREF system will help, too.

But this initial thread was to seek people to assist with some coding to get stuff into 1.6. The heavy lifting will certainly come after 1.6, but we can get some good stuff in *today*. We'll examine performance later on, then start improving it.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Guido van Rossum wrote:
I had a rough look at the patches but didn't understand enough yet. But I tried the sample scriptlet on Python 1.5.2 and Stackless Python -- see here:

    D:\python>python -c "import test.pystone;test.pystone.main()"
    Pystone(1.1) time for 10000 passes = 1.96765
    This machine benchmarks at 5082.2 pystones/second

    D:\python>python spc/threadstone.py
    Pystone(1.1) time for 10000 passes = 5.57609
    This machine benchmarks at 1793.37 pystones/second

This is even worse than Markovitch's observation. Now, let's try with Stackless Python:

    D:\python>cd spc

    D:\python\spc>python -c "import test.pystone;test.pystone.main()"
    Pystone(1.1) time for 10000 passes = 1.843
    This machine benchmarks at 5425.94 pystones/second

    D:\python\spc>python threadstone.py
    Pystone(1.1) time for 10000 passes = 3.27625
    This machine benchmarks at 3052.27 pystones/second

Isn't that remarkable? Stackless performs nearly 1.8 times as well under threads. Why? I've optimized the ticker code away for all those "fast" opcodes which can never cause another interpreter incarnation. Standard Python does a bit too much here, dealing with extremely fast opcodes like POP_TOP the same way as with a function call. Responsiveness is still very good.

Markovitch's example also tells us this story: even with his patches, the threading stuff still costs 10 percent. This is the lock that we touch every ten opcodes. In other words: touching a lock costs about as much as an opcode costs on average.

ciao - chris

threadstone.py:

    import thread

    # Start an empty thread to initialise the thread machinery (and the
    # global lock!). This thread finishes immediately, so it influences
    # the test results only by the fact that it initialises the global lock.
    thread.start_new_thread(lambda: 1, ())

    import test.pystone
    test.pystone.main()

Greg,

Greg Stein wrote: <snip/>
Why didn't I think of this? MP is a very, very good point. It all makes much more sense to me now.

sorry for being dumb - happy easter - chris

[Greg Stein]
Huh! That means people ask me about it more often than they ask you <wink>. I'll add, though, that you have to dig into the inquiry: almost everyone who asks me is running on a uniprocessor machine, and is really after one of two other things:

1. They expect threaded stuff to run faster if free-threaded. "Why?" is a question I can't answer <0.5 wink>.

2. Dealing with the global lock drives them insane, especially when trying to call back into Python from a "foreign" C thread.

#2 may be fixable via less radical means (like a streamlined procedure enabled by some relatively minor core interpreter changes, and clearer docs).

I'm still a fan of free-threading! It's just one of those things that may yield a "well, ya, that's what I asked for, but turns out it's not what I *wanted*" outcome as often as not.

enthusiastically y'rs - tim

On Tue, 25 Apr 2000, Tim Peters wrote:
Heh. Yes, I definitely see this one. But there are some clueful people out there, too, so I'm not totally discouraged :-)
No doubt. I was rather upset with Guido's "Swap" API for the thread state. Grr. I sent him a very nice (IMO) API that I used for my patches.

The Swap was simply a poor choice on his part. It implies that you are swapping one thread state for another (specifically: the "current" thread state). Of course, that is wholly inappropriate in a free-threading environment. All those calls to _Swap() will be overhead in an FT world.

I liked my "PyThreadState *PyThreadState_Ensure()" function. It would create the sucker if it didn't exist, then return *this* thread's state to you. Handy as hell. No monkeying around with "Get. Oops, didn't exist. Let's create one now."
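[A guess at what such an Ensure could look like, using pthreads thread-local storage. This is not the actual API Greg sent; the key setup and the constructor are assumed:]

    #include <pthread.h>

    typedef struct PyThreadState PyThreadState;   /* opaque here */

    extern PyThreadState *new_threadstate(void);  /* assumed constructor */
    static pthread_key_t tstate_key;              /* assumed: created once at
                                                     interpreter startup with
                                                     pthread_key_create() */

    /* Return this thread's state, creating it on first use. */
    PyThreadState *PyThreadState_Ensure(void)
    {
        PyThreadState *ts = pthread_getspecific(tstate_key);
        if (ts == NULL) {
            ts = new_threadstate();
            pthread_setspecific(tstate_key, ts);
        }
        return ts;
    }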
hehe. Damn straight. :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

participants (9)
- Christian Tismer
- Greg Stein
- Guido van Rossum
- gvwilson@nevex.com
- Milton L. Hankins
- Salz, Rich
- Sean Jensen_Grey
- Skip Montanaro
- Tim Peters