
At 01:23 AM 8/1/2001 -0400, Tim Peters wrote:
[Paul Prescod]
What is the downside of the global lock on the average single processor machine? I tend to think that the "default" threading model should allow simple and easy, everything-shared multi-threading on ordinary machines. Having a multi-processor-friendly advanced mode is a great extension for the wizards.
[Dan Sugalski]
If you hold the lock during an I/O operation, you'll lose time you could otherwise have used. Getting and releasing a global lock frequently also costs performance you could have spent elsewhere. Mutex releases require memory coherency, which forces your CPU to flush any pending writes that might be hanging about, and that tends to drop its efficiency, especially on heavily out-of-order machines like the Alpha.
Also, that is a zillion and a half mutex acquisitions and releases, most of which you probably have no need of.
Python doesn't actually suffer from either of these problems: while there's a pair of acquire/release-global-lock macros around potentially blocking I/O calls in Python's runtime (ditto sleep(), etc.), no mutex is actually allocated before *somebody* calls PyEval_InitThreads.
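Concretely, that macro pair looks something like this in an extension function (a minimal sketch; read_some_data() and reading from fd 0 are invented for illustration, though Py_BEGIN_ALLOW_THREADS, Py_END_ALLOW_THREADS, and PyEval_InitThreads() are the real names):

    #include <Python.h>
    #include <unistd.h>

    /* Release the global lock around a blocking call, and reacquire
     * it before touching any Python object again.  The lock itself
     * isn't created until somebody calls PyEval_InitThreads(), so a
     * program that never starts a thread never pays for it. */
    static PyObject *
    read_some_data(PyObject *self, PyObject *args)
    {
        char buf[4096];
        ssize_t n;

        Py_BEGIN_ALLOW_THREADS          /* drop the global lock */
        n = read(0, buf, sizeof(buf));  /* block without holding it */
        Py_END_ALLOW_THREADS            /* take it back */

        if (n < 0)
            return PyErr_SetFromErrno(PyExc_IOError);
        return PyString_FromStringAndSize(buf, (int)n);
    }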
Yeah, I figured you didn't initialize or acquire the mutex until it was actually needed. One of the nice side benefits of the global opcode-acquired lock.
However, Python calls the platform's thread-safe libraries regardless, and *that* can be a huge speed hit. A minor example is that system malloc() is more expensive in Microsoft's thread-safe version of libc.
Everywhere else too, I'd bet. I've been considering thread-specific memory pools because of this. (Well, this is one reason, at least)
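One shape such a pool might take (a hypothetical sketch, not anything perl ships; pool_alloc(), pool_free(), and BLOCK_SIZE are invented): each thread keeps a private free list reached through pthread_getspecific(), so the fast path never touches a shared lock and only the fallback pays the threadsafe-malloc() toll.

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical per-thread pool: each thread owns a free list of
     * fixed-size blocks, so the common case never takes a lock and
     * only the fallback hits the (locked) system malloc(). */
    #define BLOCK_SIZE 64                 /* must hold a struct block */

    typedef struct block { struct block *next; } block;

    static pthread_key_t pool_key;
    static pthread_once_t pool_once = PTHREAD_ONCE_INIT;

    static void pool_init(void)
    {
        /* NULL destructor: blocks leak at thread exit in this sketch */
        pthread_key_create(&pool_key, NULL);
    }

    static void *pool_alloc(void)
    {
        block *b;
        pthread_once(&pool_once, pool_init);
        b = pthread_getspecific(pool_key);
        if (b != NULL) {                          /* fast path: no lock */
            pthread_setspecific(pool_key, b->next);
            return b;
        }
        return malloc(BLOCK_SIZE);                /* slow path: locked */
    }

    static void pool_free(void *p)
    {
        block *b = p;
        pthread_once(&pool_once, pool_init);
        b->next = pthread_getspecific(pool_key);  /* push onto our list */
        pthread_setspecific(pool_key, b);
    }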
A monster example is speed of line-at-a-time input: we only recently discovered that Python's getc()-in-a-loop was killing us on many platforms because the platform's threadsafe library implementation locked and unlocked the stream for each character.
Ouch, I'd bet that hurts. Has anyone timed the difference between making lots of getc calls and making a few larger reads and managing the buffers internally? I can see it going either way, and another data point would be useful to have.
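A quick harness along these lines would produce that data point (a sketch; big.txt and the buffer size are arbitrary): count newlines once with getc() per character and once with fread() into a buffer scanned by hand, and time both.

    #include <stdio.h>
    #include <time.h>

    /* getc() per character pays a lock/unlock per byte on a
     * threadsafe libc; fread() amortizes it over a whole buffer. */
    static long count_getc(FILE *fp)
    {
        long lines = 0;
        int c;
        while ((c = getc(fp)) != EOF)
            lines += (c == '\n');
        return lines;
    }

    static long count_fread(FILE *fp)
    {
        char buf[8192];
        long lines = 0;
        size_t n, i;
        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            for (i = 0; i < n; i++)
                lines += (buf[i] == '\n');
        return lines;
    }

    int main(void)
    {
        FILE *fp = fopen("big.txt", "r");   /* arbitrary test file */
        clock_t t0;
        long lines;

        if (!fp)
            return 1;

        t0 = clock();
        lines = count_getc(fp);
        printf("getc:  %ld lines in %.2fs\n", lines,
               (double)(clock() - t0) / CLOCKS_PER_SEC);

        rewind(fp);
        t0 = clock();
        lines = count_fread(fp);
        printf("fread: %ld lines in %.2fs\n", lines,
               (double)(clock() - t0) / CLOCKS_PER_SEC);

        fclose(fp);
        return 0;
    }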
Worming around that brought our input speed much closer to Perl's (up to 50x faster on Tru64 Unix).
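The usual shape of that worming-around on POSIX-ish systems (a sketch of the general technique, not necessarily Python's exact code) is to take the stream lock once per line and use the unlocked getc variant inside the loop:

    #include <stdio.h>

    /* Lock the stream once per line instead of once per character:
     * flockfile()/funlockfile() bracket the loop, and getc_unlocked()
     * skips the per-call locking a threadsafe getc() performs. */
    static char *
    get_line(char *buf, int size, FILE *fp)
    {
        char *p = buf;
        int c = EOF;

        flockfile(fp);
        while (--size > 0 && (c = getc_unlocked(fp)) != EOF) {
            *p++ = (char)c;
            if (c == '\n')
                break;
        }
        funlockfile(fp);

        if (p == buf && c == EOF)
            return NULL;                /* nothing read */
        *p = '\0';
        return buf;
    }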
FWIW, at least on Tru64 and VMS, there are a number of faster thread calls if you don't mind bypassing some library error checking. (Which is OK if you're guaranteed to have correct parameters) I can dig them up if you like. A tweak to pthread_getspecific bought a depressingly large performance boost for me on my VMS box. (I did say perl's threading models weren't that good... :) Not worth it if you don't make many calls, though.
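Even without the platform-specific fast calls, hoisting the lookup out of the hot loop cuts the call count (a sketch; my_key and do_work() are stand-ins):

    #include <pthread.h>

    extern pthread_key_t my_key;              /* assumed created elsewhere */
    extern void do_work(void *state, int i);  /* hypothetical work function */

    void hot_loop(int n)
    {
        /* The per-thread value can't change out from under us, so fetch
         * it once rather than paying for pthread_getspecific() (and its
         * parameter checking) on every iteration. */
        void *state = pthread_getspecific(my_key);
        int i;

        for (i = 0; i < n; i++)
            do_work(state, i);
    }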
It's still slower on most boxes, though, because we're still threadsafe; but, last I looked, Perl's line-at-a-time input tricks mucked with stdio structs directly, without benefit of exclusion (and are not threadsafe).
Yep, perl's pthread-based threading model doesn't guarantee threadsafe I/O on many platforms. The alternative threading model doesn't use that code path. (It's currently primarily Windows-based, and we don't do the buffer-lookbehind stuff there)
... (I do work on SMP machines as a rule, so I am a little biased against things that single-thread me when I don't need it--what's the point of 500% idle time?)
Greg Stein is the fellow to talk with about "free threading" of Python. He had that at least mostly working several years ago, but it was a major project; that patch is way out of date now, and Python is much more elegant now. Oops! I didn't mean "elegant", I meant "bigger" <wink>.
:) I'm as much (or more, which is generally an odd thing, but threads are almost inherently odd) interested in how things look at the programmer level and what guarantees are made as in how things look under the hood. (It's all just a SMOP, right?)

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk