RE: Moving towards Python 3.0 (was Re: [Python-Dev] Speed up function calls)
Evan Jones writes:
My knowledge about garbage collection is weak, but I have read a little bit of Hans Boehm's work on garbage collection. [...] The biggest disadvantage mentioned is that simple pointer assignments end up becoming "increment ref count" operations as well...
Hans Boehm certainly has some excellent points. I believe a little searching through the Python-Dev archives will reveal that attempts have been made in the past to use his GC tools with CPython, and that the results have been disappointing. That may be because other parts of CPython are optimized for reference counting, or it may be just because this stuff is so bloody difficult!

However, remember that moving away from reference counting is a change to the semantics of CPython. Right now, people can (and often do) assume that objects which don't participate in a reference loop are collected as soon as they go out of scope. They write code that depends on this... idioms like:

>>> text_of_file = open(file_name, 'r').read()

Perhaps such idioms aren't good practice (they'd fail in Jython or in IronPython), but they ARE common. So we shouldn't stop using reference counting unless we can demonstrate that the alternative is clearly better. Of course, we'd also need to devise a way for extensions to cooperate (which is a problem Jython, at least, doesn't face).

So it's NOT an obvious call, and so far numerous attempts to review other GC strategies have failed. I wouldn't be so quick to dismiss reference counting.
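The semantic difference being discussed can be demonstrated directly. Below is a small sketch (the `Tracked` class and `use_and_drop` function are illustrative names, not from the thread) showing that CPython's reference counting finalizes an object the moment its last reference disappears; under a delayed tracing collector, the `__del__` call could happen arbitrarily later, or not at all before exit:

```python
class Tracked:
    """Records when instances are finalized."""
    finalized = []

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # In CPython this runs as soon as the refcount hits zero.
        Tracked.finalized.append(self.name)

def use_and_drop():
    t = Tracked("temp")
    # 't' goes out of scope when this function returns; CPython's
    # reference counting finalizes it immediately, without waiting
    # for a collection cycle.

use_and_drop()
print(Tracked.finalized)  # On CPython: ['temp']
```

On Jython or IronPython the list could still be empty at the point of the `print`, which is exactly why the `open(...).read()` idiom above is implementation-dependent.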
My only argument for making Python capable of leveraging multiple processor environments is that multithreading seems to be where the big performance increases will be in the next few years. I am currently using Python for some relatively large simulations, so performance is important to me.
CPython CAN leverage such environments, and it IS used that way. However, this requires using multiple Python processes and inter-process communication of some sort (there are lots of choices; take your pick). It's a technique which is more trouble for the programmer, but in my experience it is less likely to contain subtle parallel-processing bugs.

Sure, it'd be great if Python threads could make use of separate CPUs, but if the cost of that were that Python dictionaries performed as poorly as a Java Hashtable or a synchronized HashMap, then it wouldn't be worth it. There's a reason why Java moved away from Hashtable (the threadsafe data structure) to HashMap (not threadsafe).

Perhaps the REAL solution is just a really good IPC library that makes it easier to write programs that launch "threads" as separate processes and communicate with them. No change to the internals, just a new library to encourage people to use the technique that already works.

-- Michael Chermside
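The multiple-processes-plus-IPC technique Michael describes can be sketched with nothing but the standard library. This is a Unix-only illustration (it uses `os.fork` and a pipe; the `worker` function and the specific task are made up for the example), where each process has its own interpreter and GIL, so CPU-bound work genuinely runs in parallel:

```python
import json
import os

def worker(task, write_fd):
    """Child-process body: do some CPU-bound work, send the result
    back over the pipe as JSON, and exit without cleanup."""
    result = sum(x * x for x in task)
    os.write(write_fd, json.dumps(result).encode())
    os._exit(0)

read_fd, write_fd = os.pipe()
pid = os.fork()
if pid == 0:
    # Child process: close the end it doesn't use, do the work.
    os.close(read_fd)
    worker(range(1000), write_fd)
else:
    # Parent process: close the write end, read the child's answer.
    os.close(write_fd)
    data = os.read(read_fd, 65536)
    os.waitpid(pid, 0)
    result = json.loads(data)
    print(result)  # sum of squares 0..999 = 332833500
```

The shared-nothing design is the point: the only way the processes interact is through the explicit pipe, so there is simply no shared mutable state in which a subtle race could hide.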
Michael> CPython CAN leverage such environments, and it IS used that
Michael> way. However, this requires using multiple Python processes
Michael> and inter-process communication of some sort (there are lots of
Michael> choices, take your pick). It's a technique which is more
Michael> trouble for the programmer, but in my experience usually has
Michael> less likelihood of containing subtle parallel processing
Michael> bugs.

In my experience, when people suggest that "threads are easier than IPC", it means that their code is sprinkled with subtle parallel processing bugs.

Michael> Perhaps the REAL solution is just a really good IPC library
Michael> that makes it easier to write programs that launch "threads" as
Michael> separate processes and communicate with them.

Tuple space, anyone?

Skip
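Skip's "tuple space" refers to the Linda coordination model: processes communicate by putting tuples into a shared space and taking them out by pattern matching. A minimal in-process sketch of the idea (the `TupleSpace` class is purely illustrative; a real tuple space would live between processes, not threads):

```python
import threading

class TupleSpace:
    """A toy Linda-style tuple space: put() adds a tuple, take()
    blocks until a tuple matching the pattern exists, then removes
    and returns it. None in a pattern matches any value."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def put(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, pattern):
        def matches(tup):
            return len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup))
        with self._cond:
            while True:
                for tup in self._tuples:
                    if matches(tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

space = TupleSpace()
space.put(("result", 42))
match = space.take(("result", None))
print(match)  # ('result', 42)
```

The appeal for the IPC discussion is that producers and consumers never share objects directly; all coordination goes through the space, which is easy to move across a process boundary.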
On Mon, 2005-01-31 at 08:51 -0800, Michael Chermside wrote:
However, remember that changing away from reference counting is a change to the semantics of CPython. Right now, people can (and often do) assume that objects which don't participate in a reference loop are collected as soon as they go out of scope. They write code that depends on this... idioms like:
>>> text_of_file = open(file_name, 'r').read()
Perhaps such idioms aren't a good practice (they'd fail in Jython or in IronPython), but they ARE common. So we shouldn't stop using reference counting unless we can demonstrate that the alternative is clearly better. Of course, we'd also need to devise a way for extensions to cooperate (which is a problem Jython, at least, doesn't face).
I agree that the issue is highly subtle, but this reason strikes me as kind of bogus. The problem here is not that the semantics are really different, but that Python doesn't treat file descriptors as an allocatable resource and therefore doesn't trigger the GC when they are exhausted. As it stands, this idiom works most of the time; if an EMFILE errno triggered the GC, it would always work. Obviously this would be difficult to implement pervasively, but maybe it should be a guideline for alternative implementations to follow, so that they don't fall into situations where tricks like this one, which are perfectly valid both semantically and in regular Python, fail due to an interaction with the OS...?
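Glyph's suggestion (run the collector when descriptors run out, then retry) can be sketched at the library level, even though he is proposing it as interpreter behavior. The wrapper below is a hypothetical illustration, not anything the thread's participants wrote, and it uses the modern `OSError` hierarchy:

```python
import errno
import gc

def open_with_gc_retry(path, mode="r"):
    """Open a file; if descriptors are exhausted (EMFILE), run the
    garbage collector, which may reclaim unreferenced file objects
    and close their descriptors, then retry once."""
    try:
        return open(path, mode)
    except OSError as exc:
        if exc.errno != errno.EMFILE:
            raise
        gc.collect()  # may finalize forgotten file objects
        return open(path, mode)

# Hypothetical demo: write a scratch file, then reopen it through
# the wrapper (the happy path, where no retry is needed).
with open("scratch.txt", "w") as f:
    f.write("hello")

f = open_with_gc_retry("scratch.txt")
text = f.read()
f.close()
print(text)  # hello
```

Doing this inside the interpreter's own `open` is what would make the `open(...).read()` idiom reliable under a delayed collector.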
On Monday 31 January 2005 14:08, Glyph Lefkowitz wrote:
As it stands, this idiom works most of the time, and if an EMFILE errno triggered the GC, it would always work.
That might help things on Unix, but I don't think that's meaningful. Windows is much more sensitive to files being closed, and the refcount solution supports that more effectively than delayed garbage collection strategies. With the current approach, you can delete the file right away after releasing the last reference to the open file object, even on Windows. You can't do that with delayed GC, since Windows will be convinced that the file is still open and refuse to let you delete it. To fix that, you'd have to trigger GC from the failed removal operation and try again. I think we'd find there are a lot more operations that need that support than we'd like to think.

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>
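Fred's point is that EMFILE is only one of many OS interactions that depend on timely finalization: on Windows, `os.remove` fails while any handle to the file is open. The portable way to sidestep the whole question is an explicit close before the delete, as in this small sketch (file names here are generated by `tempfile` for the example):

```python
import os
import tempfile

# Create a scratch file path to work with.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")
f.write("data")
f.close()        # explicit close: the only guarantee that holds on
                 # every implementation (CPython, Jython, ...) and OS

os.remove(path)  # succeeds even on Windows once the handle is closed
print(os.path.exists(path))  # False
```

Relying on reference counting instead of the explicit `close()` happens to work on CPython, which is exactly the implementation-specific semantics the rest of the thread is debating.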
participants (4)

- Fred L. Drake, Jr.
- Glyph Lefkowitz
- Michael Chermside
- Skip Montanaro