Memory Allocator Part 2: Did I get it right?
After I finally understood what thread-safety guarantees the Python memory allocator needs to provide, I went and did some hard thinking about the code this afternoon. I believe that my modifications provide the same guarantees that the original version did. I do need to declare the arenas array to be volatile, and leak the array when resizing it. Please correct me if I am wrong, but the situation that needs to be supported is this: While one thread holds the GIL, any other thread can call PyObject_Free with a pointer that was returned by the system malloc. The following situation is *not* supported: While one thread holds the GIL, another thread calls PyObject_Free with a pointer that was returned by PyObject_Malloc. I'm hoping that I got things a little better this time around. I've submitted my updated patch to the patch tracker. For reference, I've included links to SourceForge and the previous thread. Thank you, Evan Jones Previous thread: http://mail.python.org/pipermail/python-dev/2005-January/051255.html Patch location: http://sourceforge.net/tracker/index.php? func=detail&aid=1123430&group_id=5470&atid=305470
[Evan Jones]
After I finally understood what thread-safety guarantees the Python memory allocator needs to provide, I went and did some hard thinking about the code this afternoon. I believe that my modifications provide the same guarantees that the original version did. I do need to declare the arenas array to be volatile, and leak the array when resizing it. Please correct me if I am wrong, but the situation that needs to be supported is this:
As I said before, I don't think we need to support this any more. More, I think we should not -- the support code is excruciatingly subtle, it wasted plenty of your time trying to keep it working, and if we keep it in it's going to continue to waste time over the coming years (for example, in the short term, it will waste my time reviewing it).
While one thread holds the GIL, any other thread can call PyObject_Free with a pointer that was returned by the system malloc.
What _was_ supported was more generally that any number of threads could call PyObject_Free with pointers that were returned by the system malloc/realloc at the same time as a single thread, holding the GIL, was doing anything whatsoever (including executing any code inside obmalloc.c) Although that's a misleading way of expressing the actual intent; more on that below.
The following situation is *not* supported:
While one thread holds the GIL, another thread calls PyObject_Free with a pointer that was returned by PyObject_Malloc.
Right, that was never supported (and I doubt it could be without introducing a new mutex in obmalloc.c).
I'm hoping that I got things a little better this time around. I've submitted my updated patch to the patch tracker. For reference, I've included links to SourceForge and the previous thread.
Thank you,
Thank you! I probably can't make time to review anything before this weekend. I will try to then. I expect it would be easier if you ripped out the horrid support for PyObject_Free abuse; in a sane world, the release-build PyMem_FREE, PyMem_Del, and PyMem_DEL would expand to "free" instead of to "PyObject_FREE" (via changes to pymem.h). IOW, it was never the _intent_ that people be able to call PyObject_Free without holding the GIL. The need for that came from a different problem, that old code sometimes mixed calls to PyObject_New with calls to PyMem_DEL (or PyMem_FREE or PyMem_Del). It's for that latter reason that PyMem_DEL (and its synonyms) were changed to expand to PyObject_Free. This shouldn't be supported anymore. Because it _was_ supported, there was no way to tell whether PyObject_Free was being called because (a) we were catering to long-obsolete but once-loved code that called PyMem_DEL while holding the GIL and with a pointer obtained by PyObject_New; or, (b) somebody was calling PyMem_Del (etc) with a non-object pointer they had obtained from PyMem_New, or from the system malloc directly. It was never legit to do #a without holding the GIL. It was clear as mud whether it was legit to do #b without holding the GIL. If PyMem_Del (etc) change to expand to "free" in a release build, then #b can remain clear as mud without harming anyone. Nobody should be doing #a anymore. If someone still is, "tough luck -- fix it, you've had years of warning" is easy for me to live with at this stage. I suppose the other consideration is that already-compiled extension modules on non-Windows(*) systems will, if they're not recompiled, continue to call PyObject_Free everywhere they had a PyMem_Del/DEL/FREE call. If such code is calling it without holding the GIL, and obmalloc.c stops trying to support this insanity, then they're going to grow some thread races they woudn't have if they did recompile (to get such call sites remapped to the system free). I don't really care about that either: it's a general rule that virtually all Python API functions must be called with the GIL held, and there was never an exception in the docs for the PyMem_ family. (*) Windows is immune simply because the Windows Python is set up in such a way that you always have to recompile extension modules when Python's minor version number (the j in i.j.k) gets bumped.
On Feb 15, 2005, at 17:52, Tim Peters wrote:
As I said before, I don't think we need to support this any more. More, I think we should not -- the support code is excruciatingly subtle, it wasted plenty of your time trying to keep it working, and if we keep it in it's going to continue to waste time over the coming years (for example, in the short term, it will waste my time reviewing it).
I do not have nearly enough experience in the Python world to evaluate this decision. I've only been programming in Python for about two years now, and as I am sure you are aware, this is my first patch that I have submitted to Python. I don't really know my way around the Python internals, beyond writing basic extensions in C. Martin's opinion is clearly the opposite of yours. Basically, the debate seems to boil down to maintaining backwards compatibility at the cost of making the code in obmalloc.c harder to understand. The particular case that is being supported could definitely be viewed as a "bug" in the code that using obmalloc. It also likely is quite rare. However, until now it has been supported, so it is hard to judge exactly how much code would be affected. It would definitely be a minor barrier to moving to Python 2.5. Is there some sort of consensus that is possible on this issue?
While one thread holds the GIL, any other thread can call PyObject_Free with a pointer that was returned by the system malloc. What _was_ supported was more generally that
any number of threads could call PyObject_Free with pointers that were returned by the system malloc/realloc
at the same time as
a single thread, holding the GIL, was doing anything whatsoever (including executing any code inside obmalloc.c)
Okay, good, that is what I have assumed.
Although that's a misleading way of expressing the actual intent; more on that below.
That's fine. It may be a misleading description of the intent, but it is an accurate description of the required behaviour. At least I hope it is.
I expect it would be easier if you ripped out the horrid support for PyObject_Free abuse; in a sane world, the release-build PyMem_FREE, PyMem_Del, and PyMem_DEL would expand to "free" instead of to "PyObject_FREE" (via changes to pymem.h).
It turns out that basically the only thing that would change would be removing the "volatile" specifiers from two of the global variables, plus it would remove about 100 lines of comments. :) The "work" was basically just hurting my brain trying to reason about the concurrency issues, not changing code.
It was never legit to do #a without holding the GIL. It was clear as mud whether it was legit to do #b without holding the GIL. If PyMem_Del (etc) change to expand to "free" in a release build, then #b can remain clear as mud without harming anyone. Nobody should be doing #a anymore. If someone still is, "tough luck -- fix it, you've had years of warning" is easy for me to live with at this stage.
Hmm... The issue is that case #a may not be an easy problem to diagnose: Some implementations of free() will happily do nothing if they are passed a pointer they know nothing about. This would just result in a memory leak. Other implementations of free() can output a warning or crash in this case, which would make it trivial to locate.
I suppose the other consideration is that already-compiled extension modules on non-Windows(*) systems will, if they're not recompiled, continue to call PyObject_Free everywhere they had a PyMem_Del/DEL/FREE call.
Is it guaranteed that extension modules will be binary compatible with future Python releases? I didn't think this was the case. Thanks for the feedback, Evan Jones
[Tim Peters]
As I said before, I don't think we need to support this any more. More, I think we should not -- the support code is excruciatingly subtle, it wasted plenty of your time trying to keep it working, and if we keep it in it's going to continue to waste time over the coming years (for example, in the short term, it will waste my time reviewing it).
[Evan Jones]
I do not have nearly enough experience in the Python world to evaluate this decision. I've only been programming in Python for about two years now, and as I am sure you are aware, this is my first patch that I have submitted to Python. I don't really know my way around the Python internals, beyond writing basic extensions in C. Martin's opinion is clearly the opposite of yours.
? This is all I recall Martin saying about this: http://mail.python.org/pipermail/python-dev/2005-January/051265.html I'm not certain it is acceptable to make this assumption. Why is it not possible to use the same approach that was previously used (i.e. leak the arenas array)? Do you have something else in mind? I'll talk with Martin about it if he still wants to. Martin, this miserable code must die!
Basically, the debate seems to boil down to maintaining backwards compatibility at the cost of making the code in obmalloc.c harder to understand.
The "let it leak to avoid thread problems" cruft is arguably the single most obscure bit of coding in Python's code base. I created it, so I get to say that <wink>. Even 100 lines of comments aren't enough to make it clear, as you've discovered. I've lost track of how many hours of my life have been pissed away explaining it, and its consequences (like how come this or that memory-checking program complains about the memory leak it causes), and the historical madness that gave rise to it in the beginning. I've had enough of it -- the only purpose this part ever had was to protect against C code that wasn't playing by the rules anyway. BFD. There are many ways to provoke segfaults with C code that breaks the rules, and there's just not anything that special about this way _except_ that I added objectionable (even at the time) hacks to preserve this kind of broken C code until authors had time to fix it. Time's up.
The particular case that is being supported could definitely be viewed as a "bug" in the code that using obmalloc. It also likely is quite rare. However, until now it has been supported, so it is hard to judge exactly how much code would be affected.
People spent many hours searching for affected code when it first went in, and only found a few examples then, in obscure extension modules. It's unlikely usage has grown. The hack was put it in for the dubious benefit of the few examples that were found then.
It would definitely be a minor barrier to moving to Python 2.5.
That's in part what python-dev is for. Of course nobody here has code that will break -- but the majority of high-use extension modules are maintained by people who read this list, so that's not as empty as it sounds. It's also what alpha and beta releases are for. Fear of change isn't a good enough reason to maintain this code.
Is there some sort of consensus that is possible on this issue?
Absolutely, provided it matches my view <0.5 wink>. Rip it out, and if alpha/beta testing suggests that's a disaster, _maybe_ put it back in. ...
It turns out that basically the only thing that would change would be removing the "volatile" specifiers from two of the global variables, plus it would remove about 100 lines of comments. :) The "work" was basically just hurting my brain trying to reason about the concurrency issues, not changing code.
And the brain of everyone else who ever bumps into this. There's a high probability that if this code actually doesn't work (can you produce a formal proof of correctness for it? I can't -- and I tried), nothing can be done to repair it; and code this outrageously delicate has a decent chance of being buggy no matter how many people stare at it (overlooking that you + me isn't that many). You also mentioned before that removing the "volatile"s may have given a speed boost, and that's believable. I mentioned above the unending costs in explanations, and nuisance gripes from memory-integrity tools about the deliberate leaks. There are many kinds of ongoing costs here, and no _intended_ benefit anymore (it certainly wasn't my intent to cater to buggy C code forever).
It was never legit to do #a without holding the GIL. It was clear as mud whether it was legit to do #b without holding the GIL. If PyMem_Del (etc) change to expand to "free" in a release build, then #b can remain clear as mud without harming anyone. Nobody should be doing #a anymore. If someone still is, "tough luck -- fix it, you've had years of warning" is easy for me to live with at this stage.
Hmm... The issue is that case #a may not be an easy problem to diagnose:
Many errors in C code are difficult to diagnose. That's life. Mixing a PyObject call with a PyMem call is obvious now "by eyeball", so if there is such code still out there, and it blows up, an experienced eye has a good chance of spotting the error at once. '
Some implementations of free() will happily do nothing if they are passed a pointer they know nothing about. This would just result in a memory leak. Other implementations of free() can output a warning or crash in this case, which would make it trivial to locate.
I expect most implementations of free() would end up corrupting memory state, leading to no symptoms or to disastrous symptoms, from 0 to a googol cycles after the mistake was made. Errors in using malloc/free are often nightmares to debug. We're not trying to make coding in C pleasant here -- which is good, because that's unachievable <wink>.
I suppose the other consideration is that already-compiled extension modules on non-Windows(*) systems will, if they're not recompiled, continue to call PyObject_Free everywhere they had a PyMem_Del/DEL/FREE call.
Is it guaranteed that extension modules will be binary compatible with future Python releases? I didn't think this was the case.
Nope, that's not guarantfeed. There's a magic number (PYTHON_API_VERSION) that changes whenever the Python C API undergoes an incompatible change, and binary compatibility is guaranteed across releases if that doesn't change. The then-current value of PYTHON_API_VERSION gets compiled into extensions, by virtue of the module-initialization macro their initialization function has to call. The guts of that function are in the Python core (Py_InitModule4()), which raises this warning if the passed-in version doesn't match the current version: "Python C API version mismatch for module %.100s:\ This Python has API version %d, module %.100s has version %d."; This is _just_ a warning, though. Perhaps unfortunately for Python's users, Guido learned long ago that most API mismatches don't actually matter for his own code <wink>. For example, the C API officially changed when the signature of PyFrame_New() changed in 2001 -- but almost no extension modules call that function. Similarly, if we change PyMem_Del (etc) to map to the system free(), PYTHON_API_VERSION should be bumped for Python 2.5 -- but many people will ignore the mismatch warning, and again it will probably make no difference (if there's code still out there that calls PyMem_DEL (etc) without holding the GIL, I don't know about it).
Tim Peters wrote:
I'm not certain it is acceptable to make this assumption. Why is it not possible to use the same approach that was previously used (i.e. leak the arenas array)?
Do you have something else in mind? I'll talk with Martin about it if he still wants to. Martin, this miserable code must die!
That's fine with me. I meant what I said: "I'm not certain". The patch original claimed that it cannot possibly preserve this feature, and I felt that this claim was incorrect - indeed, Evan then understood the feature, and made it possible. I can personally accept breaking the code that still relies on the invalid APIs. The only problem is that it is really hard to determine whether some code *does* violate the API usage. Regards, Martin
Sorry for taking so long to get back to this thread, it has been one of those weeks for me. On Feb 16, 2005, at 2:50, Martin v. Löwis wrote:
Evan then understood the feature, and made it possible.
This is very true: it was a very useful exercise.
I can personally accept breaking the code that still relies on the invalid APIs. The only problem is that it is really hard to determine whether some code *does* violate the API usage.
Great. Please ignore the patch on SourceForge for a little while. I'll produce a "revision 3" this weekend, without the compatibility hack. Evan Jones
Evan Jones wrote:
Sorry for taking so long to get back to this thread, it has been one of those weeks for me.
On Feb 16, 2005, at 2:50, Martin v. Löwis wrote:
Evan then understood the feature, and made it possible.
This is very true: it was a very useful exercise.
I can personally accept breaking the code that still relies on the invalid APIs. The only problem is that it is really hard to determine whether some code *does* violate the API usage.
Great. Please ignore the patch on SourceForge for a little while. I'll produce a "revision 3" this weekend, without the compatibility hack.
Evan uploaded the newest version (@ http://www.python.org/sf/1123430) and he says it is "complete". Assuming a code review says the patch is sane, do we want to go with this garbage collection change? From past discussions I don't remember a consensus on acceptance or rejection, just lots of discussion about ripping out the hacks to allow freeing memory w/o holding the GIL (I assume Evan's patch rips that code out). -Brett
Brett C. wrote:
Assuming a code review says the patch is sane, do we want to go with this garbage collection change? From past discussions I don't remember a consensus on acceptance or rejection, just lots of discussion about ripping out the hacks to allow freeing memory w/o holding the GIL (I assume Evan's patch rips that code out).
I think the consensus is that the feature is desirable. So if the code is correct, it should be checked in. Regards, Martin
On Mar 4, 2005, at 23:55, Brett C. wrote:
Evan uploaded the newest version (@ http://www.python.org/sf/1123430) and he says it is "complete".
I should point out (very late! It's been a busy couple of weeks) that I fully expect that there may be some issues with the patch that will arise when it is reviewed. I am still lurking on Python-Dev, and I will strive to correct any defects ASAP. A few users have contacted me privately and have tested the patch, so it works for a few people at least. Evan Jones
participants (4)
-
"Martin v. Löwis"
-
Brett C.
-
Evan Jones
-
Tim Peters