Recall that pymalloc delegates requests for "not small" blocks to the system malloc. This means that when pymalloc's free() is called, it has to figure out whether the address passed to it was one it allocated itself, or came from the system malloc. It uses 64 bits of magical info to determine this, based in part on address calculation. Vladimir claimed, and I agreed, that the chance of making a mistake here was insignificant. However, the question remained whether a hostile user could study the source code to mount a successful attack. Alas, it turns out that's easy.

The attached was my first brute-force try at tricking pymalloc via pure Python code, and should break it quickly (within a few seconds) on any 32-bit little-endian box. It usually dies with a memory fault. I've also seen it trigger a bogus "unhashable type" error, and, most worrisome, a "SystemError: unknown opcode" error (showing it's possible to mutate bytecode via driving pymalloc insane). I ran this on Win98SE, and it seems it's also possible to provoke pymalloc into convincing the OS that the computer's modem is permanently busy (! of course a reboot fixed that).

Given that there are other ways to provoke Python into crashing (and some much easier than this way), I don't know how seriously to take this. Historically, I'm usually hugging the end of the "who cares?" scale (I'm not sure I've ever seen a system I couldn't crash with a little effort -- and I have no idea how to stop me, let alone someone who's really keen to do damage). Please fight amongst yourselves <wink>.
[Tim]
Recall that pymalloc delegates requests for "not small" blocks to the system malloc. This means that when pymalloc's free() is called, it has to figure out whether the address passed to it was one it allocated itself, or came from the system malloc. ... However, the question remained whether a hostile user could study the source code to mount a successful attack. Alas, it turns out that's easy.
There may be hope for this, and another strong reason for pursuing it. Guido pointed out that *if* pymalloc's free() could know for sure whether an address passed to it did or didn't come from the system malloc, then we could also (probably -- needs some more thought) continue to use the PyObject interface, *without* breaking most "illegal by fiat" old code, via also mapping PyMem_{Free, FREE, Del, DEL} to the pymalloc free (when pymalloc is enabled): then the various PyMem_NukeIt spellings would *know* whether to pass a piece of memory on to the system free().

So I have a variant of pymalloc that doesn't use magic cookies -- it knows "for sure", by maintaining a sorted (by address) contiguous vector of the base addresses of the (256KB each) "arenas" allocated by pymalloc. A given address A either does or doesn't satisfy B <= A < B+256KB for some base address B in this list. At worst, a binary search is needed to see whether any such B exists.

For a typical 2**31 byte VM address space, we can get at most 2**31/2**18 == 2**13 arenas, so at worst this vector can grow to a measly 8K entries (typically far fewer), and 13 is a reasonable upper bound on the number of compares needed. So it can't be a disaster, but neither is it dirt cheap.

A promising candidate for saving grace: in tests so far, frees tend to hit the same arena many times in a row. By saving a search finger into the vector that remembers the last hit index, and checking that location first, I'm seeing hit rates over 85%. The two-sided range test at the finger index can be inlined, and the two compares it requires are exactly as many compares as the current magic-cookie tests do. So, most of the time, it may run as fast as now.

For an address pymalloc *doesn't* handle on its own, though, a full binary search is required to discover that it's not a pymalloc address. Leaving at least one magic-cookie gimmick in could slash that (if the magic cookie isn't there, we know for sure it's not a pymalloc address).
So the test becomes something like

    def PyMalloc_Free(thisaddr):
        # ... weed out NULL ...
        # not NULL
        if (addrs[finger] <= thisaddr < addrs[finger] + 256KB or
            (thisaddr_has_magic_cookie and binary_search_for(thisaddr))):
            it's a pymalloc addr
        else:
            it goes to the system free()

That gives a likely fast path for each class of address, and should never err (provided we're passed legitimate addresses, and user code doesn't write out of bounds, etc).

A nasty subtlety: if PyMem_NukeIt is called when the GIL isn't held, it's going to take a lot of thought to see whether this can be made safe when called simultaneously by multiple threads; e.g., the finger can change "mid-stream" then, and, worse, some thread that *does* hold the GIL may legitimately cause a new arena to be allocated midstream, mutating the address vector during the search. This may be intractable enough to kill the idea of mapping PyMem_XXX to this.

A nasty trick: due to the wonders of unsigned arithmetic, the two-sided range test can be reduced to one compare:

    (Py_uintptr)thisaddr - (Py_uintptr)addrs[finger] < 256KB

In other news, "it's obvious" that the small object threshold is too small for 2.3. I'm still inclined to max it out.
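[Editor's sketch: the dispatch Tim describes can be modeled in pure Python, using integers to stand in for C pointers. The class and method names here are invented for illustration; only ARENA_SIZE, the finger, the binary search, and the unsigned one-compare range trick come from the discussion above. The 32-bit wraparound is simulated with a mask.]

```python
import bisect

ARENA_SIZE = 256 * 1024
MASK32 = 0xFFFFFFFF          # simulate 32-bit unsigned (Py_uintptr) arithmetic

class ArenaIndex:
    def __init__(self):
        self.bases = []      # sorted base addresses of 256KB arenas
        self.finger = 0      # index of the arena hit most recently

    def add_arena(self, base):
        bisect.insort(self.bases, base)

    def contains(self, addr):
        # Fast path: the two-sided range test at the finger, folded into
        # one compare via unsigned wraparound (the "nasty trick" above).
        if self.bases:
            base = self.bases[self.finger]
            if (addr - base) & MASK32 < ARENA_SIZE:
                return True
        # Slow path: binary search for a base B with B <= addr < B + 256KB.
        i = bisect.bisect_right(self.bases, addr) - 1
        if i >= 0 and self.bases[i] <= addr < self.bases[i] + ARENA_SIZE:
            self.finger = i
            return True
        return False

idx = ArenaIndex()
idx.add_arena(0x40000)
idx.add_arena(0x100000)
assert idx.contains(0x40000)                   # first byte of an arena
assert idx.contains(0x100000 + 12345)          # interior of the second arena
assert not idx.contains(0x40000 - 1)           # just below an arena base
assert not idx.contains(0x40000 + ARENA_SIZE)  # one past the end
```

Note how the wraparound makes the fast path a single compare: when addr is below the finger's base, the masked subtraction produces a huge unsigned value, which fails the `< ARENA_SIZE` test just as an out-of-range high address does.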
Some wild-ass thoughts:

On Fri, 29 Mar 2002, Tim Peters wrote:
A nasty subtlety: if PyMem_NukeIt is called when the GIL isn't held, it's going to take a lot of thought to see whether this can be made safe when called simultaneously by multiple threads; e.g., the finger can change "mid-stream"
Then make multiple thread-specific fingers, which will presumably result in higher hit rates due to better locality. To prevent fingers from being invalidated, do not remove arenas that are deallocated from the tree -- just mark them inactive.
then, and, worse, some thread that *does* hold the GIL may legitimately cause a new arena to be allocated midstream, mutating the address vector during the search. This may be intractable enough to kill the idea of mapping PyMem_XXX to this.
Why not use a smaller (non-global) lock to protect arena modification and finger flushing? Ideally, the lock should be cheap for readers (which I presume will be the vast majority of accesses) and relatively expensive for writers (which I assume will occur only when arenas are added or deactivated).

-Kevin

--
Kevin Jacobs
The OPAL Group - Enterprise Systems Architect
Voice: (216) 986-0710 x 19   E-mail: jacobs@theopalgroup.com
Fax: (216) 986-0714          WWW: http://www.theopalgroup.com
[Kevin Jacobs]
Then make multiple thread-specific fingers, which will presumably result in higher hit rates due to better locality.
But at the cost of significantly slowing every test. Python has no efficient TLS abstraction: we'd have to call get_thread_ident() every time, then look up the finger in a dict devoted to this purpose. All that is likely as expensive as a binary search.
To prevent fingers from being invalidated, do not remove arenas that are deallocated from the tree -- just mark them inactive.
That part's already done, except we save time by not bothering to mark them inactive <wink>. That is, pymalloc has never returned arenas to the system, and still doesn't.
... Why not use a smaller (non-global) lock to protect arena modification and finger flushing?
I don't know what "finger flushing" means. Note that Python has no efficient x-platform lock abstraction either. Note that every suggestion you dream up is competing with a gimmick that's *very* fast now in typical cases. If you can't picture the exact machine instructions generated and hold them in your head all at once with ease, it's probably too expensive to consider.
There may be hope for this, and another strong reason for pursuing it. [good stuff snipped]
A nasty subtlety: if PyMem_NukeIt is called when the GIL isn't held, it's going to take a lot of thought to see whether this can be made safe when called simultaneously by multiple threads; e.g., the finger can change "mid-stream" then, and, worse, some thread that *does* hold the GIL may legitimately cause a new arena to be allocated midstream, mutating the address vector during the search. This may be intractable enough to kill the idea of mapping PyMem_XXX to this.
I know of two places that call PyMem_Malloc outside the GIL: PyOS_StdioReadline in Parser/myreadline.c and call_readline() in Modules/readline.c. The memory thus allocated is returned by PyOS_Readline() (which calls either function through a hook pointer), and both its callers (the tokenizer and raw_input()) free the result using PyMem_DEL or PyMem_FREE (these two seem to be used synonymously). The tokenizer code may also apply PyMem_RESIZE to it (it incorporates the input line in its buffer structure).

This would have to be fixed by changing the allocations to use malloc() (as they did up to 1.5.2 :-) and by changing the consumers to use free() and realloc(). (An alternative would be to let PyOS_Readline() copy the memory to PyMem_Malloc'ed memory.)

This is part of a "hook" API that allows 3rd parties to provide their own alternative to PyOS_Readline. This was put in at the request of some folks at LLNL who were providing their own GUI that fed into Python and who had some problems with sending it to stdin. I don't think anybody else has used this. There is not a single mention of PyOS_Readline in the entire set of Python documentation.

Given the alternatives:

1. introduce new APIs PyMalloc_{New,Del}Object and tell all extension writers that they have to change their extensions once again to use these brand new APIs if they want to benefit from pymalloc; or

2. fix the issues with PyOS_Readline, make PyMem_{DEL,FREE,Del,Free} synonyms for Tim's clever PyMem_NukeIt, and continue to promote PyObject_{New,Del} for use by extension writers;

I'm all for #2.

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido, points out a bunch of memory mgmt problems with myreadline.c and PyOS_StdioReadline, which I might understand if I devoted April to it <0.9 wink>] I believe the last suggestion I posted gets much closer to allowing any mixture of PyMem_XXX with platform malloc/realloc/free without thread issues. Seems the only trouble can come if a PyObject_{New, NewVar, Malloc, Realloc} needs to allocate a new arena *and* needs to grow the vector of arena base addresses, at the same time some jackass calls PyMem_{DEL, Del, FREE, Free} in another thread without holding the GIL. There's a tiny window of vulnerability then in the latter guy, as the base-address vector may shift in memory while the vector is growing, leaving the latter guy indexing into suddenly-stale memory. This would have to happen in a window of about 3 machine instructions to do any harm, at the same time the base-address vector is moving. Yuck.
I know of two places that call PyMem_Malloc outside the GIL: PyOS_StdioReadline in Parser/myreadline.c and call_readline() in Modules/readline.c. The memory thus allocated is returned by PyOS_Readline() (which calls either function through a hook pointer), and both its callers (the tokenizer and raw_input()) free the result using PyMem_DEL or PyMem_FREE (these two seem to be used synonymously). The tokenizer code may also apply PyMem_RESIZE to it (it incorporates the input line in its buffer structure).
This would have to be fixed by changing the allocations to use malloc() (as they did up to 1.5.2 :-) and by changing the consumers to use free() and realloc(). (An alternative would be to let PyOS_Readline() copy the memory to PyMem_Malloc'ed memory.)
This is the kind of heroic effort I don't want to impose on users: you have encyclopedic knowledge of how the Python implementation may be abusing this stuff, and you *invented* the rules <wink>. Random extension authors are going to have a much harder time of it -- as far as they're concerned, PyMem_{DEL, FREE, Del, Free} are all just ways to spell "platform free(), but I'm not supposed to call free() directly for some reason I don't understand -- I think it might have had something to do with DLLs on Windows".
This is part of a "hook" API that allows 3rd parties to provide their own alternative to PyOS_Readline. This was put in at the request of some folks at LLNL who were providing their own GUI that fed into Python and who had some problems with sending it to stdin. I don't think anybody else has used this. There is not a single mention of PyOS_Readline in the entire set of Python documentation.
Well, neither is there any mention of possibly abusive functions in hundreds of extension modules we've never heard of.
Given the alternatives:
1. introduce new APIs PyMalloc_{New,Del}Object and tell all extension writers that they have to change their extensions once again to use these brand new APIs if they want to benefit from pymalloc; or
2. fix the issues with PyOS_Readline, make PyMem_{DEL,FREE,Del,Free} synonyms for Tim's clever PyMem_NukeIt, and continue to promote PyObject_{New,Del} for use by extension writers;
I'm all for #2.
You're presenting #1 as user-hostile and #2 as user-friendly. But if part of #2 is also saying that it's now definitely illegal to call PyMem_{Free, FREE, Del, DEL} without holding the GIL, and horrible things may start to happen in 2.3 if you're doing so, then it's also user-hostile in that respect. #1 is user-friendly in the "nothing breaks" sense. I haven't given up on combining the best of both, but I am getting close <wink>.
[Tim]
This is the kind of heroic effort I don't want to impose on users: you have encyclopedic knowledge of how the Python implementation may be abusing this stuff, and you *invented* the rules <wink>. Random extension authors are going to have a much harder time of it -- as far as they're concerned, PyMem_{DEL, FREE, Del, Free} are all just ways to spell "platform free(), but I'm not supposed to call free() directly for some reason I don't understand -- I think it might have had something to do with DLLs on Windows".
On the other hand, the less sophisticated extension writers never release the GIL at all.

[me]
Given the alternatives:
1. introduce new APIs PyMalloc_{New,Del}Object and tell all extension writers that they have to change their extensions once again to use these brand new APIs if they want to benefit from pymalloc; or
2. fix the issues with PyOS_Readline, make PyMem_{DEL,FREE,Del,Free} synonyms for Tim's clever PyMem_NukeIt, and continue to promote PyObject_{New,Del} for use by extension writers;
I'm all for #2.
You're presenting #1 as user-hostile and #2 as user-friendly. But if part of #2 is also saying that it's now definitely illegal to call PyMem_{Free, FREE, Del, DEL} without holding the GIL, and horrible things may start to happen in 2.3 if you're doing so, then it's also user-hostile in that respect. #1 is user-friendly in the "nothing breaks" sense. I haven't given up on combining the best of both, but I am getting close <wink>.
All I can do is encourage you to keep trying. I find it quite encouraging that you found a constant-time test for non-arena memory at all. Making it thread-safe should only add a tiny constant time. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum <guido@python.org> writes:
All I can do is encourage you to keep trying. I find it quite encouraging that you found a constant-time test for non-arena memory at all. Making it thread-safe should only add a tiny constant time.
Are we discussing using PyMalloc even when not holding the GIL? That can't possibly work: pymalloc maintains linked lists of free memory; if two threads allocate or deallocate objects in the same pool simultaneously, bad things will happen.

Regards,
Martin
[martin@v.loewis.de]
Are we discussing using PyMalloc even when not holding the GIL? That can't possibly work: pymalloc maintains linked lists of free memory; if two allocate or deallocate objects in the same pool simultaneously, bad things will happen.
I checked in obmalloc changes sufficient so that PyObject_XXX *could* be changed to use pymalloc again, and not break if a PyObject "get memory" function is mixed with a PyMem "free memory" function. This requires (in part) that all the latter be redirected to use the (new, as of a couple hours ago) pymalloc free.

This leaves one known hole for backward compatibility: some people call a PyMem "get memory" routine when not holding the GIL. That's still no problem, because the PyMem "get memory" routines would still call the system malloc without pymalloc getting involved.

Some people also call a PyMem "free memory" routine when not holding the GIL, but only on memory that was obtained from a PyMem "get memory" routine. That's the potential problem. The PyMem "free memory" routines would invoke pymalloc's free, and the mere act of *checking* the address passed to PyMem free() is vulnerable if the GIL isn't held.

It's not obvious how it's vulnerable, so I'll spell it out: Some thread that does hold the GIL and simultaneously calls a PyObject "get memory" routine ends up in the pymalloc malloc, and *if* that needs to allocate a new arena, and *if* that in turn requires growing the vector of arena base addresses, then there appears to be no way (short of slopping locks on everything) to guarantee that the *other* thread (in pymalloc free) won't pick up an address for the arena-base-address vector that's invalid by the time it gets around to indexing into it.

It's an extremely small hole (just a couple machine instructions), and requires an extremely unlikely set of circumstances to make it possible to fall into it, but it is a real hole all the same. There wouldn't be a hole if the vector of arena base addresses were statically allocated.
----- Original Message -----
From: "Tim Peters" <tim.one@comcast.net>
There wouldn't be a hole if the vector of arena base addresses were statically allocated.
This could be a reason to use the 2-level arena vector we were discussing earlier. I was just assuming you wouldn't ever want to reallocate it or move anything that big... -Dave
There wouldn't be a hole if the vector of arena base addresses were statically allocated.
[David Abrahams]
This could be a reason to use the 2-level arena vector we were discussing earlier. I was just assuming you wouldn't ever want to reallocate it or move anything that big...
A 2-level vector would complicate the code in ways that slow it down in the usual cases, so is unattractive. These vectors are generally *small*: on a 32-bit box, 1024 bytes of base-address vector is enough for 1024/4 = 256 base addresses, which is enough to cover 2**8 * 2**18 = 64 megabytes of arena memory. Most programs will never go above that, and a one-time cost of moving 1KB per 64MB is trivial. It's quite attractive to just let the old vectors leak.
It's quite attractive to just let the old vectors leak.
Yes. If you double the vector's size each time it needs to be moved, the leakage is a small percentage of allocated memory. --Guido van Rossum (home page: http://www.python.org/~guido/)
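[Editor's sketch: Guido's bound can be checked numerically. If the vector doubles each time it moves and every abandoned copy leaks, the leakage is a geometric series that stays below the size of the live vector. The starting size here is a made-up example; only the doubling rule and the 8K-entry worst case come from the thread.]

```python
size = 16                 # hypothetical initial vector size in bytes
leaked = 0
while size < 8192:        # grow toward the 8K-entry worst case Tim mentions
    leaked += size        # old vector is abandoned (leaked), not freed
    size *= 2
assert leaked == size - 16   # geometric series: total leakage < final size
assert leaked < size
```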
Heh -- letting old vectors leak may still not be good enough. Martin posted the relevant code:

    ((I) < narenas && (uptr)(P) - arenas[I] < (uptr)ARENA_SIZE)

The problem remaining is that narenas can get out of synch with arenas, and if narenas refers to a later arena size, arenas[I] may be an out-of-bounds reference. This may be a good reason to make "narenas" and "arenas" volatile, in order to ensure that the former is read up before the latter.
[martin@v.loewis.de]
It would not be vulnerable if you would not free the old arena list, right?
Bingo! That occurred to me when I woke up. As I went to sleep, I was picturing an elaborate scheme of saving the "last N" vectors, and recycling them in tricky ways. Overnight, sleep simplified the recycling scheme into an empty statement <wink>.
Although I'd declare arenas as volatile...
Why? Thread problems aren't solved by superstition, and that's what adding "volatile" *usually* is. It's generally no more of a cure than adding a sleep(). More on that in my reply to Guido.
It's not obvious how it's vulnerable, so I'll spell it out: Some thread that does hold the GIL and simultaneously calls a PyObject "get memory" routine ends up in the pymalloc malloc, and *if* that needs to allocate a new arena, and *if* that in turn requires growing the vector of arena base addresses, then there appears to be no way (short of slopping locks on everything) to guarantee that the *other* thread (in pymalloc free) won't pick up an address for the arena-base-address vector that's invalid by the time it gets around to indexing into it. It's an extremely small hole (just a couple machine instructions), and requires an extremely unlikely set of circumstances to make it possible to fall into it, but it is a real hole all the same.
How about if the PyMem_Free guy saved the address of the vector before using it, and checked that it was still the same afterwards, *and* if the PyMem_Malloc guy didn't use realloc to resize the vector but copied it to a newly malloc'ed vector, stored the new vector's address, and then freed the old vector? (I guess this is the same as what Martin suggested. His point that the global holding the vector address should be declared volatile is a good one.) --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum <guido@python.org> writes:
How about if the PyMem_Free guy saved the address of the vector before using it, and checked that it was still the same afterwards, *and* if the PyMem_Malloc guy didn't use realloc to resize the vector but copied it to a newly malloc'ed vector, stored the new vector's address, and then freed the old vector?
That doesn't really help. The code currently does

    ((I) < narenas && (uptr)(P) - arenas[I] < (uptr)ARENA_SIZE)

So if the PyMem_Free thread blocks (yields by OS command) after fetching arenas, but before fetching arenas[i], then the PyMem_Malloc thread could still free the memory under it.

Regards,
Martin
[Guido]
How about if the PyMem_Free guy saved the address of the vector before using it, and checked that it was still the same afterwards, *and* if the PyMem_Malloc guy didn't use realloc to resize the vector but copied it to a newly malloc'ed vector, stored the new vector's address, and then freed the old vector?
It already does the latter (it deliberately avoided realloc all along -- an even stronger reason to avoid realloc is that if the realloc failed, we'd lose the original vector of base addresses entirely and so would have to kill the program; as is, it can continue gracefully).
(I guess this is the same as what Martin suggested.
He suggested letting the old vector *leak*, which is the same idea God gave me in my sleep <wink>.
His point that the global holding the vector address should be declared volatile is a good one.)
I see no point to that. If someone can explain *why* it's a good idea, sure, but I don't see it. Let's simplify things enormously: p is a pointer to "something".

    Thread A contains    *p
    Thread B contains    free(p)

Without a lock, they're screwed, period. Thread A has to first read up p, and then dereference it, and all the volatiles in the universe can't stop free(p) from happening *between* thread A's two machine instructions. It doesn't matter either if Thread B does

    oldp = p;
    p = newp;
    free(oldp);

because thread A can still pick up "p" before the "p = newp" line. Indeed, the current obmalloc code is exactly like that, and it's not safe (although I bet if I left it alone, the hole wouldn't be hit in my lifetime!). Neither would it do any real good for Thread A to do

    if (p == p && p == p && p == p && wink)
        ... *p ...

any number of times, volatile or not. Nothing short of a pair of critical sections is a genuine cure for a genuine race.

Letting the old vector leak is fine; in most programs, it won't lose more than about 1KB of memory total. Note that there are obscure reasons for why it's OK to reference a stale vector in the cases where a race is possible (this is already explained in obmalloc.c comments, so I won't repeat that here).
[Guido]
On the other hand, the less sophisticated extension writers never release the GIL at all.
That's an excellent point! I don't know how much to bank it, but it's semi-reassuring.
All I can do is encourage you to keep trying. I find it quite encouraging that you found a constant-time test for non-arena memory at all.
David Abrahams had the key insight there. Those C++ guys really do have some good ideas, if you can whittle away all the baffling syntax to get at the pure tiny thoughts they're thinking <wink>.
Making it thread-safe should only add a tiny constant time.
It does now, although it's balancing on a nanometer pinhead. That's good enough so long as nobody sneezes in the general direction of the code.
Tim Peters <tim.one@comcast.net> writes:
So I have a variant of pymalloc that doesn't use magic cookies -- it knows "for sure", by maintaining a sorted (by address) contiguous vector of the base addresses of the (256KB each) "arenas" allocated by pymalloc.
I would not like to see such a binary search performed. Instead, if anything needs to be done, I'd be in favour of using a constant-time algorithm, even if it means that a little more memory overhead is necessary.

I have the following idea: each chunk allocated by ought to have a pointer to its pool, immediately preceding the memory block. This will make an overhead of 4 bytes per allocation. Chunks allocated by the system allocator will have a null pointer preceding them.

To deal with alignment, the size classes would each increase by 4 bytes, so they would be spaced at 12, 20, 28, etc. bytes. With the 4-byte overallocation, each chunk would be 8-aligned, if the previous chunk in the same pool is. This approach would allow removing the requirement that each pool must be 4k-aligned.

To support the NULL pointer in a system-malloc'ed chunk, pymalloc would overallocate 8 bytes if it defers to system malloc, to preserve alignment.

What do you think?

Regards,
Martin
[martin@v.loewis.de]
I would not like to see such a binary search performed.
Instead, if anything needs to be done, I'd be in favour of using a constant-time algorithm, even if it means that a little more memory overhead is necessary.
I have the following idea: each chunk allocated by ought
                                                  ^
Allocated by what? Some crucial words are missing here. Duh <wink>.

Note that I am *not* proposing to change PyMem_{MALLOC, Malloc}: that continues to map directly to system malloc(), so we have no control over, nor even knowledge of, the memory allocated by PyMem_{MALLOC, Malloc}. We can't redirect those to pymalloc's malloc without introducing locks to prevent multithreaded insanity, or "declaring by fiat" that existing working code is now horribly broken in ways we can't help users detect.
to have a pointer to its pool, immediately preceding the memory block.
In what fundamental way does this differ from pymalloc's current scheme of storing the pool address at an address calculated *from* the pointer address? I suspect the real difference resides in the next point:
This will make an overhead of 4 bytes per allocation. Chunks allocated by the system allocator will have a null pointer preceding them.
That's the killer. We have no control over "the system allocator", by which I mean libc's malloc() and realloc(). As above, trying to hijack them is unattractive due to thread issues, even for just the PyMem_{Malloc, MALLOC, Realloc, REALLOC, New, NEW, Resize, RESIZE} spellings.
To deal with alignment, the size classes would each increase by 4 bytes, so they would be spaced at 12, 20, 28, etc. bytes. With the 4-byte overallocation, each chunk would be 8-aligned, if the previous chunk in the same pool is.
*If* we take over the system malloc, we need also to mimic the platform malloc's alignment. Pymalloc's object allocator can get away with 8-byte alignment because the core never requires worse-than-double alignment, and I'm willing to say that this is a language implementation restriction. I don't believe we can also restrict PyMem_umpteen_ways_to_spell_get_memory that way: they have always given platform-malloc alignment. If we could, I agree your scheme is fine, although I'm not sure why it stores a whole pointer when it's making a 1-bit distinction ("does or does not reside in a pymalloc pool").
This approach would allow removing the requirement that each pool must be 4k-aligned.
OK, that's why it really stores a whole pointer. Agreed then, but the memory lost to achieve page alignment once per-arena is likely small compared to adding 4 bytes to every allocation unit. The former loses at most 4KB per 256KB.
To support the NULL pointer in a system-malloc'ed chunk, pymalloc would overallocate 8 bytes if it defers to system malloc, to preserve alignment.
What do you think?
Primarily that hijacking the system malloc is fraught with new difficulties. If we *can* hijack it, so that pymalloc's free always knows that addresses passed to it came from pymalloc's malloc/realloc originally, then I agree it's easy to do something cheap to distinguish pool from non-pool memory. The scheme I sketched is as hairy as it is exactly because it does the right thing even when freeing an address obtained directly from the platform malloc/realloc.
Tim Peters <tim.one@comcast.net> writes:
Instead, if anything needs to be done, I'd be in favour of using a constant-time algorithm, even if it means that a little more memory overhead is necessary.
I have the following idea: each chunk allocated by ought
                                                  ^
Allocated by what? Some crucial words are missing here.
Sorry: each chunk allocated by pymalloc ought to have a pointer to its pool.
Note that I am *not* proposing to change PyMem_{MALLOC, Malloc}
I understand that (I think). Neither do I.
to have a pointer to its pool, immediately preceding the memory block.
In what fundamental way does this differ from pymalloc's current scheme of storing the pool address at an address calculated *from* the pointer address? I suspect the real difference resides in the next point:
This will make an overhead of 4 bytes per allocation. Chunks allocated by the system allocator will have a null pointer preceding them.
That's the killer. We have no control over "the system allocator", by which I mean libc's malloc() and realloc().
We sure do. In _PyMalloc_Malloc, replace the line

    return (void *)PyMem_MALLOC(nbytes);

with

    void **result;
    result = (void**)PyMem_MALLOC(nbytes+8);
    result[1] = NULL;
    return (result+2);

Likewise, replace

    offset = (off_t )p & POOL_SIZE_MASK;
    pool = (poolp )((block *)p - offset);
    if (pool->pooladdr != pool || pool->magic != (uint )POOL_MAGIC) {
        PyMem_FREE(p);
        return;
    }

with

    void **block = (void**)p;
    if (block[-1] == NULL) {
        PyMem_FREE(block-2);
        return;
    }
As above, trying to hijack them is unattractive due to thread issues, even for just the PyMem_{Malloc, MALLOC, Realloc, REALLOC, New, NEW, Resize, RESIZE} spellings.
I can't see how this approach would add thread issues.
*If* we take over the system malloc, we need also to mimic the platform malloc's alignment.
I'm not sure what you mean by "take over". It will still be used as an allocator for arenas, and for large requests.
Pymalloc's object allocator can get away with 8-byte alignment because the core never requires worse-than-double alignment, and I'm willing to say that this is a language implementation restriction. I don't believe we can also restrict PyMem_umpteen_ways_to_spell_get_memory that way
I'm not proposing anything like this.
This approach would allow removing the requirement that each pool must be 4k-aligned.
OK, that's why it really stores a whole pointer. Agreed then, but the memory lost to achieve page alignment once per-arena is likely small compared to adding 4 bytes to every allocation unit. The former loses at most 4KB per 256KB.
Indeed, this approach might be slightly more expensive. In return, it is safe and (time-)efficient. It is actually not that much more expensive: for 96-byte objects, you currently get 42 objects in a pool. With my change, 96-byte objects fall into the 100-byte size class, requiring 104 bytes per object and allowing for 39 objects in a pool, thus wasting 3*96 = 288 bytes per 4KB pool, or 18KB per 256KB. For 100-byte objects, nothing is wasted, since there is already 4 bytes of padding in pymalloc today.
Primarily that hijacking the system malloc is fraught with new difficulties.
That might be, but I'm not proposing this (or I don't understand the word "hijacking").
The scheme I sketched is as hairy as it is exactly because it does the right thing even when freeing an address obtained directly from the platform malloc/realloc.
That safe-guard isn't needed, IMO. Returning memory allocated directly through malloc (and not via _PyMalloc_Malloc) to _PyMalloc_Free is clearly a bug, and there are more efficient ways to detect this kind of bug (like your pymalloc debug mechanism). In addition, there are so many other kinds of bugs that this doesn't protect against (like not taking the GC header into account when releasing memory) that this alone isn't convincing. Regards, Martin
[Tim]
Note that I am *not* proposing to change PyMem_{MALLOC, Malloc}.
[martin@v.loewis.de]
I understand that (I think). Neither do I.
I think you have to under this scheme, although I'm not clear on what "the problem" is you're trying to solve. I'm trying to cater to allocating via PyMem_{MALLOC, Malloc} then free'ing via _PyMalloc_Free. That latter is because mixing PyObject_New with PyMem_{DEL, FREE, Del, Free} is frequent in real-life code now. If Guido is to get his wish that we keep the PyObject_XXX interface and not introduce a new PyMalloc_XXX object interface, and I'm to get my wish that we don't break mountains of existing works-fine code, that means PyObject_XXX *and* PyMem_{DEL, FREE, Del, Free} have to funnel thru the pymalloc package. Else existing "get object" and "free object" mixtures will break. That in turn means memory allocated by PyMem_{Malloc, MALLOC, Realloc, REALLOC, Resize, RESIZE, New, NEW} *also* has to be suitable for feeding into pymalloc's free. But I'm not proposing to change PyMem_{Malloc, MALLOC, Realloc, REALLOC, Resize, RESIZE, New, NEW} at all, meaning they'll continue to call platform malloc() and realloc() directly without any tricky wrappers.
That's the killer. We have no control over "the system allocator", by which I mean libc's malloc() and realloc().
We sure do. In _PyMalloc_Malloc, replace the line
This can't affect anything obtained via PyMem_{MALLOC, Malloc}, because they call the platform malloc() directly. We don't "wrap" them in any significant way.
return (void *)PyMem_MALLOC(nbytes);
with
void **result;
void **result;
result = (void **)PyMem_MALLOC(nbytes + 8);
result[1] = NULL;
return (result + 2);
Yes, _PyMalloc_Malloc can do anything at all, but PyMem_{MALLOC, Malloc} don't go thru this code.
Likewise, replace
offset = (off_t)p & POOL_SIZE_MASK;
pool = (poolp)((block *)p - offset);
if (pool->pooladdr != pool || pool->magic != (uint)POOL_MAGIC) {
    PyMem_FREE(p);
    return;
}
with
void **block = (void **)p;
if (block[-1] == NULL) {
    PyMem_FREE(block - 2);
    return;
}
Something allocated via PyMem_{MALLOC, Malloc} that goes through this code will read random gibberish (or carefully constructed hostile gibberish) at block[-1]. So it's not safe for what I'm trying to accomplish.
As above, trying to hijack them is unattractive due to thread issues, even for just the PyMem_{Malloc, MALLOC, Realloc, REALLOC, New, NEW, Resize, RESIZE} spellings.
I can't see how this approach would add thread issues.
That's because it seems you're not trying to make PyMem_{MALLOC, Malloc} safe for mixed use at all. If you did, I suppose they could funnel through a wrapper other than _PyMalloc_Malloc (at the cost of other complications).
*If* we take over the system malloc, we need also to mimic the platform malloc's alignment.
I'm not sure what you mean by "take over".
I mean stop PyMem_XXX from calling platform malloc/realloc/free directly, so that they can be made safe for mixing with _PyMalloc_Free.
It will still be used as an allocator for arenas, and for large requests.
... Indeed, this approach might be slightly more expensive. In return, it is safe and (time-)efficient.
It's safe under some set of assumptions I'm not sure about, but not safe enough under the assumptions I am sure about <wink>.
Primarily that hijacking the system malloc is fraught with new difficulties.
That might be, but I'm not proposing this (or I don't understand the word "hijacking").
It means "take over" <wink>. If a scheme can't handle free'ing memory safely when that memory was returned directly from the platform malloc/realloc, it can't handle the sum of the things Guido and I want here (he's keener to preserve PyObject_XXX; I'm keener not to break existing code, even code that doesn't play by what some subset of Python-Dev'vers thinks might be "the rules").
... That safe-guard isn't needed, IMO. Returning memory allocated directly through malloc (and not via _PyMalloc_Malloc) to _PyMalloc_Free is clearly a bug,
Then you either have to change PyMem_{Malloc, MALLOC, etc etc etc} too, or you have to tell extension authors that every extension type they wrote for 1.5.2 will now fail in horrible ways (data-dependent and intermittent segfaults, data corruption, etc), and tell users that every "old" extension they pick up on the web may suffer such problems too even if it compiles cleanly for 2.3.
and there are more efficient ways to detect this kind of bug (like your pymalloc debug mechanism). In addition, there are so many other kinds of bugs that this doesn't protect against (like not taking the GC header into account when releasing memory) that this alone isn't convincing.
The difference is that I don't consider mixing PyObject_XXX with PyMem_XXX to be "a bug". In 1.5.2 such mixing was necessary, and still works fine in 2.2. I worry about extension authors who *do* bother to fix code that works fine <wink>.
Tim Peters <tim.one@comcast.net> writes:
I think you have to under this scheme, although I'm not clear on what "the problem" is you're trying to solve. I'm trying to cater to allocating via PyMem_{MALLOC, Malloc} then free'ing via _PyMalloc_Free.
Is *that* the motivation? I thought you were trying to accommodate your killer.py where Python code can trick pymalloc into believing that it owned the memory which it actually had allocated by forwarding to system malloc (see the subject). Regards, Martin
[Tim]
I'm trying to cater to allocating via PyMem_{MALLOC, Malloc} then free'ing via _PyMalloc_Free.
[Martin]
Is *that* the motivation?
One of them.
I thought you were trying to accommodate your killer.py where Python code can trick pymalloc into believing that it owned the memory which it actually had allocated by forwarding to system malloc
That's another one. There are only two, so we can stop now <wink>.
(see the subject).
That came from the first msg in the thread. The second msg in the thread came a couple days later, and broadened the scope at its start: http://mail.python.org/pipermail/python-dev/2002-March/021905.html It was too easy to *just* fix killer.py <wink>.
After digesting some ideas from David Abrahams offlist, I believe I may have a much simpler way to make a bulletproof "is or isn't this address from a pymalloc pool?" test. Described as a diff from current pymalloc:

1. Keep a contiguous vector of arena base addresses. This is not sorted. When a new arena is allocated, its base address is simply appended. Space required is proportional to the # of arenas in use (4 bytes/arena on a 32-bit box; 8 bytes/arena on a 64-bit box).

2. Remove the "magic number" gimmick from pool headers.

3. In place of the pooladdr member of a pool header, add an arenaindex member. Every pool in an arena sets this to the index of its arena's base address, wrt the vector in #1.

4. To check an address p, find its pool header address just as now. Then read up arenaindex. If that's out of bounds for the #1 vector, it's not a pymalloc address. If it is in bounds, read the arena base address B out of the #1 vector, and see whether B <= p < B+256KB (which can again be collapsed into a tricksy one-compare test via exploiting unsigned arithmetic).

In #4, it's quite possible that non-pymalloc memory will lead to an in-bounds arenaindex -- but it can't pass the address test then.

Limitation: if arenaindex is 32 bits, we can index at most 2**32 arenas, covering at most 2**32 arenas * 2**18 bytes/arena = 2**50 bytes on a 64-bit box. At that point we'd be devoting 2**32 arenas * 8 bytes-for-address/arena = 32 gigabytes to the base-address vector. I assume the program will die long before this for some other good reason <wink>.
----- Original Message ----- From: "Tim Peters" <tim.one@comcast.net>
After digesting some ideas from David Abrahams offlist, I believe I may have a much simpler way to make a bulletproof "is or isn't this address from a pymalloc pool?" test. Described as a diff from current pymalloc:
1. Keep a contiguous vector of arena base addresses. This is not sorted. When a new arena is allocated, its base address is simply appended. Space required is proportional to the # of arenas in use (4 bytes/arena on a 32-bit box; 8 bytes/arena on a 64-bit box).
2. Remove the "magic number" gimmick from pool headers.
3. In place of the pooladdr member of a pool header, add an arenaindex member. Every pool in an arena sets this to the index of its arena's base address, wrt the vector in #1.
4. To check an address p, find its pool header address just as now. Then read up arenaindex. If that's out of bounds for the #1 vector, it's not a pymalloc address. If it is in bounds, read the arena base address B out of the #1 vector
Fine so far...
, and see whether B <= p < B + 256KB (which can again be collapsed into a tricksy one-compare test via exploiting unsigned arithmetic).
...isn't this part just a little too complicated? If I understand correctly, arenas are 4K aligned pages. Given an address, when you find its pool header, you either find a valid arena header that covers all 4K subsequent addresses, or some alien memory. I think you just have to look for the address of the pool header at the appropriate index in the vector. IOW, there should be no need to look at the address you're deallocating after finding its putative arena. -Dave
"David Abrahams" <david.abrahams@rcn.com> writes:
...isn't this part just a little too complicated? If I understand correctly, arenas are 4K aligned pages.
No, arenas are 256k areas which are split into 4k pools.
Given an address, when you find its pool header, you either find a valid arena header that covers all 4K subsequent addresses, or some alien memory. I think you just have to look for the address of the pool header at the appropriate index in the vector. IOW, there should be no need to look at the address you're deallocating after finding its putative arena.
The pool in question and the arena it belongs to may have different starting addresses (in 63 of 64 cases, they will be different). Of course, you could also build the table for pool addresses, assigning pool indices - but that would require 64 times as much memory. Regards, Martin
----- Original Message ----- From: "Martin v. Loewis" <martin@v.loewis.de>
"David Abrahams" <david.abrahams@rcn.com> writes:
...isn't this part just a little too complicated? If I understand correctly, arenas are 4K aligned pages.
No, arenas are 256k areas which are split into 4k pools.
Given an address, when you find its pool header, you either find a valid arena header that covers all 4K subsequent addresses, or some alien memory. I think you just have to look for the address of the pool header at the appropriate index in the vector. IOW, there should be no need to look at the address you're deallocating after finding its putative arena.
The pool in question and the arena it belongs to may have different starting addresses (in 63 of 64 cases, they will be different).
Of course, you could also build the table for pool addresses, assigning pool indices - but that would require 64 times as much memory.
OK, but I guess my question still holds: can't you just round down to find a supposed arena address, look up the index, and see if that arena is in the vector? -Dave
"David Abrahams" <david.abrahams@rcn.com> writes:
OK, but I guess my question still holds: can't you just round down to find a supposed arena address, look up the index, and see if that arena is in the vector?
Arenas are not aligned on 256k boundaries. Instead, they are aligned on 8-byte boundaries (or whatever else the system malloc returns); the first up-to-four-k is wasted to align the first pool in the arena. So to find the arena header when given a pool header, you'd have to know the index of the pool in the arena, which would be more complicated than Tim's computation. Regards, Martin
[David Abrahams]
OK, but I guess my question still holds: can't you just round down to find a supposed arena address,
No, and this is the point at which you get stuck. The arena base address is whatever the heck libc malloc returned when we asked for 256KB. There's no way to deduce that from the addresses *within* the arena.
look up the index, and see if that arena is in the vector?
It would be possible to store both the arena index *and* the arena base address in each pool header. Then we could check that

    poolheader *p = pool_address(some_memory_address);
    if (p->arenaindex < narenas &&
        arenas[p->arenaindex] == p->arenabase) {
        /* it's a pymalloc address */
    }
    else {
        /* it isn't */
    }

I like that. Thanks!
[Tim]
It would be possible to store both the arena index *and* the arena base address in each pool header. Then we could check that
poolheader *p = pool_address(some_memory_address);
if (p->arenaindex < narenas &&
    arenas[p->arenaindex] == p->arenabase) {
    /* it's a pymalloc address */
}
else {
    /* it isn't */
}
I like that.
No I don't: I hate that. It can be tricked.
[martin@v.loewis.de]
... The pool in question and the arena it belongs to may have different starting addresses (in 63 of 64 cases, they will be different).
Currently, they always have different addresses, as at least the first 4 bytes of an arena are used to link arenas together. If we save the arena base addresses in a vector instead, the link becomes unnecessary (well, it's not actually *used* for anything now anyway), and then the arena-rounding logic can be changed so that the first pool can coincide with the arena start if malloc just happens to return a page-aligned address for the arena (as the code is now, in that case the entire first page is devoted to holding a next-arena pointer).
Of course, you could also build the table for pool addresses, assigning pool indices - but that would require 64 times as much memory.
Bingo -- I don't want to do that.
[Tim]
, and see whether B <= p < B + 256KB (which can again be collapsed into a tricksy one-compare test via exploiting unsigned arithmetic).
[David Abrahams]
...isn't this part just a little too complicated?
I don't think so, but I am making a speed/space tradeoff.
If I understand correctly, arenas are 4K aligned pages.
Arenas are 256KB chunks with no known alignment. They're carved into 4KB-aligned pages (also called "pools") of 4KB each. Some number of the leading and trailing memory addresses in an arena are sacrificed to get page alignment in the smaller pieces (pools/pages). A given pool is in turn carved into some number of contiguous, equal-sized, 8-byte aligned small blocks.
Given an address, when you find its pool header, you either find a valid arena header that covers all 4K subsequent addresses, or some alien memory.
True, after s/arena header/pool header/.
I think you just have to look for the address of the pool header at the appropriate index in the vector.
It would consume too much memory (64x as much -- 2**18/2**12) to keep a vector of all pool header addresses. That's why I'm storing arena base addresses instead. We can't control or predict anything about the addresses we get from the system malloc() when allocating arenas, so address arithmetic tricks can't work to find arena base addresses.
From: "Tim Peters" <tim.one@comcast.net>

Arenas are 256KB chunks with no known alignment. They're carved into 4KB-aligned pages (also called "pools") of 4KB each. Some number of the leading and trailing memory addresses in an arena are sacrificed to get page alignment in the smaller pieces (pools/pages). A given pool is in turn carved into some number of contiguous, equal-sized, 8-byte aligned small blocks.
Okay, 'nuf said. I see why you need to do it that way, now. -Dave
Tim Peters <tim.one@comcast.net> writes:
After digesting some ideas from David Abrahams offlist, I believe I may have a much simpler way to make a bulletproof "is or isn't this address from a pymalloc pool?" test. Described as a diff from current pymalloc:
This sounds quite good. The only flaw is that you need to trust that the machine has paged memory - otherwise, rounding down an arbitrary address to a 4k boundary, and then reading a value may cause an access violation (or may, say, fetch data from some peripheral device, thus starting the coffee machine :-). For the architectures we care about, this seems to be guaranteed; on other architectures, people will need to disable pymalloc. Regards, Martin
[martin@v.loewis.de]
This sounds quite good. The only flaw is that you need to trust that the machine has paged memory - otherwise, rounding down an arbitrary address to a 4k boundary, and then reading a value may cause an access violation (or may, say, fetch data from some peripheral device, thus starting the coffee machine :-).
pymalloc always relied on that assumption, so that part isn't new.
For the architectures we care about, this seems to be guaranteed; on other architectures, people will need to disable pymalloc.
Either that or buy a reasonable machine <wink>. A fair number of people have tried pymalloc by now, and nobody has reported a problem. I suppose they may have been lucky, and that feeding PyMem_{FREE, Free, DEL, Del} into _PyMalloc_Free too may greatly strain their luck if so.
participants (6)
-
David Abrahams
-
Guido van Rossum
-
Kevin Jacobs
-
martin@v.loewis.de
-
Tim Peters
-
Tim Peters