[numpy wishlist] Interpreter support for temporary elision in third-party classes
Hi all,

There's a very valuable optimization -- temporary elision -- which numpy can *almost* do. It gives something like a 10-30% speedup for lots of common real-world expressions. It would probably be useful for non-numpy code too. (In fact it generalizes the str += str special case that's currently hardcoded in ceval.c.) But it can't be done safely without help from the interpreter, and possibly not even then. So I thought I'd raise it here and see if we can get any consensus on whether and how CPython could support this.

=== The dream ===

Here's the idea. Take an innocuous expression like:

    result = (a + b + c) / c

This gets evaluated as:

    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are arrays with N bytes each, and N is large. For simple arithmetic like this, the costs are dominated by memory access. Allocating an N byte array requires the kernel to clear the memory, which incurs N bytes of memory traffic. If all the operands are already allocated, then performing a three-operand operation like tmp1 = a + b involves 3N bytes of memory traffic (reading the two inputs plus writing the output). In total our example does 3 allocations and has 9 operands, so it does 12N bytes of memory access.

If our arrays are small, then the kernel doesn't get involved and some of these accesses will hit the cache, but OTOH the overhead of things like malloc won't be amortized out; the best case starting from a cold cache is 3 mallocs and 6N bytes worth of cache misses (or maybe 5N if we get lucky and malloc'ing 'result' returns the same memory that tmp1 used, and it's still in cache).

There's an obvious missed optimization in this code, though, which is that it keeps allocating new temporaries and throwing away old ones. It would be better to just allocate a temporary once and re-use it:

    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1

Now we have only 1 allocation and 7 operands, so we touch only 8N bytes of memory. For large arrays -- that don't fit into cache, and for which per-op overhead is amortized out -- this gives a theoretical 33% speedup, and we can realistically get pretty close to this. For smaller arrays, the re-use of tmp1 means that in the best case we have only 1 malloc and 4N bytes worth of cache misses, and we also have a smaller cache footprint, which means this best case will be achieved more often in practice. For small arrays it's harder to estimate the total speedup here, but 66% fewer mallocs and 33% fewer cache misses is certainly enough to make a practical difference.

Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare "num_update" to "num_inplace"):

https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update. With our prototype automatic temporary elision turned on, this difference disappears -- the natural code gets 10-15% faster, *and* we remove the temptation to write horrible things like num_inplace.

What do I mean by "automatic temporary elision"? It's *almost* possible for numpy to automatically convert the first example into the second.
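For concreteness, here is a minimal sketch of what that manual rewrite looks like with today's numpy, assuming a, b and c are same-shaped float arrays (the in-place operators and the explicit out= form are interchangeable here):

    import numpy as np

    a, b, c = (np.random.rand(1000, 1000) for _ in range(3))

    # Natural form: two throwaway temporaries plus the result.
    result = (a + b + c) / c

    # Hand-elided form: one allocation, re-used for every step.
    tmp = a + b                  # the only new buffer
    tmp += c                     # or: np.add(tmp, c, out=tmp)
    tmp /= c                     # or: np.divide(tmp, c, out=tmp)
    result2 = tmp

    assert np.allclose(result, result2)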
The idea is: we want to replace

    tmp2 = tmp1 + c

with

    tmp1 += c
    tmp2 = tmp1

And we can do this by defining

    def __add__(self, other):
        if is_about_to_be_thrown_away(self):
            return self.__iadd__(other)
        else:
            ...

Now tmp1.__add__(c) does an in-place add and returns tmp1, no allocation occurs, woohoo. The only little problem is implementing is_about_to_be_thrown_away().

=== The sneaky-but-flawed approach ===

The following implementation may make you cringe, but it comes tantalizingly close to working:

    bool
    is_about_to_be_thrown_away(PyObject *obj)
    {
        return (Py_REFCNT(obj) == 1);
    }

In fact, AFAICT it's 100% correct for libraries being called by regular python code (which is why I'm able to quote benchmarks at you :-)). The bytecode eval loop always holds a reference to all operands, and then immediately DECREFs them after the operation completes. If one of our arguments has no other references besides this one, then we can be sure that it is a dead obj walking, and steal its corpse.

But this has a fatal flaw: people are unreasonable creatures, and sometimes they call Python libraries without going through ceval.c :-(. It's legal for random C code to hold an array object with a single reference count, and then call PyNumber_Add on it, and then expect the original array object to still be valid. But who writes code like that in practice? Well, Cython does. So, this is no-go.

=== A better (?) approach ===

This is a pretty arcane bit of functionality that we need, and it interacts with ceval.c, so I'm not at all confident about the best way to do it. (We even have an implementation using libunwind to walk the C stack and make sure that we're being called from ceval.c, which... works, actually, but is unsatisfactory in other ways.) I do have an idea that I *think* might work and be acceptable, but you tell me:

Proposal: We add an API call

    PyEval_LastOpDefinitelyMatches(frame, optype, *args)

which checks whether the last instruction executed in 'frame' was in fact an 'optype' instruction and did in fact have arguments 'args'. If it was, then it returns True. If it wasn't, or if we aren't sure, it returns False. The intention is that 'optype' is a semantic encoding of the instruction (like "+" or "function call") and thus can be preserved even if the bytecode details change.

Then, in numpy's __add__, we do:

1) fetch the current stack frame from TLS
2) check PyEval_LastOpDefinitelyMatches(frame, "+", arg1, arg2)
3) check for arguments with refcnt == 1
4) check that all arguments are base-class numpy array objects (i.e., PyArray_CheckExact)

The logic here is that step (2) tells us that someone did 'arg1 + arg2', so ceval.c is holding a temporary reference to the arguments, and step (3) tells us that at the time of the opcode evaluation there were no other references to these arguments, and step (4) tells us that 'arg1 + arg2' dispatched directly to ndarray.__add__ so there's no chance that anyone else has borrowed a reference in the mean time.

AFAICT PyEval_LastOpDefinitelyMatches can *almost* be implemented now; the only problem is that stack_pointer is a local variable in PyEval_EvalFrameEx, and we would need it to be accessible via the frame object. The easy way would be to just move it in there. I don't know if this would have any weird effects on speed due to cache effects, but I guess we could arrange to put it into the same cache line as f_lasti, which is also updated on every opcode?
OTOH someone has gone to some trouble to make sure that f_stacktop usually *doesn't* point to the top of the stack, and I guess there must have been some reason for this. Alternatively we could stash a pointer to stack_pointer in the frame object, and that would only need to be updated once per entry/exit to PyEval_EvalFrameEx.

Obviously there are a lot of details to work out here, like what the calling convention for PyEval_LastOpDefinitelyMatches should really be, but:

* Does this approach seem like it would successfully solve the problem?
* Does this approach seem like it would be acceptable in CPython?
* Is there a better idea I'm missing?

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
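As a rough pure-Python illustration of the kind of "was the caller executing a '+' opcode?" check being proposed -- not the proposed C API itself, and assuming CPython 3.8-3.10, where the opcode is still called BINARY_ADD and f_lasti is updated on every instruction; the names Demo and caller_is_doing_binary_add are made up for the sketch -- something like this can already be hacked up for pure-Python classes:

    import dis
    import sys

    def caller_is_doing_binary_add():
        # Frame 2 is the frame that evaluated "x + y" and thereby called __add__.
        frame = sys._getframe(2)
        for instr in dis.get_instructions(frame.f_code):
            if instr.offset == frame.f_lasti:
                return instr.opname == "BINARY_ADD"
        return False

    class Demo:
        def __init__(self, value):
            self.value = value
        def __add__(self, other):
            if caller_is_doing_binary_add():
                print("called from a '+' expression")
            return Demo(self.value + other.value)

    Demo(1) + Demo(2)    # prints the message under the assumed CPython versions

The real proposal of course needs the equivalent check at the C level, plus access to the value stack so the arguments can be compared against what the opcode actually pushed.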
On 5 June 2014 21:51, Nathaniel Smith
Is there a better idea I'm missing?
Just a thought, but the temporaries come from the stack manipulation done by the likes of the BINARY_ADD opcode. (After all, the bytecode doesn't use temporaries, it's a stack machine.) Maybe BINARY_ADD and friends could allow for an alternative fast calling convention for __add__ implementations that uses the stack slots directly? This may be something that's only plausible from C code, though. Or may not be plausible at all. I haven't looked at ceval.c for many years... If this is an insane idea, please feel free to ignore me :-)

Paul
On Thu, Jun 5, 2014 at 10:37 PM, Paul Moore
On 5 June 2014 21:51, Nathaniel Smith
wrote: Is there a better idea I'm missing?
Just a thought, but the temporaries come from the stack manipulation done by the likes of the BINARY_ADD opcode. (After all, the bytecode doesn't use temporaries, it's a stack machine.) Maybe BINARY_ADD and friends could allow for an alternative fast calling convention for __add__ implementations that uses the stack slots directly? This may be something that's only plausible from C code, though. Or may not be plausible at all. I haven't looked at ceval.c for many years...
If this is an insane idea, please feel free to ignore me :-)
To make sure I understand correctly, you're suggesting something like adding a new set of special method slots, __te_add__, __te_mul__, etc., which BINARY_ADD and friends would check for and, if found, dispatch to without going through PyNumber_Add? And this way, a type like numpy's array could have a special implementation for __te_add__ that works the same as __add__, except with the added wrinkle that it knows that it will only be called by the interpreter and thus any arguments with refcnt 1 must be temporaries?

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
On 5 June 2014 22:47, Nathaniel Smith
To make sure I understand correctly, you're suggesting something like adding a new set of special method slots, __te_add__, __te_mul__, etc.
I wasn't thinking in that much detail, TBH. I'm not sure adding a whole set of new slots is sensible for such a specialised case. I think I was more assuming that the special method implementations could use an alternative calling convention, METH_STACK in place of METH_VARARGS, for example. That would likely only be viable for types implemented in C. But either way, it may be more complicated than the advantages would justify... Paul
On Thu, Jun 5, 2014 at 11:12 PM, Paul Moore
On 5 June 2014 22:47, Nathaniel Smith
wrote: To make sure I understand correctly, you're suggesting something like adding a new set of special method slots, __te_add__, __te_mul__, etc.
I wasn't thinking in that much detail, TBH. I'm not sure adding a whole set of new slots is sensible for such a specialised case. I think I was more assuming that the special method implementations could use an alternative calling convention, METH_STACK in place of METH_VARARGS, for example. That would likely only be viable for types implemented in C.
But either way, it may be more complicated than the advantages would justify...
Oh, I see, that's clever. But, unfortunately, most __special__ methods at the C level don't use METH_*, they just have hard-coded calling conventions: https://docs.python.org/3/c-api/typeobj.html#number-structs

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
On 6/5/2014 4:51 PM, Nathaniel Smith wrote:
In fact, AFAICT it's 100% correct for libraries being called by regular python code (which is why I'm able to quote benchmarks at you :-)). The bytecode eval loop always holds a reference to all operands, and then immediately DECREFs them after the operation completes. If one of our arguments has no other references besides this one, then we can be sure that it is a dead obj walking, and steal its corpse.
But this has a fatal flaw: people are unreasonable creatures, and sometimes they call Python libraries without going through ceval.c :-(. It's legal for random C code to hold an array object with a single reference count, and then call PyNumber_Add on it, and then expect the original array object to still be valid. But who writes code like that in practice? Well, Cython does. So, this is no-go.
I understand that a lot of numpy/scipy code is compiled with Cython, so you really want the optimization to continue working when so compiled. Is there a simple change to Cython that would work, perhaps in coordination with a change to numpy? If so, you could get the result before 3.5 comes out. I realized that there are other compilers than Cython and non-numpy code that could benefit, so that a more generic solution would also be good. In particular:
Here's the idea. Take an innocuous expression like:
result = (a + b + c) / c
This gets evaluated as:
    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c

...

There's an obvious missed optimization in this code, though, which is that it keeps allocating new temporaries and throwing away old ones. It would be better to just allocate a temporary once and re-use it:

    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1
Could this transformation be done in the ast? And would that help? A prolonged discussion might be better on python-ideas. See what others say. -- Terry Jan Reedy
On 5 Jun 2014 23:58, "Terry Reedy"
On 6/5/2014 4:51 PM, Nathaniel Smith wrote:
In fact, AFAICT it's 100% correct for libraries being called by regular python code (which is why I'm able to quote benchmarks at you :-)). The bytecode eval loop always holds a reference to all operands, and then immediately DECREFs them after the operation completes. If one of our arguments has no other references besides this one, then we can be sure that it is a dead obj walking, and steal its corpse.
But this has a fatal flaw: people are unreasonable creatures, and sometimes they call Python libraries without going through ceval.c :-(. It's legal for random C code to hold an array object with a single reference count, and then call PyNumber_Add on it, and then expect the original array object to still be valid. But who writes code like that in practice? Well, Cython does. So, this is no-go.
I understand that a lot of numpy/scipy code is compiled with Cython, so you really want the optimization to continue working when so compiled. Is there a simple change to Cython that would work, perhaps in coordination with a change to numpy? If so, you could get the result before 3.5 comes out.

Unfortunately we don't actually know whether Cython is the only culprit (such code *could* be written by hand), and even if we fixed Cython it would take some unknowable amount of time before all downstream users upgraded their Cythons. (It's pretty common for projects to check in Cython-generated .c files, and only regenerate when the Cython source actually gets modified.) Pretty risky for an optimization.

I realized that there are other compilers than Cython and non-numpy code that could benefit, so that a more generic solution would also be good. In particular:
Here's the idea. Take an innocuous expression like:
result = (a + b + c) / c
This gets evaluated as:
    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c

...
There's an obvious missed optimization in this code, though, which is that it keeps allocating new temporaries and throwing away old ones. It would be better to just allocate a temporary once and re-use it:

    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1
Could this transformation be done in the ast? And would that help?
I don't think it could be done in the ast because I don't think you can work with anonymous temporaries there. But, now that you mention it, it could be done on the fly in the implementation of the relevant opcodes. I.e., BINARY_ADD could do

    if (Py_REFCNT(left) == 1)
        result = PyNumber_InPlaceAdd(left, right);
    else
        result = PyNumber_Add(left, right);

Upside: all packages automagically benefit!

Potential downsides to consider:

- Subtle but real and user-visible change in Python semantics. I'd be a little nervous about whether anyone has implemented, say, an __iadd__ with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed. Maybe this doesn't make sense though.

- Only works when the left operand is the temporary ("remember that a*b+c is faster than c+a*b"), and only for arithmetic (no benefit for np.sin(a + b)). Probably does cover the majority of cases though.
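To make that first downside concrete, here is a toy pure-Python sketch (the class and its name are invented for the example) of how an __iadd__ with side effects would let user code observe whether the interpreter elided the copy:

    log = []

    class Chatty:
        def __init__(self, value):
            self.value = value
        def __add__(self, other):
            return Chatty(self.value + other)
        def __iadd__(self, other):
            log.append("in-place add")    # externally visible side effect
            self.value += other
            return self

    result = (Chatty(1) + 2) + 3    # the parenthesized temporary has refcount 1
    print(log)    # [] today; non-empty if BINARY_ADD silently switched to in-place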
A prolonged discussion might be better on python-ideas. See what others say.
Yeah, I wasn't sure which list to use for this one, happy to move if it would work better. -n
On Fri, Jun 6, 2014 at 11:47 AM, Nathaniel Smith
Unfortunately we don't actually know whether Cython is the only culprit (such code *could* be written by hand), and even if we fixed Cython it would take some unknowable amount of time before all downstream users upgraded their Cythons. (It's pretty common for projects to check in Cython-generated .c files, and only regenerate when the Cython source actually gets modified.) Pretty risky for an optimization.
But code will still work, right? I mean, you miss out on an optimization, but it won't actually be wrong code? It should be possible to say "After upgrading to Cython version x.y, regenerate all your .c files to take advantage of this new optimization". ChrisA
Nathaniel Smith wrote:
I.e., BIN_ADD could do
    if (Py_REFCNT(left) == 1)
        result = PyNumber_InPlaceAdd(left, right);
    else
        result = PyNumber_Add(left, right);
Upside: all packages automagically benefit!
Potential downsides to consider: - Subtle but real and user-visible change in Python semantics.
That would be a real worry. Even if such cases were rare, they'd be damnably difficult to debug when they did occur. I think for safety's sake this should only be done if the type concerned opts in somehow, perhaps by a tp_flag indicating that the type is eligible for temporary elision. -- Greg
Nathaniel Smith wrote:
I'd be a little nervous about whether anyone has implemented, say, an iadd with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed.
I can think of at least one plausible scenario where this could occur: the operand is a view object that wraps another object, and its __iadd__ method updates that other object. In fact, now that I think about it, exactly this kind of thing happens in numpy when you slice an array! So the opt-in indicator would need to be dynamic, on a per-object basis, rather than a type flag. -- Greg
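A small numpy illustration of that point (a sketch; any throwaway slice behaves this way):

    import numpy as np

    a = np.arange(5)
    result = a[:3] + 1        # a[:3] is a temporary view with refcount 1
    print(a)                  # [0 1 2 3 4] -- the base array is untouched

    view = a[:3]
    view += 1                 # the "elided" form writes through the view
    print(a)                  # [1 2 3 4 5] -- so the elision would be observable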
On 06.06.2014 04:26, Greg Ewing wrote:
Nathaniel Smith wrote:
I'd be a little nervous about whether anyone has implemented, say, an iadd with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed.
I can think of at least one plausible scenario where this could occur: the operand is a view object that wraps another object, and its __iadd__ method updates that other object.
In fact, now that I think about it, exactly this kind of thing happens in numpy when you slice an array!
So the opt-in indicator would need to be dynamic, on a per-object basis, rather than a type flag.
Yes, an opt-in indicator would need to receive both operand objects, so it would need to be a slot in the object or number type object.

Would the addition of a tp_can_elide slot to the object types be acceptable for this rather specialized case? tp_can_elide receives two objects and returns one of three values:

* can work inplace, operation is commutative
* can work inplace but not commutative
* cannot work inplace

Implementation could e.g. look about like this:

    TARGET(BINARY_SUBTRACT) {
        fl = left->ob_type->tp_can_elide;
        fr = right->ob_type->tp_can_elide;
        elide = 0;
        if (unlikely(fl)) {
            elide = fl(left, right);
        }
        else if (unlikely(fr)) {
            elide = fr(left, right);
        }
        if (unlikely(elide == YES) && left->refcnt == 1) {
            PyNumber_InPlaceSubtract(left, right);
        }
        else if (unlikely(elide == SWAPPABLE) && right->refcnt == 1) {
            PyNumber_InPlaceSubtract(right, left);
        }
        else {
            PyNumber_Subtract(left, right);
        }
    }
Julian Taylor wrote:
tp_can_elide receives two objects and returns one of three values:
* can work inplace, operation is commutative
* can work inplace but not commutative
* cannot work inplace
Does it really need to be that complicated? Isn't it sufficient just to ask the object potentially being overwritten whether it's okay to overwrite it? I.e. a parameterless method returning a boolean. -- Greg
Greg Ewing
Julian Taylor wrote:
tp_can_elide receives two objects and returns one of three values:
* can work inplace, operation is commutative
* can work inplace but not commutative
* cannot work inplace
Does it really need to be that complicated? Isn't it sufficient just to ask the object potentially being overwritten whether it's okay to overwrite it?
How can it know this without help from the interpreter? Sturla
On Sat, Jun 7, 2014 at 1:37 AM, Greg Ewing
Julian Taylor wrote:
tp_can_elide receives two objects and returns one of three values:
* can work inplace, operation is commutative
* can work inplace but not commutative
* cannot work inplace
Does it really need to be that complicated? Isn't it sufficient just to ask the object potentially being overwritten whether it's okay to overwrite it? I.e. a parameterless method returning a boolean.
For the numpy case, we really need to see all the operands, *and* know what the operation in question is. Consider

    tmp1 = np.ones((3, 1))
    tmp2 = np.ones((1, 3))
    tmp1 + tmp2

which returns an array with shape (3, 3). Both input arrays are temporaries, but neither of them can be stolen to use for the output array. Or suppose 'a' is an array of integers and 'b' is an array of floats; then 'a + b' and 'a += b' have very different results (the former upcasts 'a' to float, the latter has to either downcast 'b' to int or raise an error). But the casting rules depend on the particular input types and the particular operation -- operations like & and << want to cast to int, < and > return bools, etc. So one really needs to know all the details of the operation before one can determine whether temporary elision is possible.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
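A runnable sketch of those two cases (array names made up; whether the in-place add errors or downcasts depends on the numpy version):

    import numpy as np

    tmp1 = np.ones((3, 1))
    tmp2 = np.ones((1, 3))
    print((tmp1 + tmp2).shape)     # (3, 3): neither input buffer can hold the result

    a = np.arange(3)               # integer dtype
    b = np.ones(3)                 # float dtype
    print((a + b).dtype)           # float64: the out-of-place add upcasts 'a'
    try:
        a += b                     # the in-place add cannot upcast its target...
    except TypeError as exc:       # ...so recent numpy raises; older versions downcast
        print(exc)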
Nathaniel Smith wrote:
For the numpy case, we really need to see all the operands, *and* know what the operation in question is...
Okay, I see what you mean now. Given all that, it might be simpler just to have the method perform the operation itself if it can. It has all the information necessary to do so, after all. This would also make it possible for the inplace operators to have different semantics from temp-elided non-inplace ones if desired. -- Greg
Nathaniel Smith
    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1
Could this transformation be done in the ast? And would that help?
I don't think it could be done in the ast because I don't think you can work with anonymous temporaries there. But, now that you mention it, it could be done on the fly in the implementation of the relevant opcodes. I.e., BIN_ADD could do
    if (Py_REFCNT(left) == 1)
        result = PyNumber_InPlaceAdd(left, right);
    else
        result = PyNumber_Add(left, right);
Upside: all packages automagically benefit!
Potential downsides to consider: - Subtle but real and user-visible change in Python semantics. I'd be a little nervous about whether anyone has implemented, say, an iadd with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed. Maybe this doesn't make sense though.
Hmm. I don't think this is as unlikely as it may sound. Consider eg the h5py module:

    with h5py.File('database.h5') as fh:
        result = fh['key'] + np.ones(42)

If this were transformed to

    with h5py.File('database.h5') as fh:
        tmp = fh['key']
        tmp += np.ones(42)
        result = tmp

then the database.h5 file would get modified, *and* result would be of type h5py.Dataset rather than np.ndarray.

Best,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
Nathaniel Smith
Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare "num_update" to "num_inplace"):
https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace
num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update.
Does it really have to be that ugly? Shouldn't using

    tmp += u[2:,1:-1]
    tmp *= dy2

instead of

    np.add(tmp, u[2:,1:-1], out=tmp)
    np.multiply(tmp, dy2, out=tmp)

give the same performance? (Yes, not as nice as what you're proposing, but I'm still curious.)

Best,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
On 6 Jun 2014 02:16, "Nikolaus Rath"
Nathaniel Smith
writes: Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare "num_update" to "num_inplace"):
https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace
num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update.
Does it really have to be that ugly? Shouldn't using
    tmp += u[2:,1:-1]
    tmp *= dy2
instead of
    np.add(tmp, u[2:,1:-1], out=tmp)
    np.multiply(tmp, dy2, out=tmp)
give the same performance? (yes, not as nice as what you're proposing, but I'm still curious).
Yes, only the last line actually requires the out= syntax; everything else could use in-place operators instead (and automatic temporary elision wouldn't work for the last line anyway). I guess whoever wrote it did it that way for consistency (and perhaps in hopes of eking out a tiny bit more speed -- in numpy currently the in-place operators are implemented by dispatching to function calls like those). Not sure how much difference it really makes in practice though. It'd still be 8 statements and two named temporaries to do the work of one infix expression, with order of operations implicit.

-n
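A minimal sketch of that equivalence, with made-up shapes rather than the actual benchmark arrays: the in-place operators and the explicit out= calls go through the same ufunc machinery, and out= only becomes necessary when the destination isn't the left-hand operand (e.g. a slice of an existing array).

    import numpy as np

    u = np.random.rand(6, 6)
    dy2 = 0.25

    tmp = u[:-2, 1:-1] + u[2:, 1:-1]    # one temporary, shape (4, 4)
    tmp += u[1:-1, :-2]                 # same effect as np.add(tmp, u[1:-1, :-2], out=tmp)
    tmp *= dy2                          # same effect as np.multiply(tmp, dy2, out=tmp)

    # Only the final write into a slice of an existing array needs out=:
    np.divide(tmp, 2.0, out=u[1:-1, 1:-1])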
On 05/06/14 22:51, Nathaniel Smith wrote:
This gets evaluated as:
    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c
All these temporaries are very expensive. Suppose that a, b, c are arrays with N bytes each, and N is large. For simple arithmetic like this, the costs are dominated by memory access. Allocating an N byte array requires the kernel to clear the memory, which incurs N bytes of memory traffic.
It seems to be the case that a large portion of the run-time in Python code using NumPy can be spent in the kernel zeroing pages (which the kernel does for security reasons). I think this can also be seen as a 'malloc problem'. It comes about because each new NumPy array starts with a fresh buffer allocated by malloc. Perhaps buffers can be reused? Sturla
On 06.06.2014 04:18, Sturla Molden wrote:
On 05/06/14 22:51, Nathaniel Smith wrote:
This gets evaluated as:
    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c
All these temporaries are very expensive. Suppose that a, b, c are arrays with N bytes each, and N is large. For simple arithmetic like this, the costs are dominated by memory access. Allocating an N byte array requires the kernel to clear the memory, which incurs N bytes of memory traffic.
It seems to be the case that a large portion of the run-time in Python code using NumPy can be spent in the kernel zeroing pages (which the kernel does for security reasons).
I think this can also be seen as a 'malloc problem'. It comes about because each new NumPy array starts with a fresh buffer allocated by malloc. Perhaps buffers can be reused?
Sturla
Caching memory inside of numpy would indeed solve this issue too. There has even been a paper written on this which contains some more serious benchmarks than the laplace case, which runs on very old hardware (and whose in-place and out-of-place versions are actually not the same: one computes array/scalar, the other array * (1/scalar)):

hiperfit.dk/pdf/Doubling.pdf

"The result is an improvement of as much as 2.29 times speedup, on average 1.32 times speedup across a benchmark suite of 15 applications"

The problem with this approach is that it is already difficult enough to handle memory in numpy. Having a cache that potentially stores gigabytes of memory out of the user's sight will just make things worse. This would not be needed if we can come up with a way for python to help numpy elide the temporaries.
Julian Taylor
The problem with this approach is that it is already difficult enough to handle memory in numpy.
I would not do this in a way that complicates memory management in NumPy. I would just replace malloc and free with temporarily cached versions. From the perspective of NumPy the API should be the same.
Having a cache that potentially stores gigabytes of memory out of the users sight will just make things worse.
Buffers don't need to stay in cache forever, just long enough to allow reuse within an expression. We are probably talking about delaying the call to free by just a few microseconds. We could e.g. have a setup like this:

NumPy thread on "malloc":
- tries to grab memory off the internal heap
- calls system malloc on failure

NumPy thread on "free":
- returns a buffer to the internal heap
- signals a condition

Background daemonic GC thread:
- wakes after sleeping on the condition
- sleeps for another N microseconds (N = magic number)
- flushes or shrinks the internal heap with system free
- goes back to sleeping on the condition

It can be implemented with the same API as malloc and free, and plugged directly into the existing NumPy code. We would in total need two mutexes, one condition variable, a pthread, and a heap.

Sturla
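For illustration, a very rough pure-Python model of the malloc/free side of that scheme (the background flush thread is replaced by a simple size cap here, and the BufferCache name is invented for the sketch):

    import collections

    class BufferCache:
        """Reuse recently freed buffers of the same size instead of allocating anew."""
        def __init__(self, max_cached_bytes=64 * 1024 * 1024):
            self._free = collections.defaultdict(collections.deque)  # size -> buffers
            self._cached_bytes = 0
            self._max = max_cached_bytes

        def malloc(self, nbytes):
            pool = self._free[nbytes]
            if pool:
                self._cached_bytes -= nbytes
                return pool.pop()        # reuse: no fresh allocation, warm cache
            return bytearray(nbytes)     # fall back to a fresh allocation

        def free(self, buf):
            if self._cached_bytes + len(buf) <= self._max:
                self._free[len(buf)].append(buf)
                self._cached_bytes += len(buf)
            # else: drop the reference and let the allocator reclaim it

    cache = BufferCache()
    a = cache.malloc(8 * 1024)
    cache.free(a)
    b = cache.malloc(8 * 1024)
    assert b is a                        # the buffer was recycled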
On 6 Jun 2014 17:07, "Sturla Molden"
We would in total need two mutexes, one condition variable, a pthread, and a heap.
The proposal in my initial email requires zero pthreads, and is substantially more effective. (Your proposal reduces only the alloc overhead for large arrays; mine reduces both alloc and memory access overhead for both large and small arrays.) -n
Nathaniel Smith
The proposal in my initial email requires zero pthreads, and is substantially more effective. (Your proposal reduces only the alloc overhead for large arrays; mine reduces both alloc and memory access overhead for boyh large and small arrays.)
My suggestion prevents the kernel from zeroing pages in the middle of a computation, which is an important part. It would also be an optimization the Python interpreter could benefit from independently of NumPy, by allowing reuse of allocated memory pages within CPU-bound portions of the Python code. And no, the method I suggested does not only work for large arrays.

If we really want to take out the memory access overhead, we need to consider lazy evaluation. E.g. a context manager that collects a symbolic expression and triggers evaluation on exit:

    with numpy.accelerate:
        x = <expression>
        y = <expression>
        z = <expression>
    # evaluation of x, y, z happens here

Sturla
On Fri, Jun 6, 2014 at 11:33 PM, Sturla Molden
Nathaniel Smith
wrote: The proposal in my initial email requires zero pthreads, and is substantially more effective. (Your proposal reduces only the alloc overhead for large arrays; mine reduces both alloc and memory access overhead for both large and small arrays.)
My suggestion prevents the kernel from zeroing pages in the middle of a computation, which is an important part. It would also be an optimization the Python interpreter could benefit from independently of NumPy, by allowing reuse of allocated memory pages within CPU-bound portions of the Python code. And no, the method I suggested does not only work for large arrays.
Small allocations are already recycled within process and don't touch the kernel, so your method doesn't affect them at all. My guess is that PyMalloc is unlikely to start spawning background threads any time soon, but if you'd like to propose it maybe you should start a new thread for that?
If we really want to take out the memory access overhead, we need to consider lazy evaluation. E.g. a context manager that collects a symbolic expression and triggers evaluation on exit:
    with numpy.accelerate:
        x = <expression>
        y = <expression>
        z = <expression>
    # evaluation of x, y, z happens here
Using an alternative evaluation engine is indeed another way to optimize execution, which is why projects like numexpr, numba, theano, etc. exist. But this is basically switching to a different language in a different VM. I think people will keep running plain-old-CPython code for some time yet, and the point of this thread is that there's some low-hanging fruit for making all *that* code faster.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Nathaniel Smith
    with numpy.accelerate:
        x = <expression>
        y = <expression>
        z = <expression>
    # evaluation of x, y, z happens here
Using an alternative evaluation engine is indeed another way to optimize execution, which is why projects like numexpr, numba, theano, etc. exist. But this is basically switching to a different language in a different VM.
I was not thinking of anything that complicated. Let us focus on what an unmodified CPython can do. A compound expression with arrays can also be seen as a pipeline. Imagine what would happen if in "NumPy 2.0" arithmetic operators returned coroutines instead of temporary arrays. That way an expression could be evaluated chunkwise, and the chunks would be small enough to fit in cache.

Sturla
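For a sense of what chunkwise evaluation buys, here is a plain-Python sketch of the pipeline idea, written with an explicit loop rather than coroutines (function and array names made up):

    import numpy as np

    def evaluate_chunked(a, b, c, chunk=4096):
        """Compute (a + b + c) / c one cache-sized chunk at a time."""
        out = np.empty_like(a)
        for start in range(0, len(a), chunk):
            s = slice(start, start + chunk)
            tmp = a[s] + b[s]    # temporaries here are small enough to stay in cache
            tmp += c[s]
            tmp /= c[s]
            out[s] = tmp
        return out

    a, b, c = (np.random.rand(1000000) for _ in range(3))
    assert np.allclose(evaluate_chunked(a, b, c), (a + b + c) / c)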
participants (8)
- Chris Angelico
- Greg Ewing
- Julian Taylor
- Nathaniel Smith
- Nikolaus Rath
- Paul Moore
- Sturla Molden
- Terry Reedy