Length hinting and preallocation for container types

Hello,

today I came across these slides by Alex Gaynor:
https://speakerdeck.com/alex/why-python-ruby-and-javascript-are-slow

The slides aren't some random rant on Python; Alex makes some valid points. Coincidentally, he is a PyPy developer, too. ;)

One of his assertions is about memory (re)allocation. C is faster in some cases because C code usually does fewer allocations and reallocations. Python has no API to easily preallocate a list for 100 items. Code like

    lst = []
    for i in range(100):
        lst.append(i*i)

has to resize the list multiple times. PyPy has a feature to create a preallocated list: https://bitbucket.org/pypy/pypy/commits/2ff5e3c765ef/

Internally, CPython already distinguishes between the length of an object and its allocation size for some types like list, dict and bytearray. For example, PyListObject has `ob_size` for __len__ and `allocated` for the number of available `ob_item` slots.

I suggest that we add two new functions to container types like list, bytearray and dict: obj.__preallocate__(size) grows the internal buffer by size elements, and obj.__shrink__() trims the internal buffer down to len(obj) elements, or maybe a bit more. A new context manager aids users with preallocation and shrinking:

    class LengthHint:
        def __init__(self, container, hint):
            self.container = container
            self.hint = hint
            self.instance = None

        def __enter__(self):
            self.instance = self.container()
            self.instance.__preallocate__(self.hint)
            return self.instance

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.instance.__shrink__()

    with LengthHint(list, 200) as lst:
        # lst has 200 ob_item slots but len(lst) == 0
        for i in range(100):
            lst.append(i*i)
    # __exit__ shrinks ob_item to 100

The C implementation is trivial, as the three types already have all the required features.

Christian
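The resizing behavior described above is easy to observe in CPython today. Here is a small sketch (CPython-specific; the exact sizes and growth steps vary by version and platform) that watches sys.getsizeof while appending; each printed jump marks a reallocation of the ob_item buffer:

    import sys

    lst = []
    last = sys.getsizeof(lst)
    for i in range(100):
        lst.append(i * i)
        size = sys.getsizeof(lst)
        if size != last:
            # The allocated buffer grew: ob_item was reallocated (and its
            # contents possibly copied) to gain more slots.
            print("len=%3d  bytes=%4d" % (len(lst), size))
            last = size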

On Tue, 05 Mar 2013 18:05:44 +0100, Christian Heimes <christian@python.org> wrote:
Please see http://bugs.python.org/issue17338
So how about benchmark numbers? Regards Antoine.

On 05.03.2013 18:10, Antoine Pitrou wrote:
So how about benchmark numbers?
The speedup is smallish:

$ ./python -m timeit -n1000 "l = []" "l.__preallocate__(10000)" "app = l.append" "for i in range(10000): app(i)" "l.__shrink__()"
1000 loops, best of 3: 3.68 msec per loop

$ ./python -m timeit -n1000 "l = []" "app = l.append" "for i in range(10000): app(i)"
1000 loops, best of 3: 3.75 msec per loop

From: Christian Heimes <christian@python.org>
So, that's a 1.8% speedup. While doing things right gives a 20% speedup:

$ python3.3 -m timeit -n1000 "l = []" "app = l.append" "for i in range(10000): app(i)"
1000 loops, best of 3: 557 usec per loop

$ python3.3 -m timeit -n1000 "l = [i for i in range(10000)]"
1000 loops, best of 3: 447 usec per loop

Or (but obviously this isn't generally applicable):

$ python3.3 -m timeit -n1000 "l = list(range(10000))"
1000 loops, best of 3: 236 usec per loop
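For anyone who wants to reproduce the comparison from within Python rather than the shell, here is a minimal sketch using the timeit module (timings are illustrative and machine-dependent):

    import timeit

    cases = [
        ("append loop", "l = []\napp = l.append\nfor i in range(10000): app(i)"),
        ("comprehension", "l = [i for i in range(10000)]"),
        ("list(range)", "l = list(range(10000))"),
    ]
    for label, stmt in cases:
        # timeit.timeit returns the total seconds for `number` executions.
        secs = timeit.timeit(stmt, number=1000)
        print("%-13s %8.1f usec per loop" % (label, secs / 1000 * 1e6))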

On Tue, Mar 5, 2013 at 12:42 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yeah, I think it's pretty likely that any case where the proposed change would be applicable is going to be dominated by factors other than allocation overhead anyway. The above does raise the question, though: could we perhaps apply some of this thinking to list comprehensions, and speed them up even more by automatically taking hints from the inputs and preallocating the output list? (Or do we already?) Not sure how much work that would be, or whether it would be worth it; just a random thought. --Alex

On Tue, Mar 5, 2013 at 2:03 PM, Alex Stewart <foogod@gmail.com> wrote:
This was proposed here: http://bugs.python.org/issue14126 -- Philip Jenvey
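For context, the hook such an optimization would build on already exists as the length-hint protocol (PEP 424). Here is a minimal sketch, with a hypothetical Countdown class, of how an iterable can advertise its expected size so list() and friends can presize their storage (operator.length_hint appeared later, in Python 3.4):

    import operator

    class Countdown:
        """Hypothetical iterable that knows its probable length in advance."""
        def __init__(self, n):
            self.n = n
        def __iter__(self):
            return iter(range(self.n, 0, -1))
        def __length_hint__(self):
            # A hint, not a contract: consumers may preallocate based on
            # this, but must still cope with more or fewer items.
            return self.n

    print(operator.length_hint(Countdown(5)))  # -> 5
    print(list(Countdown(5)))                  # -> [5, 4, 3, 2, 1]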

On Tue, Mar 5, 2013 at 9:05 AM, Christian Heimes <christian@python.org> wrote:
The real problem is that this code is not idiomatic Python, especially if you want it to be reasonably fast:

    lst = []
    for i in range(100):
        lst.append(i*i)

Why not:

    lst = [i*i for i in range(100)]

If the "append" pattern is complex, just "preallocate" like this:

    lst = [0] * 100

And then fill it. Eli
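A minimal sketch of that "preallocate and fill" pattern, assuming the final length is known up front:

    n = 100
    lst = [0] * n          # exactly n slots, allocated once
    for i in range(n):
        lst[i] = i * i     # fill in place; the list never resizes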

tl;dr - I'm glad that PyPy exists. But IMHO Python shouldn't grow APIs to manage memory. Rather, compilers should take advantage of existing internal APIs to do a better job of automatic management, and compiler writers should suggest internal, not public, APIs where they would help the compiler writers. Eli Bendersky writes:
Not at all coincidentally. As a compiler writer (which he refers to in the slides several times) he is offended by poor performance, as measured in CPU/invocation. Good for him! I'm glad he's working on PyPy! But when compiler writers start talking language design for performance, we inevitably end up with C<0.5 wink/>.
And it shouldn't.<wink/> But why do you think allocation is slow in the general case? Sure, it involves system calls which indeed do slow things down. But there are ways to arrange that those calls are amortized (eg, preallocation of arenas, exponential reallocation for fast-growing objects). Yes, the few instructions it takes to grab a slab of memory off the heap add up. But is it worth programmer time to actually estimate correctly the size of the list in advance? What happens if the code is reused in an application where the lists are two orders of magnitude larger? Won't performance go in the tank? (According to your benchmark, I don't think anybody would ever notice.)
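To put a rough number on that amortization argument, here is a back-of-the-envelope sketch (assuming a simple doubling policy, not CPython's actual growth factor) counting how many elements get copied across all reallocations:

    def copies_during_appends(n, grow):
        """Total elements copied while appending n items under a growth policy."""
        capacity = copied = 0
        for length in range(1, n + 1):
            if length > capacity:
                copied += length - 1    # reallocation copies the existing elements
                capacity = grow(capacity)
        return copied

    n = 10000
    print(copies_during_appends(n, lambda c: max(1, 2 * c)))  # doubling: O(n) copies
    print(copies_during_appends(n, lambda c: c + 64))         # fixed step: O(n**2) copies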
shouldn't be written.<wink/> As Eli writes:
The real problem is that this code is not idiomatic Python[...].
+as-many-as-I've-got-in-my-pocket. It's "for s in strings: s0 += s" in another guise. "We gots idioms fo' dat!"
Good. Are we done now?<wink/> Python has the necessary features, let the compiler writers take advantage of them. But exposing them to the user and requiring that the user do ugly and fragile things to get a minor speedup isn't Pythonic.
But that's ugly, especially the literal "200". The fact that Python doesn't splat explicit memory management in our T-shirts is one of the reasons why we use Python. And Python does grow extensions which optimize such patterns, beautiful features at that. Back to Eli:
[If] you want it to be reasonably fast [...] why not:
lst = [i*i for i in range(100)]
Again: Alex complains that "objects may be dicts, but dicts aren't objects". The performance point is that dicts are slow. True. But again Python (not all implementations?) has an optimization (though not pretty): objects can have slots. It also has a (potentially horrible!) pessimization: properties. The point is that object member access is often the pretty way to code. If it needs to be fast, we can do that. If it needs to be small or obey DRY, we can do that, too.
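A minimal sketch of the slots optimization mentioned above; attribute storage becomes a fixed layout instead of a per-instance dict:

    class PointDict:
        def __init__(self, x, y):
            self.x = x              # attributes live in a per-instance __dict__
            self.y = y

    class PointSlots:
        __slots__ = ("x", "y")      # fixed slots, no per-instance __dict__
        def __init__(self, x, y):
            self.x = x
            self.y = y

    d = PointDict(1, 2)             # has a mutable d.__dict__
    s = PointSlots(1, 2)            # smaller; s.z = 3 would raise AttributeError
    print(d.x + d.y, s.x + s.y)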

On Wed, 06 Mar 2013 11:23:26 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
I think that's a strong argument indeed.
It depends what one calls a system call. A memory allocation needn't always incur a call to the kernel; library calls can be quite fast. Moreover, in the CPython case, there's also a custom allocator which handles all allocation requests smaller than 512 bytes. (I'm sure PyPy has its own allocator too.) Regards Antoine.

Antoine Pitrou writes:
Thank you, but it's actually a fallacy (argumentum ad hominem). I think it's true that Alex's complaints and Christian's proposals are colored by that tendency, but they deserve discussion in terms of the proposals themselves. (Thus the 0.5 wink; maybe I should have deleted it.)
If the process itself is growing, eventually it does. I don't know what best practice is: do libraries try to arrange for a bounded time cost of allocation by default? And are best practices universally implemented? What about user tuning of the allocator that happens to be a pessimization for Python? As a cross-platform application, Python needs to worry about that. And I suspect that Alex (and Christian!) would consider saving a few CPU cycles in an inner loop very important. My point is that I don't consider that appropriate for user-visible language features in Python. I realize that performance-oriented applications will chafe at the bureaucracy necessary to get optimizations like comprehensions into the language, though. I just think it's a necessary sacrifice (but it's not my sacrifice, I admit).

On Fri, 08 Mar 2013 08:16:59 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
I think most allocators would request big chunks of memory from the kernel, and then carve out the small blocks requested by the user from that. Therefore, my intuition is that a long-running process, if not leaky, should end up not stressing mmap / sbrk system calls too much.
And I suspect that Alex (and Christian!) would consider saving a few CPU cycles in an inner loop very important.
They probably would, but that is not a design point for Python. Regards Antoine.

On Fri, Mar 8, 2013 at 5:07 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I am a fan of the proposal. Imagine you are programming for a memory-constrained system. By telling the list how big it needs to be you can save precious RAM. It's a pretty standard feature to be able to hint and trim the size of data structures, just like you can usually choose the buffer size for stream operations.
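A minimal sketch of the stream precedent mentioned above (the filename is hypothetical): the io layer already lets callers hint an internal buffer size, much like __preallocate__ would for containers:

    # Ask for a 64 KiB internal buffer instead of the io default.
    with open("data.bin", "rb", buffering=1 << 16) as f:
        chunk = f.read(4096)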

On Fri, Mar 8, 2013 at 5:18 AM, Daniel Holth <dholth@gmail.com> wrote:
If it's voting time, I'm -1. Having programmed a lot of memory-constrained systems (not in Python, though): this is not how things usually work there. In a memory-constrained system, you don't "grow and shrink" your data structures. That's because growing often needs to reallocate the whole chunk and do a copy, and shrinking mostly just adds to memory fragmentation. In such systems, you usually know in advance, or at least limit, the size of data structures and pre-allocate, which is perfectly possible in Python today. Shrinking is rarely, if ever, useful. If it is, you can implement concrete data structures for your concrete needs. And Python has a lot of ways to save memory for large arrays of things (array, numpy, encoding in bytes, etc.) if one really wants to. I just don't believe the proposal will help much realistic code, and it certainly goes against the way of Python. Eli

On Sat, 09 Mar 2013 03:37:13 +1100 Steven D'Aprano <steve@pearwood.info> wrote:
Probably an accident of implementation, but I can't think of any reason to over-allocate here, so probably all implementations allocate exactly. Regards Antoine.

On Fri, 8 Mar 2013 08:18:07 -0500, Daniel Holth <dholth@gmail.com> wrote:
Is it an actual use case or are you just imagining it? :) I'm asking because, unless you are only allocating that list and all the objects contained in it already exist, limiting the list's size won't do much for the process's memory usage. Regards Antoine.

On Fri, Mar 8, 2013 at 6:04 AM, Daniel Holth <dholth@gmail.com> wrote:
That's not what you should use lists for if memory consumption matters. Use http://docs.python.org/dev/library/array.html, especially if your integers are in such a limited range. Eli
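A minimal sketch of the array-module alternative, assuming the values fit a C int typecode:

    import array
    import sys

    nums_list = list(range(10000))               # 10000 boxed int objects plus pointers
    nums_array = array.array("i", range(10000))  # packed C ints (typically 4 bytes each)

    # The array's buffer alone is far smaller than the list's pointer array,
    # and that is before counting the int objects the list points to.
    print(sys.getsizeof(nums_list), sys.getsizeof(nums_array))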

On Fri, 08 Mar 2013 16:02:01 +0100, Stefan Behnel <stefan_ml@behnel.de> wrote:
The way I read it, Daniel's message about small integers and 1-character strings was humorous. Obviously if you are memory-constrained you have better things to do than accumulating many one-character strings in a large Python list. Regards Antoine.

On Fri, Mar 8, 2013 at 10:02 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
IIUC the JIT is smart enough to give me a very efficient list of unboxed integers without having to change the type, increasing the Pythonicity of my program.

Daniel Holth, 08.03.2013 16:17:
It may or may not. It's a runtime optimiser; there's no guarantee that it will always perform "as expected". For example, it may decide to optimise your list for integer values up to 255, and when you add the value 256 for some reason, it may have to reallocate and copy the whole list. And when you remove the last 256 from the list, there is no guarantee that it will shrink your list back to the optimal size; it may just keep wasting memory. Oh, and it may actually waste memory right from the start, by not optimising your list for values up to 255 but for values up to 2**31, although all you actually wanted to store was values between 1 and 99, right?

It's always a good idea to put some thought into the choice of the right data structure for your use case. So, that being said, should we discuss extending this proposal to add a new API for Python lists that allows defining the maximum integer value you want to store in them? That would allow for some serious optimisations. Stefan

Christian Heimes, 05.03.2013 18:05:
I just read through them. I'm ok with the first part, but when it comes to the "why it's really slow" section, I get the impression that Alex has misunderstood something about Python (and maybe programming languages in general). There's no need to make "dynamic languages" C-ish, and why should they be? We'll always be using more than one language for what we do, and that's neither good nor bad, just normal.

For the few cases (more than 5%? Anyone?) where you really need to do zero-copy-whatever stuff or other low-level-I-really-know-what-I-am-doing-kind-of-things, you can just write them in a language that fits your use case and that allows you to do exactly the kind of zero-copy or bits-here-and-there or immutable data structures operations you need. That may or may not be C. It may be Fortran, it may be Lua, it may be Haskell, it may be Lisp. It depends on what you know and what you need.

CPython has a well-established reputation as an extremely good and easy-to-use integration platform, and that's its main selling point. Let's keep using it like that. Stefan

participants (10): Alex Stewart, Andrew Barnert, Antoine Pitrou, Christian Heimes, Daniel Holth, Eli Bendersky, Philip Jenvey, Stefan Behnel, Stephen J. Turnbull, Steven D'Aprano