Cross-post from python-ideas: Compressing the stack on the fly

Hi guys,

I made a post on the python-ideas mailing list that I was told might be relevant to PyPy. I've reproduced the original email below. Here is the thread on python-ideas with all the discussion: <https://groups.google.com/forum/?fromgroups#!topic/python-ideas/hteGSNTyC_4>

--------------

Hi everybody,

Here's an idea I had a while ago. Now, I'm an ignoramus when it comes to how programming languages are implemented, so this idea will most likely be either (a) completely impossible or (b) trivial knowledge.

I was thinking about the implementation of the factorial in Python, comparing two different solutions in my mind: the recursive one, and the one that uses a loop. Here are example implementations of each:

    def factorial_recursive(n):
        if n == 1:
            return 1
        return n * factorial_recursive(n - 1)

    def factorial_loop(n):
        result = 1
        for i in range(1, n + 1):
            result *= i
        return result

I know that the recursive one is problematic, because it puts a lot of items on the stack. In fact it uses the stack as if it were a loop variable. The stack wasn't meant to be used like that.

Then the question came to me: why? Maybe the stack could be built to handle this kind of (ab)use?

I read about tail-call optimization on Wikipedia. If I understand correctly, the gist of it is that the interpreter tries to recognize, on a frame-by-frame basis, which frames could be completely eliminated, and then it eliminates those. Then I read Guido's blog post explaining why he doesn't want it in Python. In that post he outlined four different reasons why TCO shouldn't be implemented in Python.

But then I thought, maybe you could do something smarter than eliminating individual stack frames. Maybe we could create something that is to the current implementation of the stack what `xrange` is to the old-style `range`: a smart object that allows access to any of a long list of items in it, without actually having to store those items. This would solve the first argument that Guido raises in his post, which I found to be the most substantial one.

What I'm saying is: imagine the stack of the interpreter when it runs the factorial example above for n=1000. It has around 1000 items in it and it's just about to explode. But if you looked at the contents of that stack, you'd see it's embarrassingly regular, a compression algorithm's wet dream: the same code location over and over again, with a different value for `n`.

So what I'm suggesting is an algorithm that compresses the stack on the fly: one that detects regularities in the stack and, instead of saving each individual frame, saves just the pattern. There wouldn't be any problem with showing an informative stack trace, because despite not storing every individual frame, each individual frame could still be accessed, similarly to how `xrange` allows access to each individual member without having to store all of them. The stack could then hold a lot more items, and tasks that currently require recursion (like pickling using the standard library) would be able to handle much deeper recursions.

What do you think?

Ram.
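To make the proposal concrete, here is a toy sketch of such a run-length-compressed stack. It is purely illustrative: `CompressedStack`, its layout, and the string code locations are all invented for this example, and neither CPython nor PyPy stores frames anything like this. It only demonstrates the `xrange`-style property the email asks for: 1000 logically distinct frames held as a single pattern, with every individual frame still addressable on demand.

    class CompressedStack:
        """Toy stack that run-length-encodes runs of identical code locations."""

        def __init__(self):
            # Each run is [code_location, list of per-frame locals].
            self._runs = []

        def push(self, code_location, frame_locals):
            if self._runs and self._runs[-1][0] == code_location:
                self._runs[-1][1].append(frame_locals)  # extend the current run
            else:
                self._runs.append([code_location, [frame_locals]])

        def pop(self):
            code_location, frames = self._runs[-1]
            frame_locals = frames.pop()
            if not frames:
                self._runs.pop()
            return code_location, frame_locals

        def __len__(self):
            return sum(len(frames) for _, frames in self._runs)

        def __getitem__(self, index):
            # Materialize frame `index` on demand; the code location is
            # stored once per run, not once per frame.
            for code_location, frames in self._runs:
                if index < len(frames):
                    return code_location, frames[index]
                index -= len(frames)
            raise IndexError(index)

    stack = CompressedStack()
    for n in range(1000, 0, -1):  # simulate factorial_recursive(1000)
        stack.push('factorial_recursive', {'n': n})

    print(len(stack))        # 1000 logical frames ...
    print(len(stack._runs))  # ... stored as a single run
    print(stack[500])        # ('factorial_recursive', {'n': 500})

Note that the per-frame locals are still stored individually here; only the repeated frame skeleton is shared, which is where a real implementation would have to win its memory back.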

A slightly detached view (please correct me if I'm wrong):

Most stack frames are populated with very few items compared to the allocated block (validation needed). That's awesome for compression.

However, most items on the stack are references, i.e. pointers to memory locations. Some of these repeat, e.g. `self`, though I would suspect that for most interesting tasks (e.g. traversing a graph) most references are disparate, if not unique. Granted, unique pointers still typically have a compressible prefix.

IMO CPython has the potential to be a cache thrasher: the smallest frame I could find is 448 bytes, and its LCM with 1K is 7K. In other words, after 16 recursions the values land in the same slots relative to 1K (1024 bytes), and your 1-way associative cache is done; twice that and a 2-way associative cache is done. That may be completely masked by the other memory accesses these frames do, though, so don't quote me on the performance hit here.

I suggest you capture a few deep stacks in problems from different domains and present your findings.

(*) I've no idea how PyPy implements its stack.
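A quick check of the alignment arithmetic in the reply above. The 448-byte figure is the frame size measured there; the 1K stride for cache-set aliasing is the reply's assumption about a 1-way associative cache.

    import math

    frame_size = 448  # smallest CPython frame found above, in bytes
    stride = 1024     # 1K: the assumed period at which cache sets repeat

    lcm = frame_size * stride // math.gcd(frame_size, stride)
    print(lcm)                # 7168 bytes, i.e. 7K
    print(lcm // frame_size)  # 16 recursions until frames realign modulo 1K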

participants (2)
- Dima Tisnek
- Ram Rachum