[Python-Dev] Advice sought on memory allocation latency reduction C1X standard proposal

Wed Sep 22 22:12:59 CEST 2010

Dear Python Devs,

I am seeking feedback on an ISO C1X/C++ standard library proposal 
that I hope to submit. It comes with a rationale 
(http://mallocv2.wordpress.com/) which shows how growth in RAM 
capacity is exponentially outpacing growth in RAM access speed. 
The consequences are profound: software, which has always been 
written on the assumption that RAM capacity is scarce, will need 
to be retargeted to assume that RAM access speed is scarce 
instead.

The C1X proposal (http://mallocv2.wordpress.com/the-c-proposal-text/) 
enables four things of interest to Python: (i) aligned block 
resizing, (ii) speculative in-place block resizing, (iii) batch block 
allocation, and (iv) the ability to reserve address space, thus 
avoiding the need to overallocate array storage.

Aligned block resizing is especially useful to numpy. Given an array 
of aligned SSE vector quantities, one cannot currently resize that 
block with any guarantee that its alignment will be preserved. With 
the proposed non-relocating realloc(), and the ability to pass an 
alignment to realloc(), one can avoid the memory copy, which reduces 
memory bandwidth utilisation and therefore overall memory access 
latencies.
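
To make the current situation concrete, here is a minimal sketch of 
what resizing an aligned block looks like without the proposal. The 
helper names resize_aligned() and is_aligned() are made up for 
illustration and do not appear in the proposal text; the sketch 
assumes a C library that provides the standard aligned_alloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the proposal): report whether p is
   aligned to `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper (not from the proposal): resize an aligned block
   the portable way available today. realloc() makes no promise about
   any alignment stricter than the default, so we allocate a fresh
   aligned block, copy the contents, and free the old block.
   Returns NULL on failure, leaving the old block untouched. */
void *resize_aligned(void *old, size_t old_size, size_t new_size,
                     size_t alignment)
{
    /* The alignment must be a non-zero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Round the size up to a multiple of the alignment, which the
       strictest reading of aligned_alloc() requires. */
    size_t rounded = (new_size + alignment - 1) / alignment * alignment;

    void *fresh = aligned_alloc(alignment, rounded);
    if (fresh == NULL)
        return NULL;

    if (old != NULL) {
        /* This copy is the memory traffic an aligned, non-relocating
           realloc() would try to avoid. */
        memcpy(fresh, old, old_size < new_size ? old_size : new_size);
        free(old);
    }
    return fresh;
}
```

Every call that grows the block pays for a full copy of the old 
contents, even when there is free space directly after the block.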

Address space reservation and speculative in-place block resizing 
can be combined to let Python reserve an arbitrary amount of address 
space after the storage for an array object. Should the array later 
be extended, speculative in-place resizing can attempt to expand the 
storage into that reserved space without relocating its contents. 
This again means much less memory copying and lower memory 
consumption, and once again reduces overall memory access latencies.
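
The general idea can be approximated today on POSIX systems with 
mmap() and mprotect(). The sketch below is only an analogy and is not 
the proposal's interface: the reserved_array type and its functions 
are hypothetical names, and it assumes a platform that supports 
anonymous mappings via MAP_ANONYMOUS.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* glibc: expose MAP_ANONYMOUS in strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical illustration type; not part of the proposal. */
typedef struct {
    unsigned char *base; /* start of the reserved address range */
    size_t reserved;     /* bytes of address space reserved */
    size_t committed;    /* bytes currently readable and writable */
} reserved_array;

/* Reserve address space only: the PROT_NONE mapping is inaccessible
   until part of it is opened up with mprotect(). */
int reserved_array_init(reserved_array *a, size_t reserve_bytes)
{
    assert(a != NULL);
    void *p = mmap(NULL, reserve_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->reserved = reserve_bytes;
    a->committed = 0;
    return 0;
}

/* Grow the usable storage in place. The base address never changes,
   so existing contents are never copied. Fails with -1 if the request
   exceeds the reservation. */
int reserved_array_grow(reserved_array *a, size_t new_bytes)
{
    assert(a != NULL);
    if (new_bytes > a->reserved)
        return -1;
    if (new_bytes <= a->committed)
        return 0;
    if (mprotect(a->base, new_bytes, PROT_READ | PROT_WRITE) != 0)
        return -1;
    a->committed = new_bytes;
    return 0;
}

/* Release the whole reservation. */
void reserved_array_free(reserved_array *a)
{
    assert(a != NULL);
    munmap(a->base, a->reserved);
    a->base = NULL;
    a->reserved = a->committed = 0;
}
```

Growing from 4 KiB to 64 KiB inside a 1 MiB reservation leaves 
a.base unchanged, which is the property that lets an array be 
extended without overallocating its storage up front.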

Lastly, the batch allocation mechanism allows a sequence of 
allocations to be performed in a single call. I don't know of any 
attempt to have Python use the similar functionality in Linux's 
system allocator, but Perl saw an 18% reduction in startup time 
(http://groups.google.com/group/perl-compiler/msg/31bca5297764002b).
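
As a rough illustration of the shape of a batch interface, the 
sketch below hands out several equally sized blocks from a single 
underlying malloc() call. The names batch_alloc() and batch_free() 
are hypothetical and are neither the proposal's API nor any existing 
allocator's. Unlike a real batch allocator, the blocks here cannot be 
freed individually; the whole batch is released at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the proposal): fill out[0..count-1]
   with pointers to `count` blocks of at least `size` bytes each, using
   one call into the allocator instead of `count` calls.
   Returns 0 on success and -1 on failure. */
int batch_alloc(size_t count, size_t size, void **out)
{
    const size_t align = _Alignof(max_align_t);
    assert(out != NULL && count > 0 && size > 0);

    /* Round each block up so that every pointer is suitably aligned
       for any object type, and guard against size_t overflow. */
    if (size > SIZE_MAX - (align - 1))
        return -1;
    size_t stride = (size + align - 1) / align * align;
    if (count > SIZE_MAX / stride)
        return -1;

    unsigned char *slab = malloc(count * stride);
    if (slab == NULL)
        return -1;

    for (size_t i = 0; i < count; i++)
        out[i] = slab + i * stride;
    return 0;
}

/* Release every block of a batch at once; blocks[0] is the start of
   the single underlying allocation. */
void batch_free(void **blocks)
{
    assert(blocks != NULL);
    free(blocks[0]);
}
```

The saving comes from taking the allocator's locking and bookkeeping 
path once per batch rather than once per block.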

I am not familiar with Python's implementation beyond working 
extensively with Boost.Python, so I was hoping that this list could 
advise me on what I might have overlooked, what problems this design 
could pose for Python, and any other general concerns or thoughts. I 
thank the list in advance for your time and consideration.

Niall Douglas

-- 
Technology & Consulting Services - ned Productions Limited.
http://www.nedproductions.biz/. VAT reg: IE 9708311Q. Company no: 
472909.