<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Feb 17, 2014 at 7:31 PM, Julian Taylor <span dir="ltr">&lt;<a href="mailto:jtaylor.debian@googlemail.com" target="_blank">jtaylor.debian@googlemail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">hi,<br>

I noticed that during some simplistic benchmarks (e.g.<br>

<a href="https://github.com/numpy/numpy/issues/4310" target="_blank">https://github.com/numpy/numpy/issues/4310</a>) a lot of time is spent in<br>

the kernel zeroing pages.<br>

This is because on Linux glibc will always allocate large memory<br>

blocks with mmap. As these pages can come from other processes the<br>

kernel must zero them for security reasons.<br></blockquote><div><br></div><div>Do you have numbers for 'a lot of time'? Is the script in the issue above the exact one you used to benchmark this?</div><div><br></div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

For memory within the numpy process this is unnecessary and possibly a<br>

large overhead for the many temporaries numpy creates.<br>

<br>

The behavior of glibc can be tuned to change the threshold at which it<br>

starts using mmap but that would be a platform specific fix.<br>

<br>
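As a rough illustration of that kind of tuning, the sketch below calls glibc's mallopt through ctypes. The helper name tune_glibc_malloc is hypothetical, the two constants are the values from glibc's malloc.h, and the call only affects allocations that go through glibc malloc in the current process. On other C libraries the helper simply reports that mallopt is unavailable.<br>
<pre>
```python
import ctypes

# Parameter numbers as defined in glibc's malloc.h (assumption: glibc only).
M_MMAP_THRESHOLD = -3  # request size above which malloc switches to mmap
M_MMAP_MAX = -4        # maximum number of simultaneous mmap-backed blocks


def tune_glibc_malloc(param, value):
    """Call mallopt(param, value) on the C library of this process.

    Returns True on success, False if mallopt rejected the value, and
    None when the running C library does not export mallopt.
    """
    try:
        libc = ctypes.CDLL(None)  # symbols already loaded into the process
        mallopt = libc.mallopt
    except (OSError, AttributeError, TypeError):
        return None
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    return mallopt(param, value) == 1


if __name__ == "__main__":
    # Setting M_MMAP_MAX to 0 asks malloc not to use mmap at all.
    print(tune_glibc_malloc(M_MMAP_MAX, 0))
```
</pre>
The setting would have to be applied before the benchmark allocates its arrays, and it remains a platform-specific experiment rather than a portable fix.<br>
<br>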

I was thinking about adding a thread local cache of pointers to<br>

allocated memory.<br>

When an array is created it tries to get its memory from the cache and<br>

when it is deallocated it returns it to the cache.<br>

The threshold and cached memory block sizes could be adaptive depending<br>

on the application workload.<br>

<br>
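The idea can be sketched in pure Python as a per-thread free list keyed by block size. This is only a model of the proposal, not numpy's allocator: the names cache_alloc, cache_free and MAX_BLOCKS_PER_SIZE are hypothetical, bytearray stands in for raw memory, and the fixed cap replaces the adaptive threshold mentioned above.<br>
<pre>
```python
import threading

# Hypothetical names throughout; a model of the idea, not numpy code.
_local = threading.local()
MAX_BLOCKS_PER_SIZE = 4  # assumed fixed cap so the cache stays bounded


def _buckets():
    # Each thread lazily gets its own dict: block size -> list of free blocks.
    if not hasattr(_local, "buckets"):
        _local.buckets = {}
    return _local.buckets


def cache_alloc(nbytes):
    """Return a block of nbytes, reusing a cached block when one exists."""
    free = _buckets().get(nbytes)
    if free:
        return free.pop()  # reused block keeps its old contents
    return bytearray(nbytes)  # cache miss: fall back to a fresh allocation


def cache_free(block):
    """Hand a block back to the calling thread's cache, or drop it if full."""
    free = _buckets().setdefault(len(block), [])
    if len(free) != MAX_BLOCKS_PER_SIZE:
        free.append(block)
```
</pre>
Because the cache is thread local, no locking is needed, and a block that is reused never goes back to the kernel, so it is not zeroed again.<br>
<br>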

For simplistic temporary-heavy benchmarks this eliminates the time spent<br>

in the kernel (the system time reported by time).<br></blockquote><div><br></div><div>For this kind of measurement, I would advise looking into perf on Linux. It should be much more precise than time.</div><div><br></div><div>If nobody beats me to it, I can try to look at this this weekend.</div>

<div><br></div><div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

But I don't know how relevant this is for real world applications.<br>

Have you noticed large amounts of time spent in the kernel in your apps?<br></blockquote><div><br></div><div>In my experience, more time is spent figuring out how to save memory than on speeding up this kind of operation in 'real life applications' (TM).</div>

<div><br></div><div>What happens to your benchmark if you tune malloc to not use mmap at all?</div><div><br></div><div>David</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

I also found this paper which describes pretty much exactly what I'm<br>

proposing:<br>

<a href="http://pyhpc.org/workshop/papers/Doubling.pdf" target="_blank">pyhpc.org/workshop/papers/Doubling.pdf</a><br>

<br>

Does anyone know why their changes were never incorporated in numpy? I<br>

couldn't find a reference in the list archive.<br>


</blockquote></div><br></div></div>