Even in the Erlang model, the aforementioned bus contention caps the number of threads you can usefully run in any given application, assuming there&#39;s any amount of cross-thread synchronization. I wrote a blog post on this subject, based on my experience tuning RabbitMQ on NUMA architectures.<div>

<div><br></div><div><a href="http://blog.agoragames.com/blog/2011/06/24/of-penguins-rabbits-and-buses/">http://blog.agoragames.com/blog/2011/06/24/of-penguins-rabbits-and-buses/</a></div>

<div><br></div><div>It should be noted that Erlang processes are not the same as OS processes. They are more akin to green threads, scheduled onto N real OS threads, which in turn run on C cores. The end effect is the same, though: the data is effectively shared across NUMA nodes, which runs into basic physical constraints.</div>

<div><br></div><div>I used to think the GIL was a major bottleneck, and though I&#39;m not fond of it, my recent experience has highlighted that *any* application which uses shared memory will have significant bus contention when scaling across all cores. The best course of action is a shared-nothing, MPI-style design, but in 64-bit land that can mean significant wasted address space.</div>
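<div><br></div><div>To make &quot;shared-nothing&quot; a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names (split, partial_sum, shared_nothing_sum) are made up for this example, and it assumes a Unix-like system where the multiprocessing &quot;fork&quot; start method is available. Each worker process gets its own private chunk of work, and the only cross-process traffic is one small result message per chunk.</div>

```python
import multiprocessing as mp


def split(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    return [(i * n // parts, (i + 1) * n // parts) for i in range(parts)]


def partial_sum(bounds):
    """Sum a private range; nothing is shared with the other workers."""
    start, stop = bounds
    return sum(range(start, stop))


def shared_nothing_sum(n, workers=4):
    """Fan chunks out to worker processes and combine their results."""
    # Assumption: the "fork" start method exists (Unix-like systems only).
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=workers) as pool:
        # Each chunk is sent to a worker as a message; only the small
        # integer result travels back to the parent process.
        return sum(pool.map(partial_sum, split(n, workers)))


if __name__ == "__main__":
    print(shared_nothing_sum(1_000_000))
```

<div><br></div><div>The workers never touch a common data structure, so there is nothing to synchronize on beyond the result hand-off. The price is that each worker is a full process with its own interpreter and heap.</div>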

<div><br></div><div>-Aaron</div><div><br><br><div class="gmail_quote">On Fri, Aug 12, 2011 at 2:59 PM, Sturla Molden <span dir="ltr">&lt;<a href="mailto:sturla@molden.no">sturla@molden.no</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">On 12.08.2011 18:51, Xavier Morel wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

* Erlang uses &quot;erlang processes&quot;, which are very cheap preempted *processes* (no shared memory). There have always been tens to thousands to millions of erlang processes per interpreter source contention within the interpreter going back to pre-SMP by setting the number of schedulers per node to 1 can yield increased overall performances) <br>


</blockquote>

<br>

Technically, one can make threads behave like processes if they don&#39;t share memory pages (though they will still share address space). Erlang&#39;s use of &#39;process&#39; instead of &#39;thread&#39; does not mean an Erlang process has to be implemented as an OS process. With one interpreter per thread, and a malloc that does not let threads share memory pages (one heap per thread), Python could do the same.<br>


<br>

On Windows, there is an API function called HeapAlloc, which lets us allocate memory from a dedicated heap. The common use case is to prevent threads from sharing memory, thus behaving like light-weight processes (except address space is shared). On Unix, it is more common to use fork() to create new processes instead, as processes are more light-weight there than on Windows.<br>


<br>

Sturla<br><br></blockquote></div><br>

</div></div>