[pypy-dev] 1.4.1 threading

Paolo Giarrusso p.giarrusso at gmail.com
Fri Dec 31 17:59:05 CET 2010

On Tue, Dec 28, 2010 at 02:47, Dima Tisnek <dimaqq at gmail.com> wrote:
> Thanks for the RTFP pointer, the Triplett ref, hash tab via rt is
> really interesting!
> I think I'll have to read it again and again to get all out of it...

Ah, I remember that! I never found enough time to read it properly... I
also remember that I didn't like it at the time, because it seemed to
give a new name to an existing design practice in the RCU community. I
didn't realize that even if the article just described existing work,
the generalization alone might be an important contribution.

> This brings me to a new question: how much is pypy python memory model
> allowed to diverge from that of cpython?

In fact, Jython and IronPython already have different memory models.

But the most important answer is that to avoid reinventing the wheel,
one needs to know the state of the art, i.e. the work by Sarita Adve,
Jeremy Manson, William Pugh and Hans Boehm on memory models for Java
and C++.

The simplest introduction I know of is a blog post by Jeremy Manson.

As a matter of fact, this work is often totally ignored in discussions
about Python's memory model. That doesn't surprise me: it's complex
stuff, and understanding it doesn't matter for today's Python
programming.

The key concept of these models is that _properly synchronized_
programs get "sequentially consistent" semantics: the program behaves
as if all memory operations were interleaved in a total order on a
single coherent memory, without reorderings due to caches, compiler
optimizations and so on. Sequential consistency is the key property of
what you call the "GIL-based monotonic world".
Exactly as you suggested for dictionaries, but more generally,
programs which are not properly synchronized (which exhibit so-called
data races) are more or less left on their own. On this topic the
Java and C++ models differ, and that leaves a design space open, but
one which is already somewhat smaller, and which doesn't affect the
semantics of properly synchronized programs. Java tries to define the
semantics of data races; C++ just gives up.
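To make the distinction concrete, here is a minimal Python sketch (the
names and counts are mine, not from the thread): with the lock held,
the program is properly synchronized and the result is deterministic;
remove the lock and `counter += 1` becomes a read-modify-write data
race that can silently lose updates.

```python
# Properly synchronized counter: with the lock, the final value is
# deterministic. Without it, `counter += 1` is a read-modify-write
# that two threads can interleave, losing increments.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # remove this, and updates may be lost
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # always 400000 with the lock held
```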

* Unlike C++ and like Java, it is probably important to guarantee that
programs with data races (not properly synchronized memory accesses)
don't crash the interpreter or corrupt native memory.
You don't want native code in the interpreter to crash or to loop if
the programmer forgets synchronization - simply because that's very
difficult to debug, short of running a debugger on the interpreter
itself (or fixing the code without a debugger at all).

It is not clear, though, if one should give specific semantics to
programs with data races, other than ensuring the interpreter doesn't
crash: basically, it looks too complex. For whoever understands the
rest of the Java memory model, there is a blog post explaining the
semantics they use to model this.

* Another difference between Java and C++ is about security
guarantees: it is crucial in Java that untrusted code cannot read
sensitive data (e.g., a password) managed by trusted code, not even
through a data race. Some kinds of behavior would allow
"out-of-thin-air values", which in practice might be such sensitive
data, if memory is zeroed by the memory allocator but one thread still
sees the old value.

I believe the most up-to-date introduction, with a useful
bibliography, is by S. Adve and H. Boehm [1]. Then you will probably
want to read [2] and [3] about C++ and Java respectively. Through
Google Scholar, you should find copies of all of these on CiteSeerX.
[1] Sarita V. Adve and Hans-J. Boehm. 2010. Memory models: a case for
rethinking parallel languages and hardware. Communications of the ACM.
[2] Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++
concurrency memory model. PLDI '08
[3] Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java
memory model. POPL '05.

> For example, dict lookups, etc, could be relativistic in the absence
> of explicit synchronization, that is, dict updates are not guaranteed
> to propagate [fast] until [some] thread acquires a lock [or
> something].

Talking about dictionaries restricts the scope unnecessarily. You want
to relax these requirements for memory operations in general.

> Would this be a worthwhile departure from GIL-based monotonic world?

Years of research agree the answer is yes, no doubt. Otherwise several
fundamental optimizations become impossible - including the use of
memory caches.
To give a simple example, releasing a lock requires flushing pending
writes (i.e. a cache miss, ~100 cycles nowadays). Implementing
sequential consistency (what you term the "monotonic world") would
mean doing that after _each_ memory write.
If you read Adve and Boehm [1], they elaborate more on the topic.
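A rough illustration of that cost from the Python side (a toy
measurement, not a rigorous benchmark; the function names are my own):
paying for a lock acquire/release on every single write is roughly
what per-write sequential consistency would impose on everything.

```python
# Toy comparison: plain writes vs. writes paying a lock
# acquire/release each time. The locked version is typically several
# times slower, which is the per-write cost the text alludes to.
import threading
import timeit

lock = threading.Lock()

def plain_writes(n=100_000):
    x = 0
    for _ in range(n):
        x += 1
    return x

def locked_writes(n=100_000):
    x = 0
    for _ in range(n):
        with lock:          # synchronization cost on every write
            x += 1
    return x

t_plain = timeit.timeit(plain_writes, number=10)
t_locked = timeit.timeit(locked_writes, number=10)
print(f"plain: {t_plain:.3f}s  locked: {t_locked:.3f}s")
```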

> That is thread-safe code, that uses locks or some other mechanism
> [known to pypy] would see operations timely and in order, while code
> that doesn't synchronize its operations may see somewhat outdated
> shared state, as if threads executed in slightly different global
> order.

Yes. However, the programmer can't and shouldn't synchronize, say,
field lookup, so those dictionaries (and only those) need to be
thread-safe. Jython uses a ConcurrentHashMap for this purpose, even
though that is more thread-safe than needed (there need be no
thread-safety guarantee on the values of the map).

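One way to build such a read-mostly mapping without locking the read
path is the copy-on-write approach discussed further down in this
thread. This is a hypothetical Python sketch (the class and method
names are mine, and this is not how PyPy or Jython actually implement
it); it assumes that rebinding an attribute is an atomic pointer swap
under CPython's GIL.

```python
# RCU/copy-on-write style mapping: readers just dereference
# `self._data` without locking, while writers build a new dict under
# a lock and publish it in a single (atomic) attribute assignment.
import threading

class CowDict:
    def __init__(self):
        self._data = {}                       # snapshot seen by readers
        self._write_lock = threading.Lock()

    def get(self, key, default=None):
        return self._data.get(key, default)   # no lock on the read path

    def set(self, key, value):
        with self._write_lock:                # serialize writers only
            new = dict(self._data)            # copy ...
            new[key] = value                  # ... modify ...
            self._data = new                  # ... publish atomically

d = CowDict()
d.set("x", 1)
print(d.get("x"))  # 1
```

Readers always see either the old or the new snapshot, never a
half-updated one; only writers pay for the lock and the copy.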
> About profiling, how do I distinguish between attribute vs globals vs
> builtins lookups in the profile?

Others can surely answer better, since I don't know the code. I guess
I'd add a "type" field to the dictionary to store such information, or
maybe annotate the call sites, depending on what's easier; but I also
guess this naive solution could require lots of work.

> d.
> On 27 December 2010 12:50, Paolo Giarrusso <p.giarrusso at gmail.com> wrote:
>> On Mon, Dec 27, 2010 at 19:54, Dima Tisnek <dimaqq at gmail.com> wrote:
>>> what do you think of liburcu? lttng.org/urcu
>> I've studied RCU in the Linux kernel (from which URCU derives) and
>> thought for a long time of using it for this problem. In short, once
>> you have GC, RCU (as in the Linux kernel) becomes (almost) trivial,
>> because RCU is almost entirely about how to delay freeing objects.
>> The RCU-GC connection is also mentioned by Documentation/RCU/RTFP.txt
>> in the Linux sources.
>> So, you need to just read about persistent data structures, and
>> remember that persistence is not the point - data structures which are
>> just almost persistent can also be OK, if you get them right. If you
>> replace a value in a dictionary, for instance, you don't necessarily
>> need to keep the old version around, as long as the value fits in
>> a memory word so that it can be atomically replaced.
>> http://hackerboss.com/copy-on-write-101-part-2-putting-it-to-use/
>> Speaking of libraries, you might want to look at this for the atomic
>> primitives you'll need:
>> www.hpl.hp.com/research/linux/atomic_ops/
>> I'm more used to the primitives of the Linux kernel, and I remember
>> I didn't fully like that library, but IIRC it was a matter of taste -
>> the library is by Hans Boehm, which makes it trustworthy.
>>> p.s. if/when I get around to profiling pypy, which is something I want
>>> to do, I'll be sure to share the results here.
>> Great, thanks!
>>> d.
>>> On 27 December 2010 10:57, Paolo Giarrusso <p.giarrusso at gmail.com> wrote:
>>>> On Mon, Dec 27, 2010 at 07:00, Dima Tisnek <dimaqq at gmail.com> wrote:
>>>>> (dan) random binary trees: O(log2(n)) is 7.2 for builtins, though the
>>>>> constant factor for locks, etc, might make it worthwhile
>>>> "7.2"? What is that? Do you mean 7.2=log2(no. of builtins)?
>>>>> (paolo) non blocking hash maps: memory barriers can be quite costly too
>>>> Well, you just _cannot_ avoid them, whatever your fancy data structure is.
>>>> You can only move most of the cost (or all of it) to write operations
>>>> - which is helpful if reads are most common, a case which you discuss
>>>> in your mail.
>>>> == Why memory barriers are required (in more detail) ==
>>>> If you use persistent data structures, also called copy-on-write or
>>>> purely functional (and that's the most obvious way to use a tree in a
>>>> parallel setting), the whole synchronization problem is reduced to
>>>> exchanging a pointer (to the root of the tree) between a writer and a
>>>> reader thread.
>>>> Let's consider this in the Java memory model (the state of the art).
>>>> In this model the writer thread has to use an expensive StoreLoad
>>>> barrier, i.e. mfence on x86 [1], costing more or less as much as a
>>>> cache miss, while the reader needs just a cheap Load* barrier (for
>>>> free on x86, cheap on other processors).
>>>> If you don't use a volatile field, there is very little guarantee
>>>> about what your program means, and it is almost always buggy; the
>>>> equivalent in the new C++ memory model (which is otherwise very
>>>> similar) gives completely undefined semantics (and even crashes are
>>>> possible in practice, for instance if the reader thread sees a
>>>> not-fully-initialized object because of the race and invokes a virtual
>>>> method on it).
>>>> [1] http://g.oswego.edu/dl/jmm/cookbook.html
>>>> == Other stuff ==
>>>>> different options will have to be implemented and tested when we get there,
>>>>> speaking on which, is there a test dict load?
>>>>> does anyone have some profiling data, what portion of runtime is spent
>>>>> reading and writing attributes and globals anyways?
>>>> I'd be interested as well in such data.
>>>>> Anyhow, back to parallel interpretation, Is there an issue or page
>>>>> that tracks what's needed for parallel interpretation?
>>>>> so far mentioned: gc, dict, c api
>>>>> Btw, I think that jit is more important at the moment, but time comes
>>>>> when jit juice has been mostly squeezed out ;-)
>>>> --
>>>> Paolo Giarrusso - Ph.D. Student
>>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>> --
>> Paolo Giarrusso - Ph.D. Student
>> http://www.informatik.uni-marburg.de/~pgiarrusso/

Paolo Giarrusso - Ph.D. Student
