Hi,

Somehow I'm getting exactly the same performance in threaded and non-threaded code (I would expect at least some speedup). Is threading not there in this pypy release in practice, or am I missing some argument to pypy? The test case is cpu-bound in Python, and possibly gc-bound too.
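[For context, a minimal sketch of the kind of CPU-bound test being described - hypothetical code, not the actual benchmark. Under a GIL, four threads do four times the work in roughly four times the time of one thread.]

    import threading
    import time

    def work(n=10**7):
        # pure-Python, CPU-bound loop
        total = 0
        for i in range(n):
            total += i * i
        return total

    def bench(num_threads):
        threads = [threading.Thread(target=work) for _ in range(num_threads)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

    # With a GIL, bench(4) takes ~4x as long as bench(1); with real
    # parallelism on 4 cores, the two would take about the same time.
    print("1 thread:  %.2fs" % bench(1))
    print("4 threads: %.2fs" % bench(4))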
2010/12/22 Dima Tisnek <dimaqq@gmail.com>:
Somehow I'm getting exactly the same performance in threaded and non-threaded code. Is threading not there in this pypy release in practice, or am I missing some argument to pypy?
PyPy has a GIL if that's what you mean.

-- Regards, Benjamin
Oh, I can't yet use an alternative gc that obviates the GIL then? Or am I totally confused, and pypy still uses the GIL for other reasons, e.g. globals dict safety?

On 22 December 2010 19:49, Benjamin Peterson <benjamin@python.org> wrote:
PyPy has a GIL if that's what you mean.
On Wed, Dec 22, 2010 at 7:05 PM, Benjamin Peterson <benjamin@python.org> wrote:
2010/12/22 Dima Tisnek <dimaqq@gmail.com>:
Oh, I can't yet use an alternative gc that obviates the GIL then? Or am I totally confused, and pypy still uses the GIL for other reasons, e.g. globals dict safety?
All of the above.
On the topic of parallel dict safety... I've not played around with parallel dictionaries at all, but I've heard that treaps can be parallelized nicely, and they can be (and have been) set up to look much like a Python dictionary. I admit, I coded one and put it at http://stromberg.dnsalias.org/~strombrg/treap/ - however, the implementation is not (yet?) suitable for parallel use.

It's coded in such a way that it'll work as pure Python or Cython, depending on which of two m4 symbols you define. I've used the pure Python version on pypy.

Treaps tend to make just about everything O(log(n)) and have good average performance, but they are less even performers than red-black trees (that is, they give a higher standard deviation, but a better average, than red-black trees). And they don't play that nicely with locality of reference (caches) - things are intentionally randomized. But if you throw enough cores at them, they might still be a win compared to a big hash table with a big lock wrapped around it.
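[For readers unfamiliar with the structure, a minimal treap sketch - an illustration, not Dan's implementation; see his link above for a full dict-like version. It is a binary search tree on keys whose nodes also satisfy a heap property on random priorities, which keeps the tree balanced in expectation.]

    import random

    class Node(object):
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.priority = random.random()  # random, heap-ordered
            self.left = self.right = None

    def rotate_right(n):
        l = n.left
        n.left, l.right = l.right, n
        return l

    def rotate_left(n):
        r = n.right
        n.right, r.left = r.left, n
        return r

    def insert(node, key, value):
        # ordinary BST insert; rotations then restore the heap property
        if node is None:
            return Node(key, value)
        if key == node.key:
            node.value = value
        elif key < node.key:
            node.left = insert(node.left, key, value)
            if node.left.priority > node.priority:
                node = rotate_right(node)
        else:
            node.right = insert(node.right, key, value)
            if node.right.priority > node.priority:
                node = rotate_left(node)
        return node

    def lookup(node, key):
        while node is not None:
            if key == node.key:
                return node.value
            node = node.left if key < node.key else node.right
        raise KeyError(key)

    # usage: root = insert(None, "a", 1); lookup(root, "a") -> 1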
On Thu, Dec 23, 2010 at 06:58, Dan Stromberg <drsalists@gmail.com> wrote:
But if you throw enough cores at them, they might still be a win compared to a big hash table with a big lock wrapped around it.
That might be interesting; however, the state of the art includes many concurrent hash map alternatives which are way smarter than "a lock wrapped around it". Java >= 1.5 has one, by Doug Lea*. An even better (and more complex) one, by Cliff Click*, used in production on >500 cores and presented at JavaOne, can be found by googling: nonblockinghashmap Cliff Click

A C implementation can be found at: http://code.google.com/p/nbds/

Link found through this blog post of Cliff Click's: http://www.azulsystems.com/blog/cliff-click/2008-11-13-some-source-forge-thr...

However, I don't think you can code them in today's Python, because it offers neither a Java-style volatile attribute nor compare-and-swap nor memory barriers, and these structures can't be implemented with plain locks.

Best regards

* Doug Lea wrote the memory allocator used on Linux, is behind the 1.5 java.util.concurrent package, and much more. Cliff Click is a high-performance JVM hacker.

-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
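[Python can't express those lock-free designs, since it exposes no CAS, volatile or barriers; but a sketch of the simplest intermediate point, lock striping, shows what "smarter than one big lock" means. This is only an illustration, loosely modelled on the segment locking in Doug Lea's ConcurrentHashMap: writers contend only when their keys hash to the same stripe.]

    import threading

    class StripedDict(object):
        def __init__(self, num_stripes=16):
            self._locks = [threading.Lock() for _ in range(num_stripes)]
            self._tables = [{} for _ in range(num_stripes)]

        def _stripe(self, key):
            # pick a stripe from the key's hash; % keeps it in range
            return hash(key) % len(self._tables)

        def __getitem__(self, key):
            i = self._stripe(key)
            with self._locks[i]:
                return self._tables[i][key]

        def __setitem__(self, key, value):
            i = self._stripe(key)
            with self._locks[i]:
                self._tables[i][key] = value

[The genuinely lock-free maps discussed above go much further, avoiding the locks entirely via compare-and-swap, which is exactly what Python does not expose.]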
(dan) random binary trees: O(log2(n)) is 7.2 for builtins, though the constant factor for locks, etc, might make it worthwhile

(paolo) non-blocking hash maps: memory barriers can be quite costly too

Different options will have to be implemented and tested when we get there. Speaking of which, is there a test dict load? Does anyone have some profiling data: what portion of runtime is spent reading and writing attributes and globals anyway?

While we are on the subject, is there a plan to provide different levels of sparseness for different levels of name lookup? For example globals vs builtins: the first needs to be quite sparse so that builtins show through well; the second hardly at all, because there's nowhere else to look if builtins don't have the name. The tradeoff could also be more dynamic: frequently accessed dicts could be e.g. more sparse, and rarely accessed ones more compact (obviously more sparse is not always better, e.g. in terms of cpu cache). Of course "frequently accessed" is not as easy as frequently run code; e.g. in "class A: ...; while 1: A().func()", the func lookup occurs in different objects, yet it is the same logical operation.

Come to think of it, is there any point in polymorphic dicts? E.g. attribute access could be implemented as a near-perfect compact hash map if attribute names change rarely, while regular dict()s are expected to change keys often.

Anyhow, back to parallel interpretation: is there an issue or page that tracks what's needed for parallel interpretation? So far mentioned: gc, dict, c api.

Btw, I think that jit is more important at the moment, but a time comes when the jit juice has been mostly squeezed out ;-)

d.
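[On the "test dict load" question, a crude micro-benchmark sketch - illustrative only, real data would have to come from profiling whole applications - separating the three lookup kinds discussed here:]

    import time

    g = len  # module-level global aliasing the builtin

    def bench_local(n):
        f = len  # local variable: resolved by array index, no dict lookup
        t0 = time.time()
        for _ in range(n):
            f('')
        return time.time() - t0

    def bench_global(n):
        t0 = time.time()
        for _ in range(n):
            g('')  # LOAD_GLOBAL: hits the module dict
        return time.time() - t0

    def bench_builtin(n):
        t0 = time.time()
        for _ in range(n):
            len('')  # misses the module dict, then hits the builtins dict
        return time.time() - t0

    n = 10**6
    print("local %.3fs  global %.3fs  builtin %.3fs"
          % (bench_local(n), bench_global(n), bench_builtin(n)))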
On 27 December 2010 17:00, Dima Tisnek <dimaqq@gmail.com> wrote:
while we are on the subject, is there a plan to provide different levels of sparseness for different levels of name lookup?
come to think of it, is there any point in polymorphic dicts, e.g. attribute access implemented as a near-perfect compact hash map if attribute names change rarely, while regular dict()s are expected to change keys often?
No, not really, but pypy already heavily optimises this case - see mapdicts, celldicts, and sharing dicts. Celldict is commonly used for module dictionaries; it emulates a direct pointer to the value, which can be cached alongside the LOAD_GLOBAL instruction. I implemented a similar system (a pre-computed array based on known module-global names) a few years back in pypy (pre-jit) and got a 6% speedup over regular dicts, but celldicts get a similar speedup and are more generally applicable.

As for perfect hashing: our existing mechanism for hashing beats it, hands down. In cpython at least (I haven't checked the pypy source on this topic), the hash of a string is cached on the string object itself, which means that in the case of identifiers no hash is ever computed on global lookup. The only thing that could really be faster is something like a slot on the symbol itself.

Celldicts move the synchronisation point out of the hash table and into the entry for common cases, which changes the synchronisation question significantly.
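[A toy sketch of the celldict idea just described - an illustration, not pypy's actual implementation. The dict maps names to cells; a call site that has looked a name up once can keep the cell and re-read cell.value, skipping the hash lookup while still observing later rebindings.]

    class Cell(object):
        def __init__(self, value):
            self.value = value

    class CellDict(object):
        def __init__(self):
            self._cells = {}

        def set(self, name, value):
            cell = self._cells.get(name)
            if cell is None:
                self._cells[name] = Cell(value)
            else:
                cell.value = value  # cached references see the update

        def get_cell(self, name):
            return self._cells[name]  # cache this at the LOAD_GLOBAL site

    # usage: the "inline cache" at a call site
    d = CellDict()
    d.set("x", 1)
    cell = d.get_cell("x")   # one dict lookup, done once
    assert cell.value == 1
    d.set("x", 2)            # rebinding goes through the same cell
    assert cell.value == 2   # the cached cell sees it, no further lookup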
Btw, I think that jit is more important at the moment, but time comes when jit juice has been mostly squeezed out ;-)
There are occasional memory-model discussions, but at the end of the day, the people who step up to do the work of implementing it will probably also get to do most of the design work.

-- William Leslie
On 12/27/2010 07:00 AM, Dima Tisnek wrote:
while we are on the subject, is there a plan to provide different levels of sparseness for different levels of name lookup?
come to think of it, is there any point in polymorphic dicts, e.g. attribute access implemented as a near-perfect compact hash map if attribute names change rarely, while regular dict()s are expected to change keys often?
All these thoughts go in the wrong direction, imo. The JIT removes nearly all dictionary accesses to global dicts, instance dicts and class dicts. Even without the JIT, purely interpreting things, there are caches that bypass most dict lookups.

The only case where improving dicts would help is for user-defined dictionaries, i.e. when you write dicts with curly braces. That is basically very hard, because the usage patterns vary widely, and a lot of work would be necessary to find common uses.

Cheers,

Carl Friedrich
On Mon, Dec 27, 2010 at 09:31, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
All these thoughts go into the wrong direction, imo. The JIT removes nearly all dictionary accesses to global dicts, instance and class dicts. Even without the JIT, purely interpreting things, there are caches that bypass most dict lookups.
That's very interesting. However, aren't such caches also hash maps in the end (unless you do inline caching in your interpreter, like Ruby 1.9)? I remember reading so in PyPy's docs; moreover, that's a standard optimization for method lookup.

Sharing such caches between different interpreter threads would be potentially useful, if that implied no expensive synchronization for readers - which is possible for instance on x86 (read barriers are free), by using persistent data structures. Writes to such caches should (hopefully) be rare enough to make the extra synchronization not too expensive, but this needs benchmarking.

Best regards

-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
On 12/27/2010 06:25 PM, Paolo Giarrusso wrote:
That's very interesting. However, aren't such caches also hash maps in the end (unless you do inline caching in your interpreter, like in Ruby 1.9)? I remember reading so in PyPy's docs; moreover, that's a standard optimization for method lookup.
Indeed, some of the caches are again hash maps, e.g. the method cache is a global hash map. However, since it is a cache of a fixed size, its implementation is a lot simpler than that of dictionaries. Some of the caches (like that of attribute access) are indeed inline caches and thus don't need hash maps at all. At some point the method cache was an inline cache as well, but that didn't seem to actually help much.

In general it seems that eliminating single dict lookups is rarely worth it in PyPy. We also had an inline global dict lookup cache at some point, but it didn't give an interesting enough effect to keep it. I'm talking purely about the interpreter here, of course; the jit gets rid of all these lookups anyway.
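[A sketch of such a fixed-size cache - illustrative, not pypy's actual code; a real one also needs invalidation when classes change, e.g. via version tags. It is a direct-mapped array where collisions simply overwrite, so no resizing or collision chains are ever needed: losing an entry only costs one slow lookup.]

    CACHE_SIZE = 1024  # power of two, so the modulo is a cheap mask
    _cache = [None] * CACHE_SIZE

    def lookup_method(cls, name):
        index = (id(cls) ^ hash(name)) & (CACHE_SIZE - 1)
        entry = _cache[index]
        if entry is not None and entry[0] is cls and entry[1] == name:
            return entry[2]              # hit: no dict lookup at all
        method = getattr(cls, name)      # miss: do the slow MRO walk
        _cache[index] = (cls, name, method)
        return method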
Sharing such caches between different interpreter threads would be potentially useful, if that implied no expensive synchronization for readers - which is possible for instance on x86 (read barriers are for free), by using persistent data structures. Writes to such caches should (hopefully) be rare enough to make the extra synchronization not too expensive, however this needs benchmarking.
Again, this is all idle speculation. If PyPy wants to move towards a GIL-less implementation, a *huge* number of problems would need to be solved first before such optimizations become important. We would need to fix the GC. We would need to think about the memory model of RPython in the presence of multi-CPU threading, and that of Python. At the moment I don't see that work happening, because nobody in the current core contributor group is qualified or interested.

Yes, in a sense that's not very forward-looking, given the move to multi-core. Otoh it's not clear to me that shared-memory multithreading is really the approach that makes most sense for PyPy in its current state.

Cheers,

Carl Friedrich
On Mon, Dec 27, 2010 at 07:00, Dima Tisnek <dimaqq@gmail.com> wrote:
(dan) random binary trees: O(log2(n)) is 7.2 for builtins, though the constant factor for locks, etc, might make it worthwhile
"7.2"? What is that? Do you mean 7.2=log2(no. of builtins)?
(paolo) non blocking hash maps: memory barriers can be quite costly too
Well, you just _cannot_ avoid them, whatever your fancy data structure. You can only move most of the cost (or all of it) to write operations - which is helpful if reads are most common, a case which you discuss in your mail.

== Why memory barriers are required (in more detail) ==

If you use persistent data structures, also called copy-on-write or purely functional (and that's the most obvious way to use a tree in a parallel setting), the whole synchronization problem is reduced to exchanging a pointer (to the root of the tree) between a writer and a reader thread. Let's consider this in the Java memory model (the state of the art). In this model the writer thread has to use an expensive StoreLoad barrier, i.e. mfence on x86 [1], costing more or less as much as a cache miss, while the reader needs just a cheap Load* barrier (free on x86, cheap on other processors).

If you don't use a volatile field, there is very little guarantee about what your program means, and it is almost always buggy; the equivalent in the new C++ memory model (which is otherwise very similar) gives completely undefined semantics (and even crashes are possible in practice, for instance if the reader thread sees a not-fully-initialized object because of the race and invokes a virtual method on it).

[1] http://g.oswego.edu/dl/jmm/cookbook.html
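[A sketch of that publication pattern, with the barrier points marked in comments. Python stands in for pseudocode here: pure Python cannot express volatile or fences, and under a GIL the pointer swap happens to be atomic anyway, so this only illustrates the structure.]

    class Snapshot(object):
        def __init__(self):
            self.root = {}          # in Java: a volatile field

        def read(self, key):
            root = self.root        # reader: one Load* barrier, free on x86
            return root[key]        # everything reachable is immutable

        def write(self, key, value):
            new_root = dict(self.root)   # copy the structure...
            new_root[key] = value        # ...modify the private copy...
            self.root = new_root         # ...publish: StoreLoad barrier here

[A real implementation would use a balanced tree so that the copy is O(log n) path copying rather than O(n); that is what makes persistent structures attractive for this.]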
== Other stuff ==

different options will have to be implemented and tested when we get there, speaking of which, is there a test dict load? does anyone have some profiling data, what portion of runtime is spent reading and writing attributes and globals anyways?
I'd be interested as well in such data.
Anyhow, back to parallel interpretation, Is there an issue or page that tracks what's needed for parallel interpretation? so far mentioned: gc, dict, c api
Btw, I think that jit is more important at the moment, but time comes when jit juice has been mostly squeezed out ;-)
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
what do you think of liburcu? lttng.org/urcu

p.s. if/when I get around to profiling pypy, which is something I want to do, I'll be sure to share the results here.

d.
On Mon, Dec 27, 2010 at 19:54, Dima Tisnek <dimaqq@gmail.com> wrote:
what do you think of liburcu? lttng.org/urcu
I've studied RCU in the Linux kernel (from which URCU derives) and thought for a long time about using it for this problem. In short, once you have GC, RCU (as in the Linux kernel) becomes (almost) trivial, because RCU is almost entirely about how to delay freeing objects. The RCU-GC connection is also mentioned by Documentation/RCU/RTFP.txt in the Linux sources.

So, you just need to read about persistent data structures, and remember that persistence is not the point - data structures which are merely almost persistent can also be OK, if you get them right. If you replace a value in a dictionary, for instance, you don't necessarily need to keep the old version around, as long as the value fits in a memory word so that it can be atomically replaced. http://hackerboss.com/copy-on-write-101-part-2-putting-it-to-use/

Speaking of libraries, you might want to look at this for the atomic primitives you'll need: www.hpl.hp.com/research/linux/atomic_ops/ I'm more used to the primitives of the Linux kernel, and I remember I didn't fully like that library, but IIRC it was a matter of taste - the library is from Hans Boehm, which makes it trustworthy.
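[To make the RCU-GC connection concrete, a sketch under ordinary Python semantics: a reader that grabbed the old version keeps it alive through its own reference, so the garbage collector plays the role of RCU's grace period and no explicit synchronize_rcu() is needed.]

    shared = {"a": 1}

    def reader():
        snapshot = shared       # like rcu_read_lock + dereference
        # ... arbitrarily long computation over `snapshot` ...
        return snapshot["a"]    # still sees the old, consistent version

    def writer():
        global shared
        updated = dict(shared)
        updated["a"] = 2
        shared = updated        # old dict is reclaimed by the GC only
                                # once the last reader drops its reference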
p.s. if/when I get around to profiling pypy, which is something I want to do, I'll be sure to share the results here.
Great, thanks!
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
Thanks for the RTFP pointer - the Triplett ref, hash tables via relativistic programming, is really interesting! I think I'll have to read it again and again to get everything out of it...

This brings me to a new question: how much is pypy's Python memory model allowed to diverge from that of cpython? For example, dict lookups, etc, could be relativistic in the absence of explicit synchronization; that is, dict updates are not guaranteed to propagate [fast] until [some] thread acquires a lock [or something]. Of course the order of dict mutation has to be preserved.

Would this be a worthwhile departure from the GIL-based monotonic world? That is, thread-safe code that uses locks or some other mechanism [known to pypy] would see operations timely and in order, while code that doesn't synchronize its operations may see somewhat outdated shared state, as if threads executed in a slightly different global order.

About profiling, how do I distinguish between attribute vs globals vs builtins lookups in the profile?

d.
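[A litmus test for the proposed relaxed semantics - hypothetical, not what pypy implements. Under the GIL both readers always see data == 42 once flag is set; under the relaxed model only the synchronizing reader would be guaranteed that.]

    import threading

    data = 0
    flag = False
    lock = threading.Lock()

    def writer():
        global data, flag
        with lock:
            data = 42
            flag = True

    def racy_reader():
        # No synchronization: a relaxed model may let this observe
        # flag == True while still seeing the outdated data == 0.
        return data if flag else None

    def synced_reader():
        # Acquiring the lock the writer released makes the writes
        # visible in order: flag == True here implies data == 42.
        with lock:
            return data if flag else None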
On Tue, Dec 28, 2010 at 02:47, Dima Tisnek <dimaqq@gmail.com> wrote:
Thanks for the RTFP pointer, the Triplett ref, hash tab via rt is really interesting!
I think I'll have to read it again and again to get all out of it...
Ah, I remember that! I never found enough time to read up on it... I also remember that I didn't like it at the time, because it seemed to give a new name to existing design practice in the RCU community. I didn't realize that even if the article just described existing work, the generalization alone might be an important contribution.
This brings me to a new question: how much is pypy python memory model allowed to diverge from that of cpython?
In fact, Jython and IronPython already have different memory models: code.google.com/p/unladen-swallow/wiki/MemoryModel

But the most important answer is that to avoid reinventing the wheel, one needs to know the state of the art, i.e. the work by Sarita Adve, Jeremy Manson, William Pugh and Hans Boehm on memory models for Java and C++. The simplest introduction I know of is this blog post by Jeremy Manson: http://jeremymanson.blogspot.com/2008/11/what-volatile-means-in-java.html

As a matter of fact, this work is unsurprisingly often totally ignored in discussions about Python's memory model. I'm not surprised because it's complex stuff, and understanding it doesn't matter for today's Python programming.

The key concept of these models is that _properly synchronized_ programs get "sequentially consistent" semantics: the program behaves as if all memory operations were interleaved in a total order on a single coherent memory, without reorderings due to caches, compiler optimizations and so on. Sequential consistency is the key property of what you call the "GIL-based monotonic world".

Exactly as you suggested for dictionaries, but more generally, programs which are not properly synchronized (which exhibit so-called data races) are more or less left on their own. On this topic the Java and C++ models differ, which leaves a design space open - but one that is already somewhat smaller, and that doesn't affect the semantics of properly synchronized programs. Java tries to define the semantics of data races; C++ just gives up.

* Unlike C++ and like Java, it is probably important to guarantee that programs with data races (not properly synchronized memory accesses) don't crash the interpreter or corrupt native memory. You don't want native code in the interpreter to crash or loop if the programmer forgets synchronization - simply because that's very difficult to debug, short of running a debugger on the interpreter itself (or fixing the code without a debugger at all). It is not clear, though, whether one should give specific semantics to programs with data races beyond ensuring the interpreter doesn't crash: basically, it looks too complex. For whoever understands the rest of the Java Memory Model, this blog post explains the semantics they use to model this: http://jeremymanson.blogspot.com/2007/08/causality-and-java-memory-model.htm...

* Another difference between Java and C++ is about security guarantees: it is crucial in Java that untrusted code cannot read sensitive data (e.g., a password) managed by trusted code, not even through a data race. Some kinds of behavior would allow "out-of-thin-air" values, which in practice might be such sensitive data, if memory is zeroed by the memory allocator but one thread still sees the old value.

I believe the most up-to-date introduction, with a useful bibliography, is by S. Adve and H. Boehm [1]. Then you will probably want to read [2] and [3] about C++ and Java. Through Google Scholar, you should find copies of all of this on CiteSeerX.

[1] Sarita V. Adve and Hans-J. Boehm. 2010. Memory models: a case for rethinking parallel languages and hardware. Communications of the ACM. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.158.4288
[2] Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ concurrency memory model. PLDI '08.
[3] Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java memory model. POPL '05.
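[The properly-synchronized vs data-race distinction, in runnable form - a sketch. The racy version loses updates even under today's GIL, because the read-modify-write spans several bytecodes; the interpreter itself stays safe in both cases, which is the guarantee discussed above.]

    import threading

    def run(increment, n_threads=4, n_iters=100000):
        counter = {"count": 0}
        lock = threading.Lock()
        threads = [threading.Thread(target=increment,
                                    args=(counter, lock, n_iters))
                   for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counter["count"]

    def racy(counter, lock, n):
        for _ in range(n):
            counter["count"] += 1      # data race: load, add, store interleave

    def synced(counter, lock, n):
        for _ in range(n):
            with lock:
                counter["count"] += 1  # properly synchronized

    print(run(racy))    # often less than 400000: updates can be lost
    print(run(synced))  # always 400000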
For example, dict lookups, etc, could be relativistic in the absense of explicit synchronization, that is, dict updates are not quaranteed to propagate [fast] until [some] therad acquires a lock [or something].
Talking about dictionaries restricts the scope unnecessarily. You want to relax these requirements for memory operations in general.
Would this be a worthwhile departure from GIL-based monotonic world?
Years of research agree the answer is yes, no doubt. Otherwise several fundamental optimizations become impossible - including the use of memory caches. To take a simple example, releasing a lock requires flushing pending writes (i.e. a cache miss, ~100 cycles nowadays). Implementing sequential consistency (what you term the "monotonic world") means doing that after _each_ memory write. If you read Adve and Boehm [1], they elaborate more on the topic.
That is thread-safe code, that uses locks or some other mechanism [known to pypy] would see operations timely and in order, while code that doesn't synchronize its operations may see somewhat outdated shared state, as if threads executed in slightly different global order.
Yes. However, the programmer can't and shouldn't synchronize, say, field lookup, so those dictionaries (and only those) need to be thread-safe. Jython uses a ConcurrentHashMap for this purpose, even though that's more thread-safety than needed (there need be no thread-safety guarantee on the values of fields).
About profiling, how do I distinguish between attribute vs globals vs builtins lookups in the profile?
Others can surely answer better, since I don't know the code; I guess I'd add a "type" field to the dictionary to store such information, or maybe to the call sites, depending on what's easier to annotate; but I also guess this naive solution could require lots of work.
-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
Hi Paolo,

Although I'm sure that your post is very interesting with the right mindset, I have a problem understanding the connection with Python. I know that this is exactly what you are complaining about, i.e. that Python developers appear not to care much about this deep Java/C++ issue. However, if you mean to do anything about it, you need to first understand yourself why it is so -- and the following reason is by far not enough:
As a matter of fact, this work is unsurprisingly often totally ignored in many discussions about Python's memory model. I'm not surprised because it's complex stuff, and understanding it doesn't matter for today's Python programming.
The most obvious misunderstanding, in my opinion, is that in Java or C++ it's about field reads and writes, whereas in Python it's about any operation on any built-in type -- which can be, for example, reading or writing attributes, or insert()ing in a list, or doing setdefault() on a dictionary, or... any complex operation. This means that you cannot *at all* reduce the problem to field reads and field writes. As long as the discussion at the level of Java or C++ is only about fields, it's going to be ignored as "mostly uninteresting" by the Python folks.

On the other hand, the links you posted about nonblockinghashmap are indeed interesting in this context. From a Python point of view, what is left to discuss is whether these hash maps offer enough "consistency" behavior to be usable in Python's default memory model. Or maybe it's not that interesting any more, now that PyPy tends not to use a lot of dictionaries any more (the attributes of objects are not in a hash map, internally).

I am myself just expressing vague, uninformed opinions again. But the final point is that this really needs someone motivated to experiment with PyPy (or CPython), and as long as no-one shows up, it will mostly be just "moving air around", i.e. wasting time, to discuss this too much in depth.

A bientôt,

Armin.
participants (8)
- Antonio Cuni
- Armin Rigo
- Benjamin Peterson
- Carl Friedrich Bolz
- Dan Stromberg
- Dima Tisnek
- Paolo Giarrusso
- William ML Leslie