[pypy-commit] extradoc extradoc: Finish the slides

arigo noreply at buildbot.pypy.org
Wed Jul 23 08:02:34 CEST 2014


Author: Armin Rigo <arigo at tunes.org>
Branch: extradoc
Changeset: r5372:3c5ba3e46ec5
Date: 2014-07-23 08:02 +0200
http://bitbucket.org/pypy/extradoc/changeset/3c5ba3e46ec5/

Log:	Finish the slides

diff --git a/talk/ep2014/stm/talk.html b/talk/ep2014/stm/talk.html
--- a/talk/ep2014/stm/talk.html
+++ b/talk/ep2014/stm/talk.html
@@ -502,40 +502,64 @@
 </li>
 </ul>
 </div>
-<div class="slide" id="big-point">
-<h1>Big Point</h1>
+<div class="slide" id="pypy-stm">
+<h1>PyPy-STM</h1>
 <ul class="simple">
-<li>application-level locks still needed...</li>
+<li>implementation of a specially-tailored STM ("hard" part):<ul>
+<li>a reusable C library</li>
+<li>called STMGC-C7</li>
+</ul>
+</li>
+<li>used in PyPy to replace the GIL ("easy" part)</li>
+<li>could also be used in CPython<ul>
+<li>but refcounting needs replacing</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="slide" id="how-does-it-work">
+<h1>How does it work?</h1>
+<object data="fig4.svg" type="image/svg+xml">
+fig4.svg</object>
+</div>
+<div class="slide" id="demo">
+<h1>Demo</h1>
+<ul class="simple">
+<li>counting primes</li>
+</ul>
+</div>
+<div class="slide" id="long-transactions">
+<h1>Long Transactions</h1>
+<ul class="simple">
+<li>threads and application-level locks still needed...</li>
 <li>but <em>can be very coarse:</em><ul>
-<li>even two big transactions can optimistically run in parallel</li>
+<li>two transactions can optimistically run in parallel</li>
 <li>even if they both <em>acquire and release the same lock</em></li>
+<li>internally, transaction lengths are driven by the locks we acquire</li>
 </ul>
 </li>
 </ul>
 </div>
 <div class="slide" id="id2">
-<h1>Big Point</h1>
+<h1>Long Transactions</h1>
 <object data="fig4.svg" type="image/svg+xml">
 fig4.svg</object>
 </div>
-<div class="slide" id="demo-1">
-<h1>Demo 1</h1>
+<div class="slide" id="id3">
+<h1>Demo</h1>
 <ul class="simple">
-<li>"Twisted apps made parallel out of the box"</li>
 <li>Bottle web server</li>
 </ul>
 </div>
-<div class="slide" id="pypy-stm">
-<h1>PyPy-STM</h1>
+<div class="slide" id="pypy-stm-programming-model">
+<h1>PyPy-STM Programming Model</h1>
 <ul class="simple">
-<li>implementation of a specially-tailored STM:<ul>
-<li>a reusable C library</li>
-<li>called STMGC-C7</li>
-</ul>
-</li>
-<li>used in PyPy to replace the GIL</li>
-<li>could also be used in CPython<ul>
-<li>but refcounting needs replacing</li>
+<li>threads-and-locks, fully compatible with the GIL</li>
+<li>this is not "everybody should use careful explicit threading
+with all the locking issues"</li>
+<li>instead, PyPy-STM pushes forward:<ul>
+<li>use a thread pool library</li>
+<li>coarse locking, inside that library only</li>
 </ul>
 </li>
 </ul>
@@ -546,7 +570,7 @@
 <li>current status:<ul>
 <li>basics work</li>
 <li>best case 25-40% overhead (much better than originally planned)</li>
-<li>parallelizing user locks not done yet</li>
+<li>parallelizing user locks not done yet (see "with atomic")</li>
 <li>tons of things to improve</li>
 <li>tons of things to improve</li>
 <li>tons of things to improve</li>
@@ -558,52 +582,113 @@
 </li>
 </ul>
 </div>
-<div class="slide" id="demo-2">
-<h1>Demo 2</h1>
+<div class="slide" id="summary-benefits">
+<h1>Summary: Benefits</h1>
 <ul class="simple">
-<li>counting primes</li>
-</ul>
-</div>
-<div class="slide" id="benefits">
-<h1>Benefits</h1>
-<ul class="simple">
-<li>Keep locks coarse-grained</li>
 <li>Potential to enable parallelism:<ul>
-<li>in CPU-bound multithreaded programs</li>
+<li>in any CPU-bound multithreaded program</li>
 <li>or as a replacement of <tt class="docutils literal">multiprocessing</tt></li>
 <li>but also in existing applications not written for that</li>
 <li>as long as they do multiple things that are "often independent"</li>
 </ul>
 </li>
+<li>Keep locks coarse-grained</li>
 </ul>
 </div>
-<div class="slide" id="issues">
-<h1>Issues</h1>
+<div class="slide" id="summary-issues">
+<h1>Summary: Issues</h1>
 <ul class="simple">
-<li>Performance hit: 25-40% everywhere (may be ok)</li>
 <li>Keep locks coarse-grained:<ul>
 <li>but in case of systematic conflicts, performance is bad again</li>
 <li>need to track and fix them</li>
-<li>need tool support (debugger/profiler)</li>
+<li>need tools to support this (debugger/profiler)</li>
 </ul>
 </li>
+<li>Performance hit: 25-40% over a plain PyPy-JIT (may be ok)</li>
 </ul>
 </div>
-<div class="slide" id="summary">
-<h1>Summary</h1>
+<div class="slide" id="summary-pypy-stm">
+<h1>Summary: PyPy-STM</h1>
 <ul class="simple">
-<li>Transactional Memory is still too researchy for production</li>
-<li>But it has the potential to enable "easier parallelism"</li>
+<li>Not production-ready</li>
+<li>But it has the potential to enable "easier parallelism for everybody"</li>
 <li>Still alpha but slowly getting there!<ul>
 <li>see <a class="reference external" href="http://morepypy.blogspot.com/">http://morepypy.blogspot.com/</a></li>
 </ul>
 </li>
+<li>Crowdfunding!<ul>
+<li>see <a class="reference external" href="http://pypy.org/">http://pypy.org/</a></li>
+</ul>
+</li>
 </ul>
 </div>
 <div class="slide" id="part-2-under-the-hood">
 <h1>Part 2 - Under The Hood</h1>
 <p><strong>STMGC-C7</strong></p>
 </div>
+<div class="slide" id="overview">
+<h1>Overview</h1>
+<ul class="simple">
+<li>Say we want to run N = 2 threads</li>
+<li>We reserve twice the memory</li>
+<li>Thread 1 reads/writes "memory segment" 1</li>
+<li>Thread 2 reads/writes "memory segment" 2</li>
+<li>Upon commit, we (try to) copy the changes to the other segment</li>
+</ul>
+</div>
+<div class="slide" id="trick-1">
+<h1>Trick #1</h1>
+<ul class="simple">
+<li>Objects contain pointers to each other</li>
+<li>These pointers are relative instead of absolute:<ul>
+<li>accessed as if they were "thread-local data"</li>
+<li>the x86 has a zero-cost way to do that (<tt class="docutils literal">%fs</tt>, <tt class="docutils literal">%gs</tt>)</li>
+<li>supported in clang (not gcc so far)</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="slide" id="trick-2">
+<h1>Trick #2</h1>
+<ul class="simple">
+<li>With Trick #1, most objects are exactly identical in all N segments:<ul>
+<li>so we share the memory</li>
+<li><tt class="docutils literal">mmap() MAP_SHARED</tt></li>
+<li>actual memory usage is multiplied by much less than N</li>
+</ul>
+</li>
+<li>Newly allocated objects are directly in shared pages:<ul>
+<li>we don't actually need to copy <em>all new objects</em> at commit,
+but only the few <em>old objects</em> modified</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="slide" id="barriers">
+<h1>Barriers</h1>
+<ul class="simple">
+<li>Need to record all reads and writes done by a transaction</li>
+<li>Extremely cheap way to do that:<ul>
+<li><em>Read:</em> set a flag in thread-local memory (one byte)</li>
+<li><em>Write</em> into a newly allocated object: nothing to do</li>
+<li><em>Write</em> into an old object: add the object to a list</li>
+</ul>
+</li>
+<li>Commit: for each object in that list, check whether its
+read flag is set in some other thread</li>
+</ul>
+</div>
+<div class="slide" id="id4">
+<h1>...</h1>
+</div>
+<div class="slide" id="thank-you">
+<h1>Thank You</h1>
+<ul class="simple">
+<li><a class="reference external" href="http://morepypy.blogspot.com/">http://morepypy.blogspot.com/</a></li>
+<li><a class="reference external" href="http://pypy.org/">http://pypy.org/</a></li>
+<li>irc: <tt class="docutils literal">#pypy</tt> on freenode.net</li>
+</ul>
+</div>
 </div>
 </body>
 </html>
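The slides only name the "counting primes" demo; as a hedged illustration (all names here are hypothetical, not the actual demo code), it could look roughly like this — on CPython the GIL serializes the threads, and the point of pypy-stm is that the same threads-and-locks code can run in parallel:

```python
# Hypothetical sketch of a "counting primes" demo: several threads each
# count the primes in a slice of the range, then the totals are summed.
import threading

def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(start, stop, results, index):
    # Each thread writes only its own slot, so no lock is needed here.
    results[index] = sum(1 for n in range(start, stop) if is_prime(n))

def parallel_prime_count(limit, nthreads=4):
    chunk = limit // nthreads
    results = [0] * nthreads
    threads = [
        threading.Thread(target=count_primes,
                         args=(i * chunk,
                               limit if i == nthreads - 1 else (i + 1) * chunk,
                               results, i))
        for i in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

print(parallel_prime_count(100))  # 25 primes below 100
```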
diff --git a/talk/ep2014/stm/talk.rst b/talk/ep2014/stm/talk.rst
--- a/talk/ep2014/stm/talk.rst
+++ b/talk/ep2014/stm/talk.rst
@@ -153,8 +153,8 @@
   - but refcounting needs replacing
 
 
-Commits
----------
+How does it work?
+-----------------
 
 .. image:: fig4.svg
 
@@ -165,17 +165,19 @@
 * counting primes
 
 
-Big Point
+Long Transactions
 ----------------------------
 
-* application-level locks still needed...
+* threads and application-level locks still needed...
 
 * but *can be very coarse:*
 
-  - even two big transactions can optimistically run in parallel
+  - two transactions can optimistically run in parallel
 
   - even if they both *acquire and release the same lock*
 
+  - internally, transaction lengths are driven by the locks we acquire
+
 
 Long Transactions
 -----------------
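The "Long Transactions" slides argue that application-level locks can stay very coarse; a minimal Python sketch of that style (hypothetical workload — under pypy-stm, the two threads' critical sections can run as optimistically parallel transactions even though they acquire and release the same lock, because they touch independent data):

```python
# Sketch of the coarse-locking style the slides advocate (hypothetical
# workload, not code from this changeset).  Under plain CPython the single
# lock serializes the critical sections; under pypy-stm they may run in
# parallel as long as they do not conflict.
import threading

big_lock = threading.Lock()      # one coarse application-level lock
counters = {"a": 0, "b": 0}

def work(key, times):
    for _ in range(times):
        with big_lock:           # very coarse critical section
            counters[key] += 1   # the two threads touch independent keys

t1 = threading.Thread(target=work, args=("a", 1000))
t2 = threading.Thread(target=work, args=("b", 1000))
t1.start(); t2.start(); t1.join(); t2.join()
print(counters)  # {'a': 1000, 'b': 1000}
```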
@@ -211,7 +213,7 @@
 
   - basics work
   - best case 25-40% overhead (much better than originally planned)
-  - parallelizing user locks not done yet (see ``with atomic``)
+  - parallelizing user locks not done yet (see "with atomic")
   - tons of things to improve
   - tons of things to improve
   - tons of things to improve
@@ -224,8 +226,6 @@
 Summary: Benefits
 -----------------
 
-* Keep locks coarse-grained
-
 * Potential to enable parallelism:
 
   - in any CPU-bound multithreaded program
@@ -236,6 +236,8 @@
 
   - as long as they do multiple things that are "often independent"
 
+* Keep locks coarse-grained
+
 
 Summary: Issues
 ---------------
@@ -248,7 +250,7 @@
 
  - need tools to support this (debugger/profiler)
 
-* Performance hit: 25-40% everywhere (may be ok)
+* Performance hit: 25-40% over a plain PyPy-JIT (may be ok)
 
 
 Summary: PyPy-STM
@@ -256,12 +258,16 @@
 
 * Not production-ready
 
-* But it has the potential to enable "easier parallelism"
+* But it has the potential to enable "easier parallelism for everybody"
 
 * Still alpha but slowly getting there!
 
   - see http://morepypy.blogspot.com/
 
+* Crowdfunding!
+
+  - see http://pypy.org/
+
 
 Part 2 - Under The Hood
 -----------------------
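The "Overview" and "Trick #2" slides rely on ``mmap() MAP_SHARED`` to keep identical pages shared between the N segments. A standalone Python sketch (illustration only, not STMGC-C7 code) of the underlying property — two mappings of the same file are backed by the same physical pages, so a write through one view is immediately visible through the other:

```python
# Two mmap views of one file share physical pages (MAP_SHARED is the
# default for file-backed mappings on Unix).  This is the mechanism that
# lets identical pages across segments be shared instead of duplicated.
import mmap
import os
import tempfile

def shared_view_demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * mmap.PAGESIZE)   # one page of zeroes
        view1 = mmap.mmap(fd, mmap.PAGESIZE)    # first view
        view2 = mmap.mmap(fd, mmap.PAGESIZE)    # second view, same pages
        view1[0:5] = b"hello"                   # write through view 1
        result = view2[0:5]                     # read through view 2
        view1.close(); view2.close()
        return result
    finally:
        os.close(fd)
        os.unlink(path)

print(shared_view_demo())  # b'hello': both views see the same memory
```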
@@ -272,7 +278,7 @@
 Overview
 --------
 
-* Say we want to run two threads
+* Say we want to run N = 2 threads
 
 * We reserve twice the memory
 
@@ -290,16 +296,56 @@
 
 * These pointers are relative instead of absolute:
 
-  - 
+  - accessed as if they were "thread-local data"
 
+  - the x86 has a zero-cost way to do that (``%fs``, ``%gs``)
 
-Trick #1
+  - supported in clang (not gcc so far)
+
+
+Trick #2
 --------
 
-* Most objects are the same in all segments:
+* With Trick #1, most objects are exactly identical in all N segments:
 
   - so we share the memory
   
-  - ``mmap() MAP_SHARED`` trickery
+  - ``mmap() MAP_SHARED``
 
+  - actual memory usage is multiplied by much less than N
 
+* Newly allocated objects are directly in shared pages:
+    
+  - we don't actually need to copy *all new objects* at commit,
+    but only the few *old objects* modified
+
+
+Barriers
+--------
+
+* Need to record all reads and writes done by a transaction
+
+* Extremely cheap way to do that:
+
+  - *Read:* set a flag in thread-local memory (one byte)
+
+  - *Write* into a newly allocated object: nothing to do
+
+  - *Write* into an old object: add the object to a list
+
+* Commit: for each object in that list, check whether its
+  read flag is set in some other thread
+
+
+...
+-------------------
+
+
+Thank You
+---------
+
+* http://morepypy.blogspot.com/
+
+* http://pypy.org/
+
+* irc: ``#pypy`` on freenode.net
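The "Barriers" slide compresses the bookkeeping into a read flag and a write list; a toy Python model of that commit check (pure illustration, not the STMGC-C7 implementation):

```python
# Toy model of the read/write barriers described in the slides: each
# transaction records a per-object read flag and a list of old objects it
# wrote; at commit we check every written object against the read flags
# of the other transactions.
class ToyTransaction:
    def __init__(self):
        self.read_set = set()    # stands in for the per-object read flags
        self.write_list = []     # old objects modified by this transaction

    def read_barrier(self, obj):
        self.read_set.add(id(obj))   # "set a flag": extremely cheap

    def write_barrier(self, obj, is_new):
        if not is_new:               # writes to new objects need no record
            self.write_list.append(obj)

def can_commit(tx, other_transactions):
    # Conflict: some other transaction read an old object we modified.
    return not any(id(obj) in other.read_set
                   for obj in tx.write_list
                   for other in other_transactions)

tx_reader = ToyTransaction()
tx_writer = ToyTransaction()
shared = object()                  # an "old" object both transactions touch
tx_reader.read_barrier(shared)
tx_writer.write_barrier(shared, is_new=False)
print(can_commit(tx_writer, [tx_reader]))  # False: conflict detected
```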

