[pypy-commit] extradoc extradoc: merge stm-edit (thanks matti)

arigo noreply at buildbot.pypy.org
Tue Apr 8 10:40:48 CEST 2014


Author: Armin Rigo <arigo at tunes.org>
Branch: extradoc
Changeset: r5192:d39f9507bbbe
Date: 2014-04-08 10:40 +0200
http://bitbucket.org/pypy/extradoc/changeset/d39f9507bbbe/

Log:	merge stm-edit (thanks matti)

diff --git a/planning/tmdonate2.txt b/planning/tmdonate2.txt
--- a/planning/tmdonate2.txt
+++ b/planning/tmdonate2.txt
@@ -49,36 +49,36 @@
 they can use the existing ``threading`` module, with its associated GIL
 and the complexities of real multi-threaded programming (locks,
 deadlocks, races, etc.), which make this solution less attractive.  The
-big alternative is for them to rely on one of various multi-process
-solutions that are outside the scope of the core language. All of them require a
-big restructuring of the program and often need extreme care and extra
+most attractive alternative for many developers is to rely on one of the various multi-process
+solutions that are outside the scope of the core Python language. All of them require a
+major restructuring of the program and often need extreme care and extra
 knowledge to use them.
 
-The aim of this series of proposals is to research and implement
+We propose an implementation of
 Transactional Memory in PyPy.  This is a technique that recently came to
 the forefront of the multi-core scene.  It promises to offer multi-core CPU
-usage without requiring to fall back to the multi-process solutions
-described above, and also should allow to change the core of the event systems
-mentioned above to enable the use of multiple cores without the explicit use of
-the ``threading`` module by the user.
+usage in a single process.
+In particular, by modifying the core of the event systems
+mentioned above, we will enable the use of multiple cores, without the
+user needing to explicitly use the ``threading`` module.
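+
+As a rough illustration of the intended programming model (the
+``transaction`` module name and the exact API shown here are only a
+sketch of what we have in mind, not a final interface)::
+
+    import transaction        # hypothetical module provided by PyPy-TM
+
+    def handle_request(request):
+        # ordinary sequential Python code, written without locks
+        print("handling", request)
+
+    for request in ["a", "b", "c"]:
+        # queue independent pieces of work; they behave as if run one
+        # after the other, in this order
+        transaction.add(handle_request, request)
+
+    # execute the queued pieces, internally spread over multiple cores
+    transaction.run()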
 
 The first proposal was launched near the start of 2012 and has covered
-the fundamental research part, up to the point of getting a first
+much of the fundamental research, up to the point of getting a first
 version of PyPy working in a very roughly reasonable state (after
 collecting about USD$27'000, which is little more than half of the money
-that was asked; hence the present second call for donations).
+that was sought; hence the present second call for donations).
 
-This second proposal aims at fixing the remaining issues until we get a
-really good GIL-free PyPy (described in `goal 1`_ below); and then we
-will focus on the various new features needed to actually use multiple
+We now propose fixing the remaining issues in order to obtain a
+really good GIL-free PyPy (described in `goal 1`_ below). We
+will then focus on the various new features needed to actually use multiple
 cores without explicitly using multithreading (`goal 2`_ below), up to
-and including adapting some existing framework libraries like for
+and including adapting some existing framework libraries, for
 example Twisted, Tornado, Stackless, or gevent (`goal 3`_ below).
 
 
 
-In more details
-===============
+In more detail
+==============
 
 This is a call for financial help in implementing a version of PyPy able
 to use multiple processors in a single process, called PyPy-TM; and
@@ -87,16 +87,17 @@
 Armin Rigo and Remi Meier and possibly others.
 
 We currently estimate the final performance goal to be a slow-down of
-25% to 40%, i.e. running a fully serial application would take between
-1.25 and 1.40x the time it takes in a regular PyPy.  (This goal has
+25% to 40% from the current non-TM PyPy; i.e. running a fully serial application would take between
+1.25 and 1.40x the time it takes in a regular PyPy.  This goal has
 been reached already in some cases, but we need to make this result more
-broadly applicable.)  We feel confident that it can work, in the
-following sense: the performance of PyPy-TM running any suitable
+broadly applicable.  We feel confident that we can reach this goal more
+generally: the performance of PyPy-TM running any suitable
 application should scale linearly or close-to-linearly with the number
 of processors.  This means that starting with two cores, such
-applications should perform better than in a regular PyPy.  (All numbers
+applications should perform better than on a non-TM PyPy.  (All numbers
 presented here are comparing different versions of PyPy which all have
-the JIT enabled.)
+the JIT enabled.  A "suitable application" is one without many conflicts;
+see `goal 2`_.)
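+
+For instance, assuming the worst-case 1.40x slow-down and perfectly
+linear scaling (an illustrative calculation only, not a measurement)::
+
+    time on 1 core  = 1.40 x the time of a regular PyPy
+    time on 2 cores = 1.40 / 2 = 0.70 x   (already faster than a regular PyPy)
+    time on 4 cores = 1.40 / 4 = 0.35 x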
 
 You will find below a sketch of the `work plan`_.  If more money than
 requested is collected, then the excess will be entered into the general
@@ -148,8 +149,8 @@
 Software Transactional Memory (STM) library currently used inside PyPy
 with a much smaller Hardware Transactional Memory (HTM) library based on
 hardware features and running on Haswell-generation processors.  This
-has been attempted by Remi Meier recently.  However, it seems that we
-see scaling problems (as we expected them): the current generation of HTM
+has been attempted by Remi Meier recently.  However, it seems that it
+fails to scale as we would expect it to: the current generation of HTM
 processors is limited to run small-scale transactions.  Even the default
 transaction size used in PyPy-STM is often too much for HTM; and
 reducing this size increases overhead without completely solving the
@@ -162,15 +163,15 @@
 generally.  A CPU with support for the virtual memory described in this
 paper would certainly be better for running PyPy-HTM.
 
-Another issue is sub-cache-line false conflicts (conflicts caused by two
+Another issue in HTM is sub-cache-line false conflicts (conflicts caused by two
 independent objects that happen to live in the same cache line, which
 is usually 64 bytes).  This is in contrast with the current PyPy-STM,
 which doesn't have false conflicts of this kind at all and might thus be
-ultimately better for very-long-running transactions.  None of the
-papers we know of discusses this issue.
+ultimately better for very-long-running transactions.  We are not aware of
+published research discussing issues of sub-cache-line false conflicts.
 
 Note that right now PyPy-STM has false conflicts within the same object,
-e.g. within a list or a dictionary; but we can more easily do something
+e.g. within a list or a dictionary; but we can easily do something
 about it (see `goal 2`_).  Also, it might be possible in PyPy-HTM to
 arrange objects in memory ahead of time so that such conflicts are very
 rare; but we will never get a rate of exactly 0%, which might be
@@ -179,22 +180,23 @@
 .. _`Virtualizing Transactional Memory`: http://pages.cs.wisc.edu/~isca2005/papers/08A-02.PDF
 
 
-Why do it with PyPy instead of CPython?
+Why do TM with PyPy instead of CPython?
 ---------------------------------------
 
 While there have been early experiments on Hardware Transactional Memory
 with CPython (`Riley and Zilles (2006)`__, `Tabba (2010)`__), there has
-been no recent one.  The closest is an attempt using `Haswell on the
+been none in the past few years.  To the best of our knowledge,
+the closest is an attempt using `Haswell on the
 Ruby interpreter`__.  None of these attempts tries to do the same using
 Software Transactional Memory.  We would nowadays consider it possible
 to adapt our stmgc-c7 library for CPython, but it would be a lot of
-work, starting from changing the reference-counting scheme.  PyPy is
+work, starting from changing the reference-counting garbage collection scheme.  PyPy is
 better designed to be open to this kind of research.
 
-But the best argument from an external point of view is probably that
-PyPy has got a JIT to start with.  It is thus starting from a better
-position in terms of performance, particularly for the long-running kind
-of programs that we target here.
+However, the best argument from an objective point of view is probably
+that PyPy has already implemented a Just-in-Time compiler.  It is thus
+starting from a better position in terms of performance, particularly
+for the long-running kind of programs that we target here.
 
 .. __: http://sabi.net/nriley/pubs/dls6-riley.pdf
 .. __: http://www.cs.auckland.ac.nz/~fuad/parpycan.pdf
@@ -207,7 +209,7 @@
 PyPy-TM will be slower than judicious usage of existing alternatives,
 based on multiple processes that communicate with each other in one way
 or another.  The counter-argument is that TM is not only a cleaner
-solution: there are cases in which it is not doable to organize (or
+solution: there are cases in which it is not really possible to organize (or
 retrofit) an existing program into the particular format needed for the
 alternatives.  In particular, small quickly-written programs don't need
 the additional baggage of cross-process communication; and large
@@ -217,35 +219,35 @@
 rest of the program should work without changes.
 
 
-Other platforms than the x86-64 Linux
+Platforms other than x86-64 Linux
 -------------------------------------
 
-The first thing to note is that the current solution depends on having a
-huge address space available.  If it were to be ported to any 32-bit
-architecture, the limitation to 2GB or 4GB of address space would become
-very restrictive: the way it works right now would further divide this
+The current solution depends on having a
+huge address space available.  Porting to any 32-bit
+architecture would quickly run into the limitation of a 2GB or 4GB
+address space.  The way TM works right now would further divide this
 limit by N+1, where N is the number of segments.  It might be possible
 to create partially different memory views for multiple threads that
-each access the same range of addresses; this would require extensions
-that are very OS-specific.  We didn't investigate so far.
+each access the same range of addresses; but this would likely require
+changes inside the OS.  We have not investigated this so far.
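+
+As an illustration of the division by N+1 mentioned above (example
+numbers only)::
+
+    32-bit address space:  4 GB, with N = 4 segments
+    per-segment share:     4 GB / (N+1) = 4 GB / 5  ~  800 MB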
 
-The current version, which thus only works on 64-bit, still relies
+The current 64-bit version relies
 heavily on Linux- and clang-only features.  We believe it is a suitable
 restriction: a lot of multi- and many-core servers commonly available
 are nowadays x86-64 machines running Linux.  Nevertheless, non-Linux
 solutions appear to be possible as well.  OS/X (and likely the various
 BSDs) seems to handle ``mmap()`` better than Linux does, and can remap
 individual pages of an existing mapping to various pages without hitting
-a limit of 65536 like Linux.  Windows might also have a way, although we
-didn't measure yet; but the first issue with Windows would be to support
-Win64, which the regular PyPy doesn't.
+a limit of 65536 like Linux.  Windows might also have a solution, although we
+have not measured this yet; but first we would need a 64-bit Windows PyPy, which has
+not seen much active support.
 
-We will likely explore the OS/X way (as well as the Windows way if Win64
-support grows in PyPy), but this is not included in the scope of this
-proposal.
+We will likely explore the OS/X path (as well as the Windows path if Win64
+support grows in PyPy), but this is not part of the current
+donation proposal.
 
 It might be possible to adapt the work done on x86-64 to the 64-bit
-ARMv8 as well, but we didn't investigate so far.
+ARMv8 as well. We have not investigated this so far.
 
 
 More readings

