mmap thoughts

Fri May 11 19:09:37 EDT 2007

 I've been thinking about the Python mmap module quite a bit
 during the last couple of days.  Sadly most of it has just been
 thinking ... and reading pages from Google searches ... and
 very little of it has been coding.

 Mostly it's just academic curiosity (I might be teaching an "overview 
 of programming" class in a few months, and I'd use Python for most 
 of the practical examples to cover a broad range of programming
 topics, including the whole concept of memory mapping used, on the
 one hand, as a file access abstraction and, on the other, as a form
 of inter-process shared memory).

 Initial observations:

   * The standard library reference could use some good examples.
     At least one of those should show use of both anonymous and
     named mmap objects as shared memory.

   * On Linux (various versions) using Python 2.4.x (for at
     least 2.4.4 and 2.4.2) if I create an mmap'ing in one
     process, then open the file using 'w' or 'w+' or 'w+b'
     in another process, then my first process dies with "Bus Error".

     This should probably be documented.

     (It's fine if I use 'a' (append) modes for opening the file).
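
     A likely explanation, offered as an assumption rather than
     something verified on the systems above: the 'w' modes truncate
     the file to zero length, so the pages behind the existing mapping
     now lie beyond the end of the file, and touching them raises
     SIGBUS.  Append mode leaves the length alone.  The sketch below
     only inspects file sizes (it never touches a live mapping); the
     helper name size_after_reopen is made up for illustration, and
     the syntax is current Python 3 rather than 2.4.

```python
import os
import tempfile

def size_after_reopen(mode, initial_size=4096):
    """Create a file of initial_size bytes, reopen it with `mode`,
    and return its size afterwards."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * initial_size)
        os.close(fd)
        # Merely opening the file is enough; nothing is written.
        with open(path, mode):
            pass
        return os.path.getsize(path)
    finally:
        os.remove(path)

if __name__ == "__main__":
    for mode in ("w", "w+b", "a", "r+b"):
        print(mode, size_after_reopen(mode))
```

     If that is what is happening, 'r+b' should be as safe as 'a'
     for a second process that wants to write to the file.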

   * It seems that it's also necessary to extend a file to a given
     size before creating a mapping on it.  In other words you can't
     mmap a newly created, 0-length file. 

     So it seems like the simplest example of a newly created,
     non-anonymous file mapping would be something like:

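          import mmap

          sz = (1024 * 1024 * 1024 * 2) - 1
          f = open('/tmp/mmtst.tmp', 'w+b')
          f.seek(sz)        # seek far past the end of the empty file ...
          f.write('\0')     # ... and write one byte to extend it
          f.flush()         # push the new size out before mapping
          mm = mmap.mmap(f.fileno(), sz, mmap.MAP_SHARED)
          f.close()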

     Even creating a zero-length file and trying to create a
     zero-length mapping on it (with mmap(f.fileno(),0,...)),
     with a mind towards using mmap's .resize() method on it,
     doesn't work (it raises EnvironmentError: "Errno 22: Invalid
     Argument").  BTW: the call to f.flush() does seem to be
     required, at least in my environments (Linux under 2.6 kernels,
     various distributions, and the aforementioned 2.4.2 and 2.4.4
     versions of Python).
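
     A minimal sketch of the same idea for current Python 3 (not the
     2.4.x discussed here), assuming a Unix-like system.  It uses
     f.truncate(size) to extend the file instead of the seek-and-write
     idiom, and a length of 0 to mean "map the whole file".  The
     function names create_mapped_file and try_map_empty are
     hypothetical; whether truncate() behaves the same way on 2.4 was
     not checked.

```python
import mmap
import os
import tempfile

def create_mapped_file(path, size):
    """Create (or truncate) `path`, extend it to `size` bytes, and map it."""
    with open(path, "w+b") as f:
        f.truncate(size)      # extends the file with zero bytes
        # length 0 maps the whole file as it exists right now
        return mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED)

def try_map_empty(path):
    """Return the exception raised when mapping a 0-length file, or None."""
    with open(path, "w+b") as f:
        try:
            mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED).close()
        except (ValueError, OSError) as exc:
            return exc
    return None

if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    print(repr(try_map_empty(path)))      # the empty file is refused
    mm = create_mapped_file(path, 1024 * 1024)
    mm[0:5] = b"hello"
    print(len(mm), mm[0:5])
    mm.close()
    os.remove(path)
```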

   * The mmtst.tmp file is "sparse" of course.  So its size in
     the example above is 2GB ... but the disk usage (du command)
     on it is only a few KB (depending on your filesystem cluster
     size etc).
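
     One way to see the difference from Python instead of du: a
     small sketch (Python 3, Linux assumed) comparing st_size with
     st_blocks.  The helper names are made up for illustration, and
     how sparse the result really is depends on the filesystem.

```python
import os
import tempfile

def make_sparse(path, size):
    """Create a file of `size` bytes by writing only its final byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")

def apparent_and_allocated(path):
    """Return (apparent size, bytes actually allocated) for `path`."""
    st = os.stat(path)
    # On Linux st_blocks counts 512-byte units, whatever the
    # filesystem's own block size is.
    return st.st_size, st.st_blocks * 512

if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    make_sparse(path, 64 * 1024 * 1024)
    print(apparent_and_allocated(path))
    os.remove(path)
```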

   * Using a function like len(mm[:]) forces the kernel's filesystem
     to return a huge stream of NUL characters.  (And might thrash
     your system caches a little).
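
     For comparison, len(mm) reports the mapping's length without
     reading any pages, and a large mapping can be walked one slice
     at a time rather than copied whole.  A minimal sketch in
     Python 3; count_nonzero and its chunk size are arbitrary choices
     for illustration.

```python
import mmap

def mapping_length(mm):
    """Length of the mapping; no page of it is read."""
    return len(mm)

def count_nonzero(mm, chunk=1 << 20):
    """Count non-NUL bytes, copying at most `chunk` bytes at a time."""
    total = 0
    for start in range(0, len(mm), chunk):
        block = mm[start:start + chunk]
        total += len(block) - block.count(b"\0")
    return total

if __name__ == "__main__":
    mm = mmap.mmap(-1, 3 * 4096)      # small anonymous mapping for the demo
    mm[5000:5003] = b"abc"
    print(mapping_length(mm), count_nonzero(mm, chunk=4096))
    mm.close()
```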

   * On my SuSE/Novell 10.1 system, using Python 2.4.2 (their RPM
     2.4.2-18) I found that anonymous mmaps would raise an
     EnvironmentError.  Using the same code on 2.4.4 on my Debian
     and Fedora Core 6 systems worked with no problem:

          anonmm = mmap.mmap(-1,4096,mmap.MAP_ANONYMOUS|mmap.MAP_SHARED)

     ... I also tested their 2.4.2-18.5 update, which fails the
     same way:

Python 2.4.2 (#1, Oct 13 2006, 17:11:24) 
[GCC 4.1.0 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mmap
>>> mm = mmap.mmap(-1,4096, mmap.MAP_ANONYMOUS|mmap.MAP_SHARED)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
EnvironmentError: [Errno 22] Invalid argument
>>> 
jadestar@dhcphostname:~> uname -a
Linux dhcphostname 2.6.16.13-4-default #1 Wed May 3 ...

   * On the troublesome SuSE/Novell box using:

          f = open('/dev/zero','w+')
          anonmm = mmap.mmap(f.fileno(),4096, 
             mmap.MAP_ANONYMOUS|mmap.MAP_SHARED)

     ... seems to work.  However, a .resize() on that raises the same
     EnvironmentError I was getting before.

   * As noted in a few discussions in the past, Python's mmap() function
     doesn't take an "offset" parameter ... it always uses an offset of 0.
     (It seems like a patch is slated for inclusion in some future release?)

   * On 32-bit Linux systems (or on systems running a 32-bit compilation
     of Python) 2GB is, of course, the upper limit of an mmap'ing.
     The ability to map portions of larger files is a key motivation to
     include the previously mentioned "offset" patch.
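
     Later Python releases (2.6 onward) did gain an offset argument
     for mmap.mmap(), along with mmap.ALLOCATIONGRANULARITY, which
     the offset must be a multiple of.  A minimal sketch of mapping a
     window of a larger file, written for Python 3 on a Unix-like
     system; map_window is a hypothetical helper name.

```python
import mmap
import os
import tempfile

def map_window(path, offset, length):
    """Map `length` bytes of `path`, starting `offset` bytes in."""
    if offset % mmap.ALLOCATIONGRANULARITY:
        raise ValueError("offset must be a multiple of ALLOCATIONGRANULARITY")
    with open(path, "r+b") as f:
        return mmap.mmap(f.fileno(), length,
                         flags=mmap.MAP_SHARED, offset=offset)

if __name__ == "__main__":
    gran = mmap.ALLOCATIONGRANULARITY
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        f.truncate(4 * gran)          # four "pages" of zero bytes
        f.seek(2 * gran)
        f.write(b"marker")            # lands at the start of the third
    win = map_window(path, 2 * gran, gran)
    print(len(win), win[0:6])
    win.close()
    os.remove(path)
```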

 Other thoughts:

   (Going beyond initial observations, now)

   * I haven't tested this, but I presume that anonymous|shared mappings
     on UNIX can only be shared with child/descendant processes ... since 
     there's no sort of "handle" or "key" that can be passed to unrelated
     processes via any other IPC method; so only fork() based inheritance
     will work.
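
     A quick way to check the fork() half of that presumption, as a
     minimal sketch for Python 3 on Linux: the child writes into an
     anonymous shared mapping created before the fork, and the parent
     reads it back.  It says nothing about unrelated processes, and
     the function name is made up for illustration.

```python
import mmap
import os

def child_writes_parent_reads(message=b"hello from child"):
    """Fork once; return what the parent sees after the child has written."""
    mm = mmap.mmap(-1, 4096, flags=mmap.MAP_ANONYMOUS | mmap.MAP_SHARED)
    pid = os.fork()
    if pid == 0:                          # child
        status = 1
        try:
            mm[0:len(message)] = message
            status = 0
        finally:
            os._exit(status)              # never fall back into parent code
    os.waitpid(pid, 0)                    # parent: wait for the child
    data = mm[0:len(message)]
    mm.close()
    return data

if __name__ == "__main__":
    print(child_writes_parent_reads())
```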

   * Another thing I haven't tested, yet: how robust are shared mappings 
     to adjacent/non-overlapping concurrent writes by multiple processes?
     I'm hoping that you can reliably have processes writing updates to 
     small, pre-assigned, blocks in the mmap'ing without contention issues.

     I plan to write some "hammer test" code to test this theory
     ... and run it for a while on a few multi-core/SMP systems.
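
     A first cut at such a test might look like the sketch below
     (Python 3, Linux assumed).  Each forked worker repeatedly
     rewrites only its own pre-assigned block, and the parent checks
     the final contents.  BLOCK, the worker count and the byte
     pattern are arbitrary choices for illustration, and the check
     only looks at the end state, so it is a smoke test rather than
     proof of anything.

```python
import mmap
import os

BLOCK = 64   # bytes per worker; an arbitrary size for this sketch

def hammer(workers=4, rounds=5000):
    """Fork `workers` children that each rewrite only their own block.
    Return True if every block ends up holding its owner's last pattern."""
    mm = mmap.mmap(-1, workers * BLOCK,
                   flags=mmap.MAP_ANONYMOUS | mmap.MAP_SHARED)
    pids = []
    for slot in range(workers):
        pid = os.fork()
        if pid == 0:                              # child
            status = 1
            try:
                base = slot * BLOCK
                for i in range(1, rounds + 1):
                    fill = bytes([(slot + i) % 256]) * BLOCK
                    mm[base:base + BLOCK] = fill
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    # Parent: wait for every child, then inspect each block.
    ok = all(os.waitpid(pid, 0)[1] == 0 for pid in pids)
    for slot in range(workers):
        expected = bytes([(slot + rounds) % 256]) * BLOCK
        ok = ok and mm[slot * BLOCK:(slot + 1) * BLOCK] == expected
    mm.close()
    return ok

if __name__ == "__main__":
    print(hammer())
```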

   * It would be nice to build something like the threading Queue
     and/or POSH support for multi-process support over nothing but
     pure Python (presumably using the mmap module to pass serialized
     objects around).

   * The limitations of Python threading and the GIL for scaling on
     SMP and multi-core systems are notorious; having a first-class,
     and reasonably portable standard library for supporting multi-PROCESS
     scaling would be of tremendous benefit now that such MP systems
     are becoming the norm.

   * There don't seem to be any currently maintained SysV IPC
     (shm, message, and semaphore) modules for Python.  I guess some
     people have managed to hack something together using ctypes;
     but I haven't actually read, much less tested, any of that code.

   * The key to robust and efficient use of shared memory is going to be
     in the design and implementation of locking primitives for using it.

   * I'm guessing that some additional IPC method will be required to
     co-ordinate the locking --- something like Unix domain sockets,
     perhaps.  At least I think it would be a practically unavoidable
     requirement for unrelated processes to share memory.

   * For related processes I could imagine a scheme whereby the parent
     of each process passes a unique "mailbox" offset to each child
     and where that might be used to implement a locking scheme.

     It might work something like this:

           Master process (parent) creates the mapping and initializes
           a lock at mm[0:4], a child counter and a lock counter (also
           at pre-defined offsets), and a set of mailboxes (sized to the
           max-number of children, or using blocks of the mm in a
           linked-list).

           For each sub-process (fork()'d child) the master increments
           the counter, and passes a mailbox offset (counter + current
           mailbox offset) to it.

           The master then goes into a loop, scanning the mailboxes
           (or goes idle with a SIGUSR* handler that scans the mailboxes)

           Whenever there are any non-empty mailboxes the master appends
           corresponding PIDs to a lock-request queue; then it pops
           those PIDs and writes them into the "lock" offset at mm[0]
           (perhaps sending a SIGUSR* to the new lock holder, too).

           That process now has the lock and can work on the shared memory.
           When it's done it would clear the lock and signal the master.

           All processes have read access to the memory while it's not
           locked.  However, they have to read the lock counter first,
           copy the data into their own address space, then verify that
           the lock counter has not been incremented in the interim.  (All
           reads are double-checked to ensure that no changes could
           have occurred during the copying).  

  ... there are a lot of details I haven't considered about such a
  scheme (I'm sure they'll come up if I prototype such a system).
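
  The double-checked read is the easiest piece to sketch.  The
  layout below (lock holder at mm[0:4], lock counter at mm[4:8],
  data after that) and the function names are assumptions made for
  illustration, and the writer assumes the master has already
  granted it the lock.  It only shows the bookkeeping; it makes no
  claim about atomicity or memory ordering across processes.
  Python 3:

```python
import mmap
import os
import struct

LOCK_OFF, COUNT_OFF, DATA_OFF = 0, 4, 8    # assumed layout
DATA_LEN = 64

def write_locked(mm, payload):
    """Writer side: mark the lock, bump the counter, write, release."""
    struct.pack_into("<I", mm, LOCK_OFF, os.getpid())
    (count,) = struct.unpack_from("<I", mm, COUNT_OFF)
    struct.pack_into("<I", mm, COUNT_OFF, (count + 1) & 0xFFFFFFFF)
    mm[DATA_OFF:DATA_OFF + DATA_LEN] = payload[:DATA_LEN].ljust(DATA_LEN, b"\0")
    struct.pack_into("<I", mm, LOCK_OFF, 0)

def read_checked(mm, max_tries=1000):
    """Reader side: copy the data, then verify nothing changed meanwhile."""
    for _ in range(max_tries):
        (holder,) = struct.unpack_from("<I", mm, LOCK_OFF)
        (before,) = struct.unpack_from("<I", mm, COUNT_OFF)
        if holder:
            continue                        # somebody holds the lock; retry
        copy = mm[DATA_OFF:DATA_OFF + DATA_LEN]
        (after,) = struct.unpack_from("<I", mm, COUNT_OFF)
        (holder,) = struct.unpack_from("<I", mm, LOCK_OFF)
        if holder == 0 and before == after:
            return copy.rstrip(b"\0")
    raise RuntimeError("could not get a consistent read")

if __name__ == "__main__":
    mm = mmap.mmap(-1, 4096)
    write_locked(mm, b"first")
    print(read_checked(mm))
    mm.close()
```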

  Obviously one could envision more complex data structures which
  essentially create a sort of shared "filesystem" in the shared
  memory ... where the "master" process is analogous to the filesystem
  "driver" for it.  Interesting cases for handling dead processes
  come up (the master could handle SIGCHLD by clearing locks held by
  the dearly departed) ... and timeouts/catatonic processes might be
  defined (master process kills the child before forcibly removing the
  lock).  Recovery from the loss of the "master" process might be
  possible (define a portion of the shared memory pool that holds the
  list of processes who become the new master ... first living one on
  that list assumes control).  But that raises new issues (can't depend
  on SIGCHLD in such a scheme; checking for living processes would
  have to be done via kill 0 calls, for example).
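
  The "kill 0" check is short enough to sketch.  In Python 3,
  os.kill(pid, 0) sends no signal but still reports whether the PID
  exists.  The helper names is_alive and first_living are made up
  for illustration; note that an unreaped zombie still counts as
  "alive" here, and that PIDs can be reused.

```python
import os

def is_alive(pid):
    """True if a process with this PID currently exists."""
    if pid <= 0:
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)            # signal 0: existence/permission check only
    except ProcessLookupError:
        return False               # no such process
    except PermissionError:
        return True                # it exists, but belongs to someone else
    return True

def first_living(candidates):
    """Return the first PID in `candidates` that is still alive, or None."""
    for pid in candidates:
        if is_alive(pid):
            return pid
    return None

if __name__ == "__main__":
    print(is_alive(os.getpid()), first_living([os.getpid()]))
```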

  It's easy to see how complicated all this could become.  The question
  is, how simple could we make it and still have something useful?

-- 
Jim Dennis,
Starshine: Signed, Sealed, Delivered