Adding `offset` argument to np.lib.format.open_memmap and np.load
https://github.com/jonovik/numpy/compare/master...offset_memmap

The `offset` argument to np.memmap enables memory-mapping a portion of a file on disk to a memory-mapped Numpy array. Memory-mapping can also be done with np.load, but it uses np.lib.format.open_memmap, which has no offset argument. I have added an offset argument to np.lib.format.open_memmap and np.load as detailed in the link above, and humbly submit the changes for review. This is my first time using git, so apologies for any mistakes.

Note that the offset is in terms of array elements, not bytes (which is what np.memmap uses), because that was what I had use for. Also, I added a `shape` argument to np.load to memory-map only a portion of a file.

My use case was to preallocate a big record array on disk, then start many processes writing to their separate, memory-mapped segments of the file. The end result was one big array on disk, with the correct shape and data type information. Using a record array makes the data structure more self-documenting. Using open_memmap with mode="w+" is the fastest way I've found to preallocate an array on disk; it does not create the huge array in memory. Letting multiple processes memory-map and read/write to non-overlapping portions without interfering with each other allows for fast, simple parallel I/O.

I've used this extensively on Numpy 1.4.0, but based my Git checkout on the current Numpy trunk. There have been some rearrangements in np.load since then (it used to be in np.lib.io and is now in np.lib.npyio), but as far as I can see, my modifications carry over fine. I haven't had a chance to test with Numpy trunk, though. (What is the best way to set up a test version without affecting my working 1.4.0 setup?)

Hope this can be useful,
Jon Olav Vik
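[Editor's note: for readers skimming the thread, here is a minimal sketch of the workflow described above. It assumes the patched open_memmap with the proposed `offset` and `shape` arguments (offset counted in elements along the first axis); the file name, dtype and worker bookkeeping are illustrative only, not part of the patch.]

```python
import numpy as np
from numpy.lib.format import open_memmap

# Parent process: preallocate one big structured array on disk without
# ever materializing it in memory.
dtype = np.dtype([("param", "f8"), ("result", "f8", (100,))])
big = open_memmap("results.npy", mode="w+", dtype=dtype, shape=(100000,))
del big  # header is written; workers reopen their own segments

def worker(i, n_workers, total=100000):
    """Each process memory-maps only its own contiguous segment."""
    per = total // n_workers
    start = i * per
    rows = total - start if i == n_workers - 1 else per
    # `offset` and `shape` are the arguments proposed in this thread.
    seg = open_memmap("results.npy", mode="r+", offset=start, shape=rows)
    seg["param"] = i        # write only to this process's rows
    seg["result"] = 0.0
    del seg                 # flush the segment back to disk
```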
Jon Olav Vik <jonovik <at> gmail.com> writes:
https://github.com/jonovik/numpy/compare/master...offset_memmap
I've used this extensively on Numpy 1.4.0, but based my Git checkout on the current Numpy trunk. There have been some rearrangements in np.load since then (it used to be in np.lib.io and is now in np.lib.npyio), but as far as I can see, my modifications carry over fine. I haven't had a chance to test with Numpy trunk, though. (What is the best way to set up a test version without affecting my working 1.4.0 setup?)
I tried to push my modifications for 1.4.0, but couldn't figure out how my Github account could hold forks of both Numpy trunk and maintenance/1.4.x. Anyhow, here is a patch for 1.4:
```
From c3ff71637c6c00d6cac1ee22a2cad34de2449431 Mon Sep 17 00:00:00 2001
From: Jon Olav Vik <jonovik@gmail.com>
Date: Thu, 24 Feb 2011 17:38:03 +0100
Subject: [PATCH 54/54] Added `offset` parameter as in np.memmap to np.load
 and np.lib.format.open_memmap.

Modified numpy/lib/format.py
Modified numpy/lib/io.py
```
Doctests:
filename = "temp.npy" np.save(filename, np.arange(10)) load(filename) array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) mmap = load(filename, mmap_mode="r+") mmap memmap([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) mmap[3:7] = 42 del mmap np.load(filename) array([ 0, 1, 2, 42, 42, 42, 42, 7, 8, 9]) mmap = load(filename, mmap_mode="r+", offset=2, shape=6) mmap[-1] = 123 del mmap np.load(filename) array([ 0, 1, 2, 42, 42, 42, 42, 123, 8, 9]) import os os.remove(filename)
```diff
 numpy/lib/format.py |   17 +++++++++++++----
 numpy/lib/io.py     |    7 +++++--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/numpy/lib/format.py b/numpy/lib/format.py
index 3c5fe32..7c28b09 100644
--- a/numpy/lib/format.py
+++ b/numpy/lib/format.py
@@ -460,7 +460,7 @@ def read_array(fp):
 
 
 def open_memmap(filename, mode='r+', dtype=None, shape=None,
-                fortran_order=False, version=(1,0)):
+                fortran_order=False, version=(1,0), offset=0):
     """
     Open a .npy file as a memory-mapped array.
 
@@ -479,13 +479,15 @@ def open_memmap(filename, mode='r+', dtype=None, shape=None,
         mode.
     shape : tuple of int, optional
         The shape of the array if we are creating a new file in "write"
-        mode.
+        mode. Shape of (contiguous) slice if opening an existing file.
     fortran_order : bool, optional
         Whether the array should be Fortran-contiguous (True) or
         C-contiguous (False) if we are creating a new file in "write" mode.
     version : tuple of int (major, minor)
         If the mode is a "write" mode, then this is the version of the file
         format used to create the file.
+    offset : int, optional
+        Number of elements to skip along the first dimension.
 
     Returns
     -------
@@ -509,6 +511,7 @@ def open_memmap(filename, mode='r+', dtype=None, shape=None,
                 " existing file handles.")
 
     if 'w' in mode:
+        assert offset == 0, "Cannot specify offset when creating memmap"
         # We are creating the file, not reading it.
         # Check if we ought to create the file.
         if version != (1, 0):
@@ -541,11 +544,17 @@ def open_memmap(filename, mode='r+', dtype=None, shape=None,
             if version != (1, 0):
                 msg = "only support version (1,0) of file format, not %r"
                 raise ValueError(msg % (version,))
-            shape, fortran_order, dtype = read_array_header_1_0(fp)
+            fullshape, fortran_order, dtype = read_array_header_1_0(fp)
+            if shape is None:
+                shape = fullshape
+                if offset:
+                    shape = list(fullshape)
+                    shape[0] = shape[0] - offset
+                    shape = tuple(shape)
             if dtype.hasobject:
                 msg = "Array can't be memory-mapped: Python objects in dtype."
                 raise ValueError(msg)
-            offset = fp.tell()
+            offset = fp.tell() + offset * dtype.itemsize
         finally:
             fp.close()
 
diff --git a/numpy/lib/io.py b/numpy/lib/io.py
index 262d20d..694bae2 100644
--- a/numpy/lib/io.py
+++ b/numpy/lib/io.py
@@ -212,7 +212,7 @@ class NpzFile(object):
         return self.files.__contains__(key)
 
 
-def load(file, mmap_mode=None):
+def load(file, mmap_mode=None, offset=0, shape=None):
     """
     Load a pickled, ``.npy``, or ``.npz`` binary file.
 
@@ -272,6 +272,9 @@ def load(file, mmap_mode=None):
     memmap([4, 5, 6])
 
     """
+    if (not mmap_mode) and (offset or shape):
+        raise ValueError("Offset and shape should be used only with mmap_mode")
+
     import gzip
 
     if isinstance(file, basestring):
@@ -290,7 +293,7 @@ def load(file, mmap_mode=None):
             return NpzFile(fid)
         elif magic == format.MAGIC_PREFIX: # .npy file
             if mmap_mode:
-                return format.open_memmap(file, mode=mmap_mode)
+                return open_memmap(file, mode=mmap_mode, shape=shape, offset=offset)
             else:
                 return format.read_array(fid)
         else:  # Try a pickle
--
1.7.4.msysgit.0
```
Hi Jon, Thanks for the patch, and sorry for the slow reply. On Thu, Feb 24, 2011 at 11:49 PM, Jon Olav Vik <jonovik@gmail.com> wrote:
https://github.com/jonovik/numpy/compare/master...offset_memmap
The `offset` argument to np.memmap enables memory-mapping a portion of a file on disk to a memory-mapped Numpy array. Memory-mapping can also be done with np.load, but it uses np.lib.format.open_memmap, which has no offset argument.
I have added an offset argument to np.lib.format.open_memmap and np.load as detailed in the link above, and humbly submit the changes for review. This is my first time using git, apologies for any mistakes.
My first question after looking at this is why we would want three very similar ways to load memory-mapped arrays (np.memmap, np.load, np.lib.format.open_memmap)? They already exist but your changes make those three even more similar. I'd think we want one simple version (load) and a full-featured one. So imho changing open_memmap but leaving np.load as-is would be the way to go.
Note that the offset is in terms of array elements, not bytes (which is what np.memmap uses), because that was what I had use for.
This should be kept in bytes, like in np.memmap, I think; it's just confusing if those two functions differ like that. Another thing to change: you should not use an assert statement; use "if ...: raise ..." instead.
My use case was to preallocate a big record array on disk, then start many processes writing to their separate, memory-mapped segments of the file. The end result was one big array on disk, with the correct shape and data type information. Using a record array makes the data structure more self-documenting. Using open_memmap with mode="w+" is the fastest way I've found to preallocate an array on disk; it does not create the huge array in memory. Letting multiple processes memory-map and read/write to non-overlapping portions without interfering with each other allows for fast, simple parallel I/O.
I've used this extensively on Numpy 1.4.0, but based my Git checkout on the current Numpy trunk. There have been some rearrangements in np.load since then (it used to be in np.lib.io and is now in np.lib.npyio), but as far as I can see, my modifications carry over fine. I haven't had a chance to test with Numpy trunk, though. (What is the best way to set up a test version without affecting my working 1.4.0 setup?)
You can use an in-place build (http://projects.scipy.org/numpy/wiki/DevelopmentTips) and add that dir to your PYTHONPATH. Cheers, Ralf
Ralf Gommers <ralf.gommers <at> googlemail.com> writes:
My first question after looking at this is why we would want three very similar ways to load memory-mapped arrays (np.memmap, np.load, np.lib.format.open_memmap)? They already exist but your changes make those three even more similar.
If I understand correctly, np.memmap requires you to specify the dtype. It cannot by itself memory-map a file that has been saved with e.g. np.save. A file with (structured, in my case) dtype information is much more self-documenting than a plain binary file with no dtype. Functions for further processing of the data then need only read the file to know how to interpret it.
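[Editor's note: to make the contrast concrete, here is a small illustration; file names and dtype are arbitrary. A raw binary dump needs dtype and shape supplied by the caller, while a .npy file carries them in its header.]

```python
import numpy as np

a = np.zeros(10, dtype=[("t", "f8"), ("y", "f8")])

# Raw binary dump: nothing in the file records dtype or shape, so a later
# np.memmap must be told both explicitly.
a.tofile("data.raw")
raw = np.memmap("data.raw", dtype=[("t", "f8"), ("y", "f8")],
                mode="r", shape=(10,))

# .npy file: dtype and shape live in the header, so memory-mapping it
# needs no extra bookkeeping from the caller.
np.save("data.npy", a)
arr = np.load("data.npy", mmap_mode="r")
```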
I'd think we want one simple version (load) and a full-featured one. So imho changing open_memmap but leaving np.load as-is would be the way to go.
np.load calls open_memmap if mmap_mode is specified. The only change I made to np.load was to add offset and shape parameters that are passed through to open_memmap. (A shape argument is required for offset to be really useful, at least for my use case of multiple processes memory-mapping their own portion of a file.)
Note that the offset is in terms of array elements, not bytes (which is what np.memmap uses), because that was what I had use for.
This should be kept in bytes like in np.memmap I think, it's just confusing if those two functions differ like that.
I agree that there is room for confusion, but it is quite inconvenient having to access the file once for the dtype, then compute the offset based on the item size of the dtype, then access the file again for the real memory-mapping. The single most common scenario for me is "I am process n out of N, and I will memory-map my fair share of this file". Given n and N, I can compute the offset and shape in terms of array elements, but converting it to bytes means a couple extra lines of code every time I do it. (If anything, I'd prefer the offset argument for np.memmap to be in elements too, in accordance with how indexing and striding works with Numpy arrays.)
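[Editor's note: a rough sketch of the bookkeeping being described, assuming the header offset, dtype and total length are already known from reading the .npy header; the helper name is just for illustration.]

```python
def my_share_in_bytes(n, N, total_rows, dtype, header_bytes):
    """Process n of N: turn an element-based share of axis 0 into the
    byte offset and shape that np.memmap expects."""
    rows_per_proc = total_rows // N
    start_row = n * rows_per_proc
    # the last process picks up the remainder
    n_rows = total_rows - start_row if n == N - 1 else rows_per_proc
    byte_offset = header_bytes + start_row * dtype.itemsize
    return byte_offset, (n_rows,)
```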
Another thing to change: you should not use an assert statement, use "if ....: raise ...." instead.
Will do if this gets support. Thanks for the feedback 8-)

My use case was to preallocate a big record array on disk, then start many processes writing to their separate, memory-mapped segments of the file. The end result was one big array on disk, with the correct shape and data type information. Using a record array makes the data structure more self-documenting. Using open_memmap with mode="w+" is the fastest way I've found to preallocate an array on disk; it does not create the huge array in memory. Letting multiple processes memory-map and read/write to non-overlapping portions without interfering with each other allows for fast, simple parallel I/O.

I've used this extensively on Numpy 1.4.0, but based my Git checkout on the current Numpy trunk. There have been some rearrangements in np.load since then (it used to be in np.lib.io and is now in np.lib.npyio), but as far as I can see, my modifications carry over fine. I haven't had a chance to test with Numpy trunk, though. (What is the best way to set up a test version without affecting my working 1.4.0 setup?)
You can use an in-place build (http://projects.scipy.org/numpy/wiki/DevelopmentTips) and add that dir to your PYTHONPATH.
Very helpful. Thanks again! Regards, Jon Olav
On Thu, Feb 24, 2011 at 09:49, Jon Olav Vik <jonovik@gmail.com> wrote:
My use case was to preallocate a big record array on disk, then start many processes writing to their separate, memory-mapped segments of the file. The end result was one big array on disk, with the correct shape and data type information. Using a record array makes the data structure more self-documenting. Using open_memmap with mode="w+" is the fastest way I've found to preallocate an array on disk; it does not create the huge array in memory. Letting multiple processes memory-map and read/write to non-overlapping portions without interfering with each other allows for fast, simple parallel I/O.
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
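[Editor's note: a minimal sketch of the approach Robert describes, using only stock NumPy and no offset argument; it assumes the file holds a plain 1-D numeric array, and the file name and worker arguments are illustrative.]

```python
import numpy as np
from numpy.lib.format import open_memmap

def worker(i, n_workers, filename="results.npy"):
    """Map the whole .npy file and write only to this process's slice."""
    full = open_memmap(filename, mode="r+")
    rows = np.array_split(np.arange(full.shape[0]), n_workers)[i]
    if len(rows):
        full[rows[0]:rows[-1] + 1] = i   # touch only our own rows
    del full                             # flush changes to disk
```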
On 01.03.2011 01:15, Robert Kern wrote:
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you.
Mapping large files from the start will not always work on 32-bit systems. That is why mmap.mmap takes an offset argument now (Python 2.7 and 3.1). Making a view of an np.memmap with slices is useful on 64-bit but not 32-bit systems. Sturla
On 01.03.2011 01:50, Sturla Molden wrote:
Mapping large files from the start will not always work on 32-bit systems. That is why mmap.mmap take an offset argument now (Python 2.7 and 3.1.)
Also, numpy.memmap is strictly speaking not needed. One can just use mmap.mmap and pass it to np.frombuffer. Sturla
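[Editor's note: a small sketch of that alternative, under the assumption that "data.raw" is a plain float64 file at least two allocation units long; the offset passed to mmap.mmap must be a multiple of mmap.ALLOCATIONGRANULARITY.]

```python
import mmap
import numpy as np

step = mmap.ALLOCATIONGRANULARITY
f = open("data.raw", "r+b")
# Map only the window of bytes [step, 2*step) instead of the whole file.
m = mmap.mmap(f.fileno(), length=step, offset=step, access=mmap.ACCESS_WRITE)
window = np.frombuffer(m, dtype=np.float64)   # NumPy view of the mapped window
window[:] = 0.0                               # changes are written back to the file
del window                                    # release the buffer before closing
m.close()
f.close()
```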
On Mon, Feb 28, 2011 at 18:50, Sturla Molden <sturla@molden.no> wrote:
On 01.03.2011 01:15, Robert Kern wrote:
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you.
Mapping large files from the start will not always work on 32-bit systems. That is why mmap.mmap take an offset argument now (Python 2.7 and 3.1.)
Making a view np.memmap with slices is useful on 64-bit but not 32-bit systems.
I'm talking about the OP's stated use case where he generates the file via memory-mapping the whole thing on the same machine. The whole file does fit into the address space in his use case. I'd like to see a real use case where this does not hold. I suspect that this is not the API we would want for such use cases. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Robert Kern <robert.kern <at> gmail.com> writes:
On Mon, Feb 28, 2011 at 18:50, Sturla Molden <sturla <at> molden.no> wrote:
On 01.03.2011 01:15, Robert Kern wrote:
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you.
Wow, I didn't know that. So as long as the ranges touched by each process do not overlap, I'll be safe? If I modify only a few discontiguous chunks in a range, will the virtual memory manager decide whether it is most efficient to write just the chunks or the entire range back to disk?
Mapping large files from the start will not always work on 32-bit systems. That is why mmap.mmap take an offset argument now (Python 2.7 and 3.1.)
Making a view np.memmap with slices is useful on 64-bit but not 32-bit systems.
I'm talking about the OP's stated use case where he generates the file via memory-mapping the whole thing on the same machine. The whole file does fit into the address space in his use case.
I'd like to see a real use case where this does not hold. I suspect that this is not the API we would want for such use cases.
Use case: Generate "large" output for "many" parameter scenarios.

1. Preallocate "enormous" output file on disk.
2. Each process fills in part of the output.
3. Analyze, aggregate results, perhaps save to HDF or database, in a sliding-window fashion using a memory-mapped array. The aggregated results fit in memory, even though the raw output doesn't.

My real work has been done on a 64-bit cluster running 64-bit Python, but I'd like to have the option of post-processing on my laptop's 32-bit Python (either spending a few hours copying the file to my laptop first, or mounting the remote disk using e.g. ExpanDrive). Maybe that is impossible with 32-bit Python: at least I cannot allocate that big a file on my laptop:
```
>>> m = np.lib.format.open_memmap("c:/temp/temp.npy", "w+", dtype=np.int8, shape=2**33)
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
  File "C:\Python26\lib\site-packages\numpy\lib\format.py", line 563, in open_memmap
    mode=mode, offset=offset)
  File "C:\Python26\lib\site-packages\numpy\core\memmap.py", line 221, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OverflowError: cannot fit 'long' into an index-sized integer
```
On Tue, Mar 1, 2011 at 07:20, Jon Olav Vik <jonovik@gmail.com> wrote:
Robert Kern <robert.kern <at> gmail.com> writes:
On Mon, Feb 28, 2011 at 18:50, Sturla Molden <sturla <at> molden.no> wrote:
On 01.03.2011 01:15, Robert Kern wrote:
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you.
Wow, I didn't know that. So as long as the ranges touched by each process do not overlap, I'll be safe? If I modify only a few discontiguous chunks in a range, will the virtual memory manager decide whether it is most efficient to write just the chunks or the entire range back to disk?
It's up to the virtual memory manager, but usually, it will just load those pages (chunks the size of mmap.PAGESIZE) that are touched by your request and write them back.
Mapping large files from the start will not always work on 32-bit systems. That is why mmap.mmap take an offset argument now (Python 2.7 and 3.1.)
Making a view np.memmap with slices is useful on 64-bit but not 32-bit systems.
I'm talking about the OP's stated use case where he generates the file via memory-mapping the whole thing on the same machine. The whole file does fit into the address space in his use case.
I'd like to see a real use case where this does not hold. I suspect that this is not the API we would want for such use cases.
Use case: Generate "large" output for "many" parameter scenarios. 1. Preallocate "enormous" output file on disk. 2. Each process fills in part of the output. 3. Analyze, aggregate results, perhaps save to HDF or database, in a sliding-window fashion using a memory-mapped array. The aggregated results fit in memory, even though the raw output doesn't.
My real work has been done on a 64-bit cluster running 64-bit Python, but I'd like to have the option of post-processing on my laptop's 32-bit Python (either spending a few hours copying the file to my laptop first, or mounting the remote disk using e.g. ExpanDrive).
Okay, in this case, I don't think that just adding an offset argument to np.load() is very useful. You will want to read the dtype and shape information from the header, *then* decide what offset and shape to use for the memory-mapped segment. You will want to use the functions read_magic() and read_array_header_1_0() from np.lib.format directly. You can slightly modify the logic in open_memmap():

```python
# Read the header of the file first.
fp = open(filename, 'rb')
try:
    version = read_magic(fp)
    if version != (1, 0):
        msg = "only support version (1,0) of file format, not %r"
        raise ValueError(msg % (version,))
    shape, fortran_order, dtype = read_array_header_1_0(fp)
    if dtype.hasobject:
        msg = "Array can't be memory-mapped: Python objects in dtype."
        raise ValueError(msg)
    offset = fp.tell()
finally:
    fp.close()

chunk_offset, chunk_shape = decide_offset_shape(dtype, shape, fortran_order, offset)

marray = np.memmap(filename, dtype=dtype, shape=chunk_shape,
                   order=('F' if fortran_order else 'C'), mode='r+',
                   offset=chunk_offset)
```

What might help is combining the first stanza of logic together into one read_header() function that returns the usual information and also the offset to the actual data. That lets you avoid replicating the logic for handling different format versions. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
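[Editor's note: for concreteness, the read_header() helper Robert sketches could look something like the following. This is only one possible shape for it, assembled from the snippet above; the function is hypothetical, not an existing NumPy API.]

```python
from numpy.lib.format import read_magic, read_array_header_1_0

def read_header(filename):
    """Return (shape, fortran_order, dtype, data_offset) for a .npy file,
    reading only the header and never touching the data block."""
    fp = open(filename, 'rb')
    try:
        version = read_magic(fp)
        if version != (1, 0):
            raise ValueError("only support version (1,0) of file format, "
                             "not %r" % (version,))
        shape, fortran_order, dtype = read_array_header_1_0(fp)
        if dtype.hasobject:
            raise ValueError("Array can't be memory-mapped: "
                             "Python objects in dtype.")
        return shape, fortran_order, dtype, fp.tell()
    finally:
        fp.close()
```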
Robert Kern <robert.kern <at> gmail.com> writes:
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you.
Wow, I didn't know that. So as long as the ranges touched by each process do not overlap, I'll be safe? If I modify only a few discontiguous chunks in a range, will the virtual memory manager decide whether it is most efficient to write just the chunks or the entire range back to disk?
It's up to the virtual memory manager, but usually, it will just load those pages (chunks the size of mmap.PAGESIZE) that are touched by your request and write them back.
What if two processes touch adjacent chunks that are smaller than a page? Is there a risk that writing back an entire page will overwrite the efforts of another process?
Use case: Generate "large" output for "many" parameter scenarios. 1. Preallocate "enormous" output file on disk. 2. Each process fills in part of the output. 3. Analyze, aggregate results, perhaps save to HDF or database, in a sliding-window fashion using a memory-mapped array. The aggregated results fit in memory, even though the raw output doesn't. [...] Okay, in this case, I don't think that just adding an offset argument to np.load() is very useful. You will want to read the dtype and shape information from the header, *then* decide what offset and shape to use for the memory-mapped segment. You will want to use the functions read_magic() and read_array_header_1_0() from np.lib.format directly.
Pardon me if I misunderstand, but isn't that what np.load does already, with or without my modifications? The existing np.load calls open_memmap if memory-mapping is requested. open_memmap does read the header first, using read_magic and getting the shape and dtype from read_array_header_1_0(). It currently passes offset=fp.tell() to numpy.memmap. I just modify this offset based on the number of items to skip and the dtype's item size.
You can slightly modify the logic in open_memmap():
```python
# Read the header of the file first.
fp = open(filename, 'rb')
try:
    version = read_magic(fp)
    if version != (1, 0):
        msg = "only support version (1,0) of file format, not %r"
        raise ValueError(msg % (version,))
    shape, fortran_order, dtype = read_array_header_1_0(fp)
    if dtype.hasobject:
        msg = "Array can't be memory-mapped: Python objects in dtype."
        raise ValueError(msg)
    offset = fp.tell()
finally:
    fp.close()

chunk_offset, chunk_shape = decide_offset_shape(dtype, shape, fortran_order, offset)

marray = np.memmap(filename, dtype=dtype, shape=chunk_shape,
                   order=('F' if fortran_order else 'C'), mode='r+',
                   offset=chunk_offset)
```
To me this seems essentially equivalent to what my hack is doing, https://github.com/jonovik/numpy/compare/master...offset_memmap. I guess decide_offset_shape() would encapsulate the gist of what I added.
On Tue, Mar 1, 2011 at 15:36, Jon Olav Vik <jonovik@gmail.com> wrote:
Robert Kern <robert.kern <at> gmail.com> writes:
You can have each of those processes memory-map the whole file and just operate on their own slices. Your operating system's virtual memory manager should handle all of the details for you.
Wow, I didn't know that. So as long as the ranges touched by each process do not overlap, I'll be safe? If I modify only a few discontiguous chunks in a range, will the virtual memory manager decide whether it is most efficient to write just the chunks or the entire range back to disk?
It's up to the virtual memory manager, but usually, it will just load those pages (chunks the size of mmap.PAGESIZE) that are touched by your request and write them back.
What if two processes touch adjacent chunks that are smaller than a page? Is there a risk that writing back an entire page will overwrite the efforts of another process?
I believe that there is only one page in main memory. Each process is simply pointed to the same page. As long as you don't write to the same specific byte, you'll be fine.
Use case: Generate "large" output for "many" parameter scenarios. 1. Preallocate "enormous" output file on disk. 2. Each process fills in part of the output. 3. Analyze, aggregate results, perhaps save to HDF or database, in a sliding-window fashion using a memory-mapped array. The aggregated results fit in memory, even though the raw output doesn't. [...] Okay, in this case, I don't think that just adding an offset argument to np.load() is very useful. You will want to read the dtype and shape information from the header, *then* decide what offset and shape to use for the memory-mapped segment. You will want to use the functions read_magic() and read_array_header_1_0() from np.lib.format directly.
Pardon me if I misunderstand, but isn't that what np.load does already, with or without my modifications?
With your modifications, the user does not get to see the header information before they pick the offset and shape. I contend that the user ought to read the shape information before deciding the shape to use. I don't think that changing the np.load() API is the best way to solve this problem. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Robert Kern <robert.kern <at> gmail.com> writes:
It's up to the virtual memory manager, but usually, it will just load those pages (chunks the size of mmap.PAGESIZE) that are touched by your request and write them back.
What if two processes touch adjacent chunks that are smaller than a page? Is there a risk that writing back an entire page will overwrite the efforts of another process?
I believe that there is only one page in main memory. Each process is simply pointed to the same page. As long as you don't write to the same specific byte, you'll be fine.
Within a single machine, that sounds fine. What about processes running on different nodes, with different main memories?
Pardon me if I misunderstand, but isn't that what np.load does already, with or without my modifications?
With your modifications, the user does not get to see the header information before they pick the offset and shape. I contend that the user ought to read the shape information before deciding the shape to use.
Actually, that is what I've done for my own use (trivial parallelism, where I know that axis 0 is "long" and suitable for dividing the workload): read the shape first, divide its first dimension into chunks with np.array_split(), then memmap the portion I need. I didn't submit that function for inclusion because it is rather specific to my own work. For process "ID" out of "NID", the code is roughly as follows:

```python
def memmap_chunk(filename, ID, NID, mode="r"):
    r = open_memmap(filename, "r")
    n = r.shape[0]
    i = np.array_split(range(n), NID)[ID]
    if len(i) > 0:
        offset = i[0]
        shape = 1 + i[-1] - i[0]
        return open_memmap(filename, mode=mode, offset=offset, shape=shape)
    else:
        return np.empty(0, r.dtype)
```
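[Editor's note: just to show how such a helper would be driven. Everything here is hypothetical: the environment variables, file name and field names are stand-ins for whatever the queueing system actually provides.]

```python
import os
import numpy as np
from numpy.lib.format import open_memmap

ID = int(os.environ.get("JOB_ID", "0"))    # this job's index, 0..NID-1
NID = int(os.environ.get("N_JOBS", "1"))   # total number of jobs

chunk = memmap_chunk("results.npy", ID, NID, mode="r+")
chunk["result"] = chunk["param"] * 2       # placeholder for the real work
del chunk                                  # flush this segment to disk
```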
I don't think that changing the np.load() API is the best way to solve this problem.
I can agree with that. What I actually use is open_memmap() as shown above, but couldn't have done it without offset and shape arguments. In retrospect, changing np.load() was maybe a misstep in trying to generalize from my own hacks to something that might be useful to others. I kind of added offset and shape to np.load "for completeness", as it offers a mmap_mode argument but no way to memory-map just a portion of a file. So to attempt a summary: memory-mapping with np.load may be useful to conserve memory in a single process (with no need for offset and shape arguments), but splitting workload across multiple processes is best done with open_memmap. Then I humbly suggest that having offset and shape arguments to open_memmap is useful.
On Tue, Mar 1, 2011 at 17:06, Jon Olav Vik <jonovik@gmail.com> wrote:
Robert Kern <robert.kern <at> gmail.com> writes:
It's up to the virtual memory manager, but usually, it will just load those pages (chunks the size of mmap.PAGESIZE) that are touched by your request and write them back.
What if two processes touch adjacent chunks that are smaller than a page? Is there a risk that writing back an entire page will overwrite the efforts of another process?
I believe that there is only one page in main memory. Each process is simply pointed to the same page. As long as you don't write to the same specific byte, you'll be fine.
Within a single machine, that sounds fine. What about processes running on different nodes, with different main memories?
You mean mmapping a file on a shared file system? Then it's up to the file system. I'm honestly not sure what would happen for your particular file system. Try it and report back.

In any case, using the offset won't help. The virtual memory manager always deals with whole pages of size mmap.ALLOCATIONGRANULARITY, aligned with the start of the file. Under the covers, np.memmap() rounds the offset down to the nearest page boundary and then readjusts the pointer.

For performance reasons, I don't recommend doing it anyway. The networked file system becomes the bottleneck, in my experience.
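[Editor's note: a small sketch of the rounding being described, under the assumption that the relevant granularity is mmap.ALLOCATIONGRANULARITY; np.memmap's internals may differ in detail.]

```python
import mmap

def align_offset(byte_offset):
    """Split a byte offset into an allocation-aligned mapping start and
    the remaining in-page offset into the mapped region."""
    start = byte_offset - byte_offset % mmap.ALLOCATIONGRANULARITY
    return start, byte_offset - start

# e.g. with 64 KiB granularity, align_offset(100000) -> (65536, 34464)
```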
Pardon me if I misunderstand, but isn't that what np.load does already, with or without my modifications?
With your modifications, the user does not get to see the header information before they pick the offset and shape. I contend that the user ought to read the shape information before deciding the shape to use.
Actually, that is what I've done for my own use (trivial parallelism, where I know that axis 0 is "long" and suitable for dividing the workload): Read the shape first, divide its first dimension into chunks with np.array_split(), then memmap the portion I need. I didn't submit that function for inclusion because it is rather specific to my own work. For process "ID" out of "NID", the code is roughly as follows:
```python
def memmap_chunk(filename, ID, NID, mode="r"):
    r = open_memmap(filename, "r")
    n = r.shape[0]
    i = np.array_split(range(n), NID)[ID]
    if len(i) > 0:
        offset = i[0]
        shape = 1 + i[-1] - i[0]
        return open_memmap(filename, mode=mode, offset=offset, shape=shape)
    else:
        return np.empty(0, r.dtype)
```
I don't think that changing the np.load() API is the best way to solve this problem.
I can agree with that. What I actually use is open_memmap() as shown above, but couldn't have done it without offset and shape arguments.
In retrospect, changing np.load() was maybe a misstep in trying to generalize from my own hacks to something that might be useful to others. I kind of added offset and shape to np.load "for completeness", as it offers a mmap_mode argument but no way to memory-map just a portion of a file.
So to attempt a summary: memory-mapping with np.load may be useful to conserve memory in a single process (with no need for offset and shape arguments), but splitting workload across multiple processes is best done with open_memmap. Then I humbly suggest that having offset and shape arguments to open_memmap is useful.
I disagree. The important bit is to get the header information and the data offset out of the file without loading any data. Once you have that, np.memmap() suffices. You don't need to alter np.open_memmap() at all. In fact, if you do use np.open_memmap() to read the information, then you can't implement your "64-bit-large file on a 32-bit machine" use case. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Robert Kern <robert.kern <at> gmail.com> writes:
Within a single machine, that sounds fine. What about processes running on different nodes, with different main memories?
You mean mmaping a file on a shared file system?
Yes. GPFS, I believe, presumably this: http://en.wikipedia.org/wiki/GPFS Horrible latency on first access, but otherwise fast enough for my uses. I could have worked on local disk, copied them to my home directory, then consolidated the results, but the convenience of a single file appeals to my one-screenful attention span.
Then it's up to the file system. I'm honestly not sure what would happen for your particular file system. Try it and report back.
In any case, using the offset won't help. The virtual memory manager always deals with whole pages of size mmap.ALLOCATIONGRANULARITY, aligned with the start of the file. Under the covers, np.memmap() rounds the offset down to the nearest page boundary and then readjusts the pointer.
I have had 1440 processes writing timing information to a numpy file with about 60000 records of (starttime, finishtime) without problem. Likewise, I've written large amounts of output, which was sanity-checked during analysis. I ought to have noticed any errors.
For performance reasons, I don't recommend doing it anyway. The networked file system becomes the bottleneck, in my experience.
What would you suggest instead? Using separate files is an option, but requires a final pass to collect data, or lots of code to navigate the results. Having a master node collect data and write them to file is cumbersome on our queueing system (it's easier to schedule many small jobs that can run whenever there is free time, than require a master and workers to run at the same time).

I don't recall the exact numbers, but I have had several hundred processors running simultaneously, writing to the same numpy file on disk. It has been my impression that this is much faster than doing it through a single process. I was hoping to get the speed of writing separate files with the self-documentation of a single structured np.array on disk (open_memmap also saves me a few lines of code in writing output back to disk). That was before I learned about how the virtual memory manager enters into memory-mapping, though -- maybe I was just imagining things 8-/
Then I humbly suggest that having offset and shape arguments to open_memmap is useful.
I disagree. The important bit is to get the header information and the data offset out of the file without loading any data. Once you have that, np.memmap() suffices. You don't need to alter np.open_memmap() at all.
But if you're suggesting that the end user 1) use read_magic to check the version, 2) use read_array_header_1_0 to get shape, fortran_order, dtype, 3) call np.memmap with a suitable offset -- isn't that pretty much tantamount to duplicating np.lib.format.open_memmap? Actually, the function names read_array_header_1_0 and read_magic sound rather internal, not like something intended for an end-user. read_array_header_1_0 seems to be used only by open_memmap and read_array. Given the somewhat confusing array (pardon the pun) of ways to load files in Numpy, np.load() might in fact be a reasonable place to centralize all the options...
In fact, if you do use np.open_memmap() to read the information, then you can't implement your "64-bit-large file on a 32-bit machine" use case.
Do you mean that it could be done with np.memmap? Pardon me for being slow, but what is the crucial difference between np.memmap and open_memmap in this respect?
On Tue, Mar 1, 2011 at 18:40, Jon Olav Vik <jonovik@gmail.com> wrote:
Robert Kern <robert.kern <at> gmail.com> writes:
Within a single machine, that sounds fine. What about processes running on different nodes, with different main memories?
You mean mmaping a file on a shared file system?
Yes. GPFS, I believe, presumably this: http://en.wikipedia.org/wiki/GPFS Horrible latency on first access, but otherwise fast enough for my uses. I could have worked on local disk, copied them to my home directory, then consolidated the results, but the convenience of a single file appeals to my one-screenful attention span.
Then it's up to the file system. I'm honestly not sure what would happen for your particular file system. Try it and report back.
In any case, using the offset won't help. The virtual memory manager always deals with whole pages of size mmap.ALLOCATIONGRANULARITY, aligned with the start of the file. Under the covers, np.memmap() rounds the offset down to the nearest page boundary and then readjusts the pointer.
I have had 1440 processes writing timing information to a numpy file with about 60000 records of (starttime, finishtime) without problem. Likewise, I've written large amounts of output, which was sanity-checked during analysis. I ought to have noticed any errors.
For performance reasons, I don't recommend doing it anyway. The networked file system becomes the bottleneck, in my experience.
What would you suggest instead? Using separate files is an option, but requires a final pass to collect data, or lots of code to navigate the results. Having a master node collect data and write them to file is cumbersome on our queueing system (it's easier to schedule many small jobs that can run whenever there is free time, than require a master and workers to run at the same time).
I don't recall the exact numbers, but I have had several hundred processors running simultaneously, writing to the same numpy file on disk. It has been my impression that this is much faster than doing it through a single process. I was hoping to get the speed of writing separate files with the self-documentation of a single structured np.array on disk (open_memmap also saves me a few lines of code in writing output back to disk).
That was before I learned about how the virtual memory manager enters into memory-mapping, though -- maybe I was just imagining things 8-/
Well, if it's working for you, that's great!
Then I humbly suggest that having offset and shape arguments to open_memmap is useful.
I disagree. The important bit is to get the header information and the data offset out of the file without loading any data. Once you have that, np.memmap() suffices. You don't need to alter np.open_memmap() at all.
But if you're suggesting that the end user 1) use read_magic to check the version, 2) use read_array_header_1_0 to get shape, fortran_order, dtype, 3) call np.memmap with a suitable offset -- isn't that pretty much tantamount to duplicating np.lib.format.open_memmap?
Actually, the function names read_array_header_1_0 and read_magic sound rather internal, not like something intended for an end-user. read_array_header_1_0 seems to be used only by open_memmap and read_array.
That's not what I suggested. I suggested that steps 1 and 2 could be wrapped in a single function read_header() that provides the header information and the offset to the data.
Given the somewhat confusing array (pardon the pun) of ways to load files in Numpy, np.load() might in fact be a reasonable place to centralize all the options...
np.load()'s primary purpose is to be the one *simple* way to access NPY files, not the one place to expose every way to access NPY files.
In fact, if you do use np.open_memmap() to read the information, then you can't implement your "64-bit-large file on a 32-bit machine" use case.
Do you mean that it could be done with np.memmap? Pardon me for being slow, but what is the crucial difference between np.memmap and open_memmap in this respect?
You cannot use open_memmap() to read the header information on the 32-bit system since it will also try to map the data portion of the too-large file. If you can read the header information with something that does not try to read the data, then you can select a smaller shape that does fit into your 32-bit address space. It's not that np.memmap() works where open_memmap() doesn't; it's that my putative read_header() function would work where open_memmap() doesn't. My point about np.memmap() is that once you have the header information loaded, you don't need to use open_memmap() any more. np.memmap() does everything you need from that point on. There is no point in making open_memmap() or np.load() more flexible to support this use case. You just need the read_header() function. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
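[Editor's note: putting Robert's point together, the 32-bit scenario would look roughly like this, reusing the hypothetical read_header() sketched earlier and assuming a 1-D record array as in this thread; names and window sizes are illustrative.]

```python
import numpy as np

# Inspect the header only; no data is mapped yet, so this works even when
# the full file would not fit into a 32-bit address space.
shape, fortran_order, dtype, data_offset = read_header("huge.npy")

start_row, rows = 50000, 10000          # the window this process wants
window = np.memmap("huge.npy", dtype=dtype, mode="r+",
                   offset=data_offset + start_row * dtype.itemsize,
                   shape=(rows,))
```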
On 01.03.2011 14:20, Jon Olav Vik wrote:
Use case: Generate "large" output for "many" parameter scenarios. 1. Preallocate "enormous" output file on disk.
That's not a use case, because this will in general require 64-bit, for which the offset parameter does not matter.
Maybe that is impossible with 32-bit Python: at least I cannot allocate that big a file on my laptop.
32-bit Windows will give you 2 GB of virtual memory available in user space. The remaining 2 GB is reserved for device drivers etc. I don't know about Linux, but it is approximately the same. Note that I am not talking about physical memory but virtual address space. I am not talking about RAM.

When you memory-map a file, you use up some of this virtual address space. That is the key. Because the 32-bit address space is so small by today's standards, we often cannot afford to memory-map large portions of it. That is where the "offset" helps. Instead of memory-mapping the whole file, we just work with a small window of it. But unlike a NumPy subarray view, this slice is in the kernel of the operating system.

On 64-bit we have so much virtual memory that it does not matter. How much is system dependent. On recent AMD64 processors it is 256 TB, but I think Windows 64 "only" gives us 16 of those. Even so, this is still approximately 25 times the size of the hard disk on my computer. That is, with 64-bit Python I can memory-map everything on my computer, and it would hardly be noticed in the virtual address space. That is why an offset is not needed.

A typical use case for "offset" is a 32-bit database server memory-mapping a small window of a huge database. On 64-bit the offset could be ignored and the whole database mapped to memory -- one of the reasons 64-bit database servers perform better.

Sturla
participants (4)

- Jon Olav Vik
- Ralf Gommers
- Robert Kern
- Sturla Molden