Hello,

Do you think it is likely that the memmap capabilities of numpy will find their way into numpypy any time soon?

It seems to me that some people think memmap is a relatively unimportant aspect of numpy, but I do not think so. Because of the way the Linux I/O subsystem and virtual memory system interact, memmap gives numpy high-performance access to very large data sets -- it helps make numpy relevant to "Big Data".

The code that lets numpy support memmap does not seem very large. But, while I have tried reading through the code, I really cannot tell whether the same is true for numpypy, or whether it is a large endeavor (for example, due to some kind of PyPy memory-management architectural issue).

I'm interested in any input on this.

Mike Beller
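To make the "Big Data" point concrete, here is a minimal sketch of how numpy.memmap is typically used on the CPython side; the file name and dtype are made up for illustration:

    import numpy as np

    # Hypothetical binary file of float64 samples, possibly much larger than RAM.
    fname = "ticks.f64"

    # Map the file instead of reading it: the OS pages data in on demand,
    # so arrays far larger than physical memory can be sliced and reduced.
    data = np.memmap(fname, dtype=np.float64, mode="r")

    # Only the pages covered by this slice are actually read from disk.
    print(data[1000000:1000010].mean())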
Hi,

This is definitely doable, but needs some work. Here are the tasks I identified:

- copy "memmap.py" from the official numpy, and write a unit test (there is a nice docstring in that file)
- add support in numpy for buffers with a fixed address (in interp-level terms: a RWBuffer with a get_raw_address() method)
- have buffer(mmap) return such a buffer, very similar to array.ArrayType

The three tasks are quite independent, not too difficult, and could be a nice start for newcomers... I'll be happy to help.

2013/6/14 Mike Beller <mike@tradeworx.com>
-- Amaury Forgeot d'Arc
Thank you for responding, Amaury.

I looked at the code for memmap.py. The core of it comes down to these two lines of code:

    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
    self = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm,
                           offset=offset, order=order)

It seems to me one needs to modify ndarray.__new__ so it can take a named buffer= argument, where that argument can be an mmap object. The CPython numpy ndarray.__new__ already accepts such an argument, but the numpypy one does not appear to.

When I try to figure out the changes to interp_numarray.py to make it work, it goes way over my head.

Thoughts?

Mike

On Fri, Jun 14, 2013 at 11:08 AM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
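For reference, here is roughly what those two lines amount to when run against stock CPython numpy -- i.e. the behaviour numpypy would need to reproduce. The file name is made up, and this assumes an existing writable binary file whose size is a multiple of 8 bytes:

    import mmap
    import numpy as np

    # Hypothetical file: an existing writable binary file of float64 values.
    with open("data.bin", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)          # map the whole file

    # Wrap the mapping in an ndarray without copying -- this is the buffer=
    # argument that numpypy's ndarray.__new__ would also need to accept.
    a = np.ndarray(shape=(mm.size() // 8,), dtype=np.float64, buffer=mm)

    a[0] = 42.0                                # writes go straight into the mapping
    mm.flush()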
2013/6/14 Mike Beller <mike@tradeworx.com>
It seems to me one needs to modify ndarray.__new__ so it can take a named buffer= argument, where that argument can be an mmap object. Currently the python ndarray.__new__ can accept such an argument, but the numpypy one does not appear to.
Correct.
When I try to figure out the changes to interp_numarray.py to make it work, it goes way over my head.
Hey, you need to get used to our RPython language. At least you opened the correct file, that's a great first step :-). In interp_numarray.py:

- the block starting with "W_NDimArray.typedef" describes the type and its methods. There is a "__new__" entry, which describes the constructor. The implementation is descr_new_array().
- in descr_new_array(), the variables that start with "w_" are wrapped objects, i.e. visible to Python (similar to PyObject* in the CPython implementation). At the moment there is a check for "space.is_none(w_buffer)"; we'll have to remove it, of course.
- the goal is to use the memory already allocated for w_buffer instead of allocating new storage. So don't call W_NDimArray.from_shape(); call W_NDimArray.from_shape_and_storage() instead, which takes the raw address of the buffer. That's it!

Of course I skipped over all the details, so the first thing to do is to write a unit test, to be sure that everything works correctly before building a new PyPy.

-- Amaury Forgeot d'Arc
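A rough sketch of what such an app-level test might look like -- one that should fail until descr_new_array() accepts the buffer argument. The class name, the numpypy import, and whether a bytearray is accepted as the buffer are assumptions; later messages in the thread use a BaseNumpyAppTest subclass in just this style:

    from pypy.module.micronumpy.test.test_base import BaseNumpyAppTest

    class AppTestNdarrayBuffer(BaseNumpyAppTest):
        def test_new_with_buffer(self):
            # runs at app level inside the test's object space
            from numpypy import ndarray
            buf = bytearray(3 * 8)              # backing storage for three float64s
            a = ndarray((3,), dtype=float, buffer=buf)
            a[0] = 1.5
            assert a[0] == 1.5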
Ok -- made some progress. I built a unit test -- it's similar to the doctest from memmap.py, but rewritten in py.test format (it uses BaseNumpyAppTest from pypy.module.micronumpy.test.test_base, just like test_numeric.py does).

Here is the next problem I face: to run my unit test, I added an import of my new memmap.py, which I have placed in lib_pypy/numpypy/core/memmap.py and linked into core/__init__.py. This module imports mmap for obvious reasons (it wants to use mmap to create the mmap object). That import fails when I run the test:

    python2.7 pytest.py pypy/module/test_lib_pypy/numpypy/core/test_memmap.py

("ImportError: No module named mmap")

It fails because it cannot import the module mmap. Whereas if I just fire up a normal interpreter-level pypy I have no problem importing mmap, so clearly it does exist in pypy.

I don't understand which modules will be loadable and which will not when I run under pytest.py. Somehow, clearly, the BaseNumpyAppTest inheritance mechanism has allowed all the numpypy stuff to show up (but only inside the class definition, not at module scope), yet other pypy application-level modules are not available. I need to somehow invoke the same magic for mmap if I want to use the mmap module to implement the numpypy memmap functionality. I read coding-guide.rst but it's still not obvious to me. (Alternatively I suppose the answer could be full translation, but testing this way would take a full 90-minute translation every time I wanted to change a line of code.)

If you feel this detailed emailing is wasting your time, just let me know and I can drop it. (Or if you want to take the conversation off pypy-dev that's ok too.)

Regards,
Mike

On Fri, Jun 14, 2013 at 5:43 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
2013/6/16 Mike Beller <mike@tradeworx.com>
Ok -- made some progress. Built a unit test -- it's similar to the doctest from memmap.py, but rewritten in py.test format (uses BaseNumpyAppTest from pypy.module.micronumpy.test.test_base, just like test_numeric.py does).
Here is the next problem I face: To run my unit test, I added in an import of my new memmap.py, which I have placed in lib_pypy/numpypy/core/memmap.py, and linked into core/__init__.py. This module imports mmap.py for obvious reasons (it wants to use mmap.py to create the mmap object). That import fails when I run the test: python2.7 pytest.py pypy/module/test_lib_pypy/numpypy/core/test_memmap.py ("Import error: no module named mmap")
great progress anyway!
It fails because it can not import the module mmap. Whereas if I just fire up a normal interpreter-level pypy I have no problem importing mmap. Clearly it does exist in pypy.
I don't understand which modules will be loadable and which modules will not be loadable when I run in pytest.py. Somehow, clearly, the BaseNumpyAppTest inheritance mechanism has allowed all the numpypy stuff to show up (but only inside the class definition, not at module scope), but other pypy application-level modules are not available. I need to somehow invoke the same magic for mmap if I want to be able to use the mmap module to implement the numpypy memmap functionality. I read coding-guide.rst but it's still not obvious to me. (Alternatively I suppose the answer could be full translation, but testing this way would take a full 90-minute translation every time I wanted to change a line of code.)
That's because most built-in modules are not made available by default. You need to pass the equivalent of "--withmod-mmap" to the interpreter. In a test, this is done in a statement like:

    spaceconfig = dict(usemodules=["mmap"])

at the top of the test class. There certainly is already one (in order to import numpy); you could add "mmap" to the existing list.
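Concretely, the top of the test class might look something like this (assuming the inherited list enables the numpy module under the name "micronumpy"; the exact contents of the existing list are an assumption):

    from pypy.module.micronumpy.test.test_base import BaseNumpyAppTest

    class AppTestMemmap(BaseNumpyAppTest):
        # enable the interp-level mmap module in addition to the numpy one
        spaceconfig = dict(usemodules=["micronumpy", "mmap"])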
If you feel this detailed emailing is wasting your time, just let me know and I can drop it. (Or if you want to take the conversation off pypy-dev that's ok too.)
You are welcome. PyPy is a wonderful platform with powerful tools, but they are often not documented enough. Pypy-dev traffic is not that high these days; to get immediate answers, many of us hang out on IRC: #pypy on freenode.net.

-- Amaury Forgeot d'Arc
On Sun, Jun 16, 2013 at 9:29 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
If you feel this detailed emailing is wasting your time, just let me know and I can drop it. (Or if you want to take the conversation off pypy-dev that's ok too.)
You are welcome. PyPy is a wonderful platform with powerful tools, but they are often not documented enough. Pypy-dev traffic is not that high these days; to get immediate answers there many of us are hanging on IRC: #pypy at freenode.net.
Exactly, and you're not alone in following the discussion. I personally welcome all these detailed explanations about how to do something inside pypy, because I've tried it myself, and it was hard to understand the big picture and also hard to get all the little details that escaped me at the time. So please continue this on-list, if deemed appropriate. There are people learning through it.

-- Vincent Legoll
Thank you both for your encouragement.

So I have made progress. I now have some unit tests, and they can compile and run. I have imported the memmap.py from numpy and modified it so it gets the needed items from micronumpy. I can run the unit tests and they fail for the correct reason -- that reason being that the buffer argument is not supported in interp_numarray.descr_new_array(). So that's great. Here are my next questions:

1) The way the numpy memmap module works, it calls ndarray.__new__() and monkey-patches the return value to add _mmap, offset, and mode attributes. This, for example, ensures the mmap object is kept around until the array is deleted. However, I can't monkey-patch a numpy ndarray object. I presume this is because it is an interpreter-level object rather than an app-level one? Anyway -- not sure how to deal with this situation.

2) Secondly, the mmap object itself doesn't really provide a usable buffer implementation. The implementation of buffer(mmap) is currently W_MMap.descr_buffer() (found in interp_mmap.py), which returns a StringLikeBuffer object. This object (implemented in pypy/interpreter/buffer.py) is a subclass of Buffer which does not implement get_raw_address(). Our current plan clearly requires the buffer object to implement get_raw_address() so it can be used by ndarray.from_shape_and_storage(). Interestingly, it seems the interp_mmap author anticipated this shortcoming -- there is a comment, "improve to work directly on low-level address", right in the descr_buffer method.

So -- am I on the wrong path? Should I not even bother trying to use the mmap module (since I can't monkey-patch it and it doesn't do what I want)? This would mean perhaps using the underlying rffi mmap to build my own memmap module. Alternatively, can I fix the monkey-patching problem some other way, and then take the advice of interp_mmap's author to "improve to work directly on low-level address" by returning something better than a StringLikeBuffer object?

Thoughts?

Mike

On Sun, Jun 16, 2013 at 3:29 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
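As an aside on question 1, the same restriction can be reproduced with stock CPython numpy -- attribute assignment fails on a plain ndarray but works on an instance of an ndarray subclass, because only the subclass has a __dict__ (a small illustrative snippet, not from the thread):

    import numpy as np

    a = np.ndarray.__new__(np.ndarray, (3,))
    try:
        a._mmap = object()                     # a plain ndarray has no __dict__
    except AttributeError:
        print("cannot add attributes to a plain ndarray")

    class ArraySubclass(np.ndarray):
        pass

    b = np.ndarray.__new__(ArraySubclass, (3,))    # __new__ honours the subtype
    b._mmap = object()                             # works: the subclass has a __dict__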
2013/6/20 Mike Beller <mike@tradeworx.com>
Thank you both, for your encouragement.
So I have made progress. I now have some unit tests, and they can compile and run. I have imported the memmap.py from numpy, and modified it so it gets the needed items from micronumpy. I can run the unit tests and they fail for the correct reason -- that reason being the buffer attribute is not supported in interp_numarray.descr_new_array() . So that's great.
Great indeed!
Here are my next questions:
1) The way the numpy mmap module works, it calls ndarray.__new__(), and monkey-patches the return value to add _mmap, offset, and mode attributes. This, for example, ensures the mmap object is kept around until the array is deleted. However, I can't monkey-patch a numpy ndarray object. I presume this is because it is an interpreter level object rather than an app level one? Anyway -- not sure how to deal with this situation.
No, that's because ndarray.__new__(subtype, ...) returns an ndarray. This is wrong: it should return an instance of the subtype (the memmap.memmap class in this case). ndarray has no __dict__; OTOH memmap is a Python-defined class and has a __dict__ where you can store attributes. (It's also possible to have a __dict__ on ndarray, but it's not necessary here.)

In interp_numarray.py, descr_new_array() does not use w_subtype at all! This means that ndarray cannot be subclassed in Python... To make the necessary changes, you can pick one module and see how it's done there. For example, in the bz2 module, __new__ could simply have been written "return W_BZ2File(space)", but instead it handles subclasses correctly:

    def descr_bz2file__new__(space, w_subtype, __args__):
        bz2file = space.allocate_instance(W_BZ2File, w_subtype)
        W_BZ2File.__init__(bz2file, space)
        return space.wrap(bz2file)

ndarray should do the same, maybe by changing W_NDimArray.from_shape() (and friends) this way:

    @classmethod
    def from_shape(w_subtype, shape, dtype, order='C'):
        ...
        if w_subtype is None:
            return W_NDimArray(impl)
        else:
            ...allocate_instance, __init__ and wrap...

This subclassing feature should have its own unit tests, btw.
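A sketch of what such a subclassing unit test might look like at app level (the class name and the exact numpypy import are assumptions; the assertions simply pin down the behaviour described above):

    from pypy.module.micronumpy.test.test_base import BaseNumpyAppTest

    class AppTestNdarraySubclass(BaseNumpyAppTest):
        def test_new_honours_subtype(self):
            from numpypy import ndarray

            class MyArray(ndarray):
                pass

            a = ndarray.__new__(MyArray, (3,))
            assert isinstance(a, MyArray)      # __new__ must return the subtype
            a._mmap = "anything"               # subclass instances have a __dict__
            assert a._mmap == "anything"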
2) Secondly, the mmap object itself doesn't really provide a usable buffer implementation. The implementation of buffer(mmap) is currently W_MMap.descr_buffer(), (found in interp_mmap.py), which returns a StringLikeBuffer object. This object (implemented in pypy/interpreter/buffer.py) is a subclass of Buffer, which does not implement get_raw_address(). Our current plan clearly requires the buffer object to implement get_raw_address so it can be used by ndarray.from_shape_and_storage(). Interestingly, it seems as if the interp_mmap author anticipated this shortcoming -- there is a comment: "improve to work directly on low-level address" right in the descr_buffer method.
So -- am I on the wrong path? Should I not even bother trying to use the mmap? (since I can't monkey patch it and it doesn't do what I want?) This would mean perhaps using the underlying rffi mmap to build my own memmap module. Alternatively, can I fix the monkey-patching problem some other way, and then take the advice of interp_mmap's author to "improve to work directly on low-level address" by returning something better than a StringLikeBuffer object.
This was the third task I mentioned earlier. It turns out that Armin implemented it just this morning -- thanks! :-)

Mike, you are doing well. Please keep going.

-- Amaury Forgeot d'Arc
Amaury,

Sorry for the high latency. Day job intervenes. (This may take some time :-) )

I have reviewed your advice and I can certainly modify class W_NDimArray as proposed. Here is my question. It seems, from looking at other similar modules, that the convention in pypy/module is to have w_subtype as the second positional parameter to the factory method. Changing the 4 factory methods of W_NDimArray to follow this convention would involve changing the following references (found using egrep/awk/uniq, so I may not be exactly right, but you get the right order of magnitude):

     2 arrayimpl/concrete.py:
     2 arrayimpl/scalar.py:
     1 arrayimpl/sort.py:
     1 base.py:
     3 compile.py:
     5 interp_arrayops.py:
     1 interp_dtype.py:
     1 interp_flatiter.py:
    19 interp_numarray.py:
     2 interp_support.py:
     3 interp_ufuncs.py:
     2 iter.py:
     3 loop.py:
     2 test/test_numarray.py:

47 references in 14 files.

I'm a bit hesitant to change code spewed all over micronumpy unless someone with more ownership of the code tells me for sure that's the right thing to do. So I can also propose 2 alternatives:

1) Create a parallel set of methods -- from_shape_and_subtype, from_shape_and_storage_and_subtype, for example -- which take w_subtype, but this would cause method proliferation.

2) Add a named parameter to the existing factory methods, so they are like from_shape(shape, subtype=None, ...), but this would be kind of nonstandard. On a related matter, I found other modules in the code base where w_subtype is ignored (e.g. select/interp_epoll/W_EPoll, and others) -- this seems to be a common problem -- so maybe "kind of nonstandard" is not that important, or perhaps this is an area which was simply left to be cleaned up later.

Thoughts about the best approach?

Mike

On Thu, Jun 20, 2013 at 5:07 AM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
Hi Mike,

On Sun, Jun 30, 2013 at 7:15 PM, Mike Beller <mike@tradeworx.com> wrote:
is ignored (e.g. select/interp_epoll/W_EPoll, and others) -- seems a common problem
At the end of that file we have "W_Epoll.typedef.acceptable_as_base_class = False", meaning that it cannot be subclassed at app-level; so ignoring w_subtype is correct. Are there other examples which don't have "acceptable_as_base_class = False"?

A bientôt,

Armin.
2013/6/30 Armin Rigo <arigo@tunes.org>
At the end of that file we have "W_Epoll.typedef.acceptable_as_base_class = False", meaning that it cannot be subclassed as app-level; so ignoring w_subtype is correct. Are there other examples which don't have "acceptable_as_base_class = False" ?
I tried to make a list. In the "module/" subdirectory, here are the __new__ methods that don't use "allocate_instance" and don't define "acceptable_as_base_class":

    clr._CliObject_internal
    _ffi.CDLL
    _ffi.WinDLL
    _ffi.Field
    _ffi._StructDescr
    numpypy.string_
    numpypy.unicode_
    numpypy.dtype
    numpypy.ndarray
    pypyjit.Box
    _rawffi.CallbackPtr
    _rawffi.CDLL
    _winreg.HKEYType
    zipimport.zipimporter

-- Amaury Forgeot d'Arc
Hello,

2013/6/30 Mike Beller <mike@tradeworx.com>
Amaury
Sorry for the high latency. Day job intervenes. (This may take some time :-) )
No problem, we are all volunteers here. There is no pressure. The worst that can happen is someone may need the feature and implement it before you :-)
I have reviewed your advice (below) and I can certainly modify class W_NDimArray as proposed. Here is my question. It seems from looking at other similar modules, the convention in pypy/module is to have w_subtype as the second positional parameter to the factory method. Changing the 4 factory methods of W_NDimArray to follow this convention would involve changing the following references (found using egrep/awk/uniq so I may not be exactly right but you get the right order of magnitude):
47 references in 14 files.
Ouch!
I'm a bit hesitant to change code spewed all over micronumpy unless someone with more ownership over the code tells me for sure that's the right thing to do. So I can also propose 2 alternatives:
1) Create a parallel set of methods: from_shape_and_subtype, from_shape_and_storage_and_subtype, for example, which take w_subtype, but this would cause method proliferation.
2) Add a named parameter to the existing factory methods, so they are like from_shape(shape, subtype=None....), but this would be kind of nonstandard. On a related matter, I found other modules in the code base where w_subtype is ignored (e.g. select/interp_epoll/W_EPoll, and others) -- seems a common problem -- so maybe "kind of nonstandard" is not that important -- or perhaps this is an area which was simply left to be cleaned up later.
Both approaches can work IMO: for 1), these are only two new functions, and I'm sure they can share most of the code with the existing functions. Method 2) would work as well, and is probably even easier to implement. And there is no standard that forbids named parameters! Some calls to from_shape() already use them...

Do as you prefer, and what makes the code more natural to you.

Cheers,

-- Amaury Forgeot d'Arc
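For what it's worth, option 2) relies on the fact that a trailing keyword argument with a default keeps every existing call site valid. A tiny plain-Python illustration (the names are invented and this is not the actual RPython signature):

    def from_shape(shape, dtype, order='C', w_subtype=None):
        if w_subtype is None:
            return ("plain W_NDimArray", shape, dtype, order)
        return ("instance of subtype", w_subtype, shape, dtype, order)

    print(from_shape((3,), 'float64'))                      # existing callers need no change
    print(from_shape((3,), 'float64', w_subtype='memmap'))  # new callers opt in explicitly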