A numpy accumulator...
OK -- this one I'm intending to send!

Hi all,

This idea was inspired by a discussion at the SciPy conference, in which we spent a LOT of time during the numpy tutorial talking about how to accumulate values in an array when you don't know how big the array needs to be when you start.

The "standard practice" is to accumulate in a python list, then convert the final result into an array. This is a good idea because Python lists are standard, well tested, efficient, etc.

However, as was pointed out in that lengthy discussion, if what you are accumulating is a whole bunch of numbers (ints, floats, whatever), or particularly if you need to accumulate a data type that plain python doesn't support, there is a lot of overhead involved: a python float type is pretty heavyweight. If performance or memory use is important, that can create issues. You can use an array.array, but it doesn't support all numpy types, particularly custom dtypes.

I talked about this on the cython list (someone had asked how to accumulate in cython), and a few folks thought it would be useful, so I put together a prototype.

What I have in mind is very simple. It would be:
- only 1-d
- support append() and extend() methods
- support indexing and slicing
- support any valid numpy dtype -- which could even get you pseudo n-d arrays...
- maybe it would act like an array in other ways -- ufuncs, etc. I'm not so sure.

It could take the place of python lists/arrays when you really want a numpy array, but don't know how big it will be until you've filled it.

The implementation I have now uses a regular numpy array as the "buffer". The buffer is re-sized as needed with ndarray.resize(). I've enclosed the class and a bunch of tests (this is the first time I've ever really done test-driven development, though I wouldn't say that this is a complete test suite).

A few notes about this implementation:

* The name of the class could be better, and so could some of the method names.

* On further thought, I think it could handle n-d arrays, as long as you only accumulated along the first index.

* It could use a bunch more methods: deleting part of the array, math -- probably anything supported by array.array would be good.

* Robert pointed me to the array.array implementation to see how it expands the buffer as you append. It does tricks to make it grow fast while the array is very small, then eventually adds about 1/16 of the used array size to the buffer. I imagine this class would get used when you expect a big array, so I didn't bother: I start with a buffer of 128 elements, then add 1/4 each time it needs to expand -- these are both tweakable attributes.

* I'm keeping the buffer a hidden variable, and slicing and __array__ return copies -- this is so that the buffer won't acquire outside references and then refuse to be expandable.

* I did a little simple profiling, and discovered that it's slower than a python list by a factor of more than 2 (for accumulating python ints, anyway). With a bit of experimentation, I think that's because of a couple of factors:
  - an extra function call -- the append() method then needs to do an assignment into the buffer
  - object conversion -- python lists store python objects, so a python int can go right in there; with numpy, it needs to be converted to a C int first -- a bit of extra overhead. (Though a straight assignment into a pre-allocated array is faster than a list.)

I think it's still an improvement for memory use.

Maybe it would be worth writing in C or Cython to avoid some of this. In particular, it would be nice if you could use it from Cython and put C types directly into it...

* This could be pretty useful for things like genfromtxt.

What do folks think? Is this useful? What would you change, etc?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov


#!/usr/bin/env python

"""
accumulator class

Designed to be used as an expandable numpy array, to accumulate values,
rather than a python list.

Note that slices return copies, rather than views, unlike regular numpy
arrays. This is so that the buffer can be re-allocated without messing
up any views.
"""

import numpy as np

class accumulator:
    # A few parameters
    DEFAULT_BUFFER_SIZE = 128
    BUFFER_EXTEND_SIZE = 1.25 # array.array uses 1+1/16 -- that seems small to me.

    def __init__(self, object=[], dtype=None, length=None):
        """
        proper docs here
        """
        buffer = np.array(object, dtype=dtype, copy=True)
        if buffer.ndim > 1:
            raise ValueError("accumulator only works with 1-d data")
        self.length = buffer.shape[0]
        ## add the padding to the buffer
        buffer.resize( int(max(self.DEFAULT_BUFFER_SIZE,
                               buffer.shape[0]*self.BUFFER_EXTEND_SIZE)) )
        self.__buffer = buffer

    @property
    def dtype(self):
        return self.__buffer.dtype

    @property
    def buffersize(self):
        """ the size of the internal buffer """
        return self.__buffer.size

    def __len__(self):
        return self.length

    def __array__(self, dtype=None):
        """
        a.__array__(|dtype) -> copy of array.

        Always returns a copy, so that the buffer doesn't have any
        outside references to it.
        """
        return np.array(self.__buffer[:self.length], dtype=dtype, copy=True)

    def append(self, item):
        """ add a new item to the end of the array """
        try:
            self.__buffer[self.length] = item
            self.length += 1
        except IndexError: # the buffer is not big enough
            self.resize(self.length*self.BUFFER_EXTEND_SIZE)
            self.append(item)

    def extend(self, items):
        """ add a sequence of new items to the end of the array """
        try:
            self.__buffer[self.length:self.length+len(items)] = items
            self.length += len(items)
        except ValueError: # the buffer is not big enough
            self.resize((self.length+len(items))*self.BUFFER_EXTEND_SIZE)
            self.extend(items)

    def resize(self, newsize):
        """
        resize the internal buffer

        you might want to do this to speed things up if you know you
        want it to be a lot bigger eventually
        """
        if newsize < self.length:
            raise ValueError("accumulator buffer cannot be made smaller than the length of the data")
        self.__buffer.resize(int(newsize)) # int(): resize needs an integral size

    def fitbuffer(self):
        """ re-sizes the buffer so that it fits the data, rather than having extra space """
        self.__buffer.resize(self.length)

    def __getitem__(self, index):
        if index > self.length-1:
            raise IndexError("index out of bounds")
        elif index < 0:
            index += self.length # count negative indices from the end
        return self.__buffer[index]

    def __getslice__(self, i, j):
        """
        a.__getslice__(i, j) <==> a[i:j]

        Use of negative indices is not supported.

        This returns a COPY, not a view, unlike numpy arrays.
        This is required as the data buffer needs to be able to change.
        """
        if j > self.length:
            j = self.length
        return self.__buffer[i:j].copy()

    def __str__(self):
        return self.__buffer[:self.length].__str__()

    def __repr__(self):
        return "accumulator%s"%self.__buffer[:self.length].__repr__()[5:]


#!/usr/bin/env python

"""
tests for the accumulator class

designed to be run with nose
"""

import unittest

import numpy as np
from numpy.testing import assert_array_equal

from accumulator import accumulator

class test_init(unittest.TestCase):
    def test_nd(self):
        self.assertRaises(ValueError, accumulator, ((1,2),(3,4)) )

    def test_empty(self):
        a = accumulator()
        self.assertEqual( len(a), 0 )

    def test_simple(self):
        a = accumulator( (1,2,3) )
        self.assertEqual( len(a), 3 )

    def test_buffer_init(self):
        a = accumulator()
        self.assertEqual( a.buffersize, a.DEFAULT_BUFFER_SIZE )

    def test_dtype(self):
        a = accumulator(dtype=np.uint8)
        self.assertEqual( a.dtype, np.uint8 )

class test_indexing(unittest.TestCase):
    def test_simple_index(self):
        a = accumulator( (1,2,3,4,5) )
        self.assertEqual(a[1], 2)

    def test_neg_index(self):
        a = accumulator( (1,2,3,4,5) )
        self.assertEqual(a[-1], 5)

    def test_index_too_big(self):
        a = accumulator( (1,2,3,4,5) )
        # I can't figure out how to use assertRaises for a non-callable
        try:
            a[6]
        except IndexError:
            pass
        else:
            raise Exception("This test didn't raise an IndexError")

    def test_append_then_index(self):
        a = accumulator( () )
        for i in range(20):
            a.append(i)
        self.assertEqual(a[15], 15)

    def test_index_then_resize(self):
        """ this is here to see if having a view on the buffer causes problems """
        a = accumulator( (1,2,3,4,5) )
        b = a[4]
        a.resize(1000)

class test_slicing(unittest.TestCase):
    def test_simple_slice(self):
        a = accumulator( (1,2,3,4,5) )
        assert_array_equal(a[1:3], np.array([2, 3]))

    def test_too_big_slice(self):
        b = np.array( (1.0, 3, 4, 5, 6) )
        a = accumulator( b )
        assert_array_equal(a[1:10], b[1:10])

    def test_full_slice(self):
        b = np.array( (1.0, 3, 4, 5, 6) )
        a = accumulator( b )
        assert_array_equal(a[:], b[:])

    def test_slice_then_resize(self):
        """ this is here to see if having a view on the buffer causes problems """
        a = accumulator( (1,2,3,4,5) )
        b = a[2:4]
        a.resize(1000)

class test_append(unittest.TestCase):
    def test_append_length(self):
        a = accumulator( (1,2,3) )
        a.append(4)
        self.assertEqual(len(a), 4)

class test_extend(unittest.TestCase):
    def test_extend_length(self):
        a = accumulator( (1,2,3) )
        a.extend( (4, 5, 6) )
        self.assertEqual(len(a), 6)

    def test_extend_long(self):
        a = accumulator( range(100) )
        a.extend( range(100) )
        self.assertEqual(a.length, 200)

class test_resize(unittest.TestCase):
    def test_resize_longer(self):
        a = accumulator( (1,2,3) )
        a.resize(1000)
        self.assertEqual(a.buffersize, 1000)

    def test_resize_too_short(self):
        a = accumulator( (1,2,3,4,5,6,7,8) )
        self.assertRaises(ValueError, a.resize, 5)

    def test_fitbuffer(self):
        a = accumulator( (1,2,3) )
        a.fitbuffer()
        self.assertEqual(a.buffersize, 3)

class test__array__(unittest.TestCase):
    def test_asarray(self):
        a = accumulator( (1,2,3) )
        b = np.asarray(a)
        assert_array_equal(a[:], b)

    def test_asarray_dtype(self):
        a = accumulator( (1,2,3), dtype=np.uint32 )
        b = np.asarray(a, dtype=np.float)
        self.assertEqual(b.dtype, np.float)

class test_strings(unittest.TestCase):
    def test_str(self):
        a = accumulator( (1,2,3), dtype=np.float )
        self.assertEqual(str(a), '[ 1.  2.  3.]')

    def test_repr(self):
        a = accumulator( (1,2,3), dtype=np.float )
        self.assertEqual(repr(a), 'accumulator([ 1.,  2.,  3.])')

class test_dtypes(unittest.TestCase):
    def test_simple_dtype(self):
        a = accumulator( ('this', 'that', 'the', 'other'), dtype='S10' )
        self.assertEqual(a[3], 'other')

    def test_custom_dtype(self):
        dt = np.dtype([('x','f4'), ('y','f4'), ('i', 'i2'), ('name', 'S32')])
        a = accumulator( dtype=dt )
        a.append( (3.5, 6.125, 4, 'something') )
        a.append( (0.5, 7, 5, 'nothing') )
        a.append( (7.1, 6.0, 7, 'what?') )
        self.assertEqual(tuple(a[1]), (0.5, 7.0, 5, 'nothing'))


#!/usr/bin/env python

"""
some code to help profile the accumulator class

designed to be run with ipython "timeit"
"""

import numpy as np
import accumulator
reload(accumulator)
from accumulator import accumulator

import array

def list1():
    """ using a list to accumulate integers, then make an array out of it """
    l = []
    for i in xrange(1000):
        l.append(i)
    return np.array(l)

def array1():
    """ using an array.array to accumulate integers, then make an array out of it """
    l = array.array('i')
    for i in xrange(1000):
        l.append(i)
    return np.array(l)

def accum1():
    """ using an accumulator to accumulate integers, then make an array out of it """
    l = accumulator((), dtype=np.int)
    for i in xrange(1000):
        l.append(i)
    return np.array(l)

def ndarray():
    """ using an ndarray, pre-allocated """
    l = np.empty((1000,), dtype=np.int)
    for i in xrange(1000):
        l[i] = i
    return np.array(l)
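For quick reference, basic use of the prototype looks like this (a minimal sketch, assuming the accumulator module above is importable):

import numpy as np
from accumulator import accumulator

a = accumulator(dtype=np.float)
for x in (1.0, 2.0, 3.0):
    a.append(x)           # grows the hidden buffer as needed
a.extend((4.0, 5.0))

arr = np.asarray(a)       # copies the filled portion out as a plain ndarray
print arr                 # -> [ 1.  2.  3.  4.  5.]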
Christopher Barker wrote:
[...] What do folks think? Is this useful? What would you change, etc?
I'd drop the __getslice__, as it is deprecated (in Python 3 it is removed entirely). Slices will be passed as slice objects to __getitem__ if you don't provide __getslice__. One could support myaccumulator[[1,2,3]] as well in __getitem__, although I guess it gets a little hairy, as you must seek through the array-like object passed and see to it that no values are too large. -- Dag Sverre
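Something like the following sketch of a combined __getitem__ would cover scalars, slices, and sequence ("fancy") indices. It is not part of the posted prototype -- just an illustration of Dag's suggestion, written against the class above (self.__buffer, self.length):

    def __getitem__(self, index):
        """ support a[i], a[i:j:k], and a[sequence] -- slices and fancy indexing return copies """
        if isinstance(index, slice):
            # clamp the slice to the filled portion, then copy so the
            # caller never holds a view into the re-sizable buffer
            start, stop, step = index.indices(self.length)
            return self.__buffer[start:stop:step].copy()
        elif np.isscalar(index):
            if index < 0:
                index += self.length
            if not 0 <= index < self.length:
                raise IndexError("index out of bounds")
            return self.__buffer[index]
        else:
            # a sequence of indices: bounds-check against the filled
            # length, not the (larger) internal buffer
            index = np.asarray(index)
            if ((index >= self.length) | (index < -self.length)).any():
                raise IndexError("index out of bounds")
            return self.__buffer[:self.length][index].copy()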
On Saturday 03 October 2009 10:06:12, Christopher Barker wrote:
[...] The implementation I have now uses a regular numpy array as the "buffer". The buffer is re-sized as needed with ndarray.resize(). [...]
That's interesting. I'd normally use the `resize()` method for what you want, but indeed your approach is much easier to use. If you are looking for performance improvements, I'd have a look at the `PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It seems to me that the zero-initialization of added memory can be skipped, allowing for more performance from the `resize()` method (especially for large size increments). A new parameter (say, ``zero_init=True``) could be added to `resize()` to specify that you don't want the memory initialized. -- Francesc Alted
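As a rough, Python-level proxy for what skipping the zero-fill could buy, one can compare allocating with np.empty (uninitialized) against np.zeros (zero-filled). This is only suggestive -- the OS may hand back lazily-zeroed pages anyway -- but the gap approximates the memset cost a hypothetical zero_init=False would avoid:

import timeit

setup = "import numpy as np; n = 10**7"
for stmt in ("np.empty(n, dtype=np.float64)",
             "np.zeros(n, dtype=np.float64)"):
    # average seconds per allocation over 100 runs
    t = timeit.timeit(stmt, setup=setup, number=100)
    print "%-32s %.5f s" % (stmt, t/100)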
Francesc Alted wrote:
On Saturday 03 October 2009 10:06:12, Christopher Barker wrote:
[...]
That's interesting. I'd normally use the `resize()` method for what you want, but indeed your approach is much easier to use.
Of course, this is using resize() under the hood, but giving it an easier interface -- and more importantly, it's adding the pre-allocation for you, and the code to deal with that. I suppose I should benchmark it, but I think calling resize() with every append would be a lot slower (though maybe not -- might the compiler/OS be pre-allocating some extra memory anyway?). I should profile this -- if you can call resize() with every new item, and it's not too slow, then it may not be worth writing this class at all (or I could make it simpler, maybe even an ndarray subclass instead).
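A quick sketch of that profiling, alongside the list1()/accum1() functions in the profiling script above -- growing the array by exactly one element per append with ndarray.resize(). (This assumes nothing else holds a reference to the array; resize() refuses to reallocate otherwise.)

import numpy as np

def resize1(n=1000):
    """ grow an ndarray one element at a time with resize() """
    a = np.empty((0,), dtype=np.int)
    for i in xrange(n):
        a.resize((i+1,)) # may or may not reallocate, at malloc's whim
        a[i] = i
    return a

Timing resize1 against list1 and accum1 with ipython's %timeit would show whether malloc's own over-allocation makes the per-item resize() competitive.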
If you are looking for performance improvements, I'd have a look at the `PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It seems to me that the zero-initialization of added memory can be skipped, allowing for more performance from the `resize()` method (especially for large size increments).
I suppose so, but I doubt that's causing any of my performance issues. Another thing to profile.
A new parameter (say, ``zero_init=True``) could be added to `resize()` to specify that you don't want the memory initialized.
That does seem like a good idea, but maybe over my head to implement. Now I need some time to work on this some more... -Chris
2009/10/5 Christopher Barker
Francesc Alted wrote:
[...]
That's interesting. I'd normally use the `resize()` method for what you want, but indeed your approach is much easier to use.
Of course, this is using resize() under the hood, but giving it an easier interface -- and more importantly, it's adding the pre-allocation for you, and the code to deal with that. I suppose I should benchmark it, but I think calling resize() with every append would be a lot slower (though maybe not -- might the compiler/OS be pre-allocating some extra memory anyway?)
I looked into this at some point, and under Linux, malloc doesn't allocate substantial extra memory until you get big enough that it's allocating complete memory pages, at which point you get until the end of the page. At that point it's possible that adding more memory onto the end of the malloced region (and maybe even moving the array around in memory) becomes really cheap, since it's just requesting more memory from the OS.

Also, a friend who's a bare-metal programming wizard pointed out to me that modern malloc implementations really hate realloc, since it tends to put memory blocks in arenas intended for different sizes. I think that's only really an issue for shrinking blocks, since they probably just allocate a new block when growing (unless they're in the pages-at-a-time regime).

In short, I think it's better to have a python-list-like growing scheme. In fact it's maybe more important for arrays than for python lists, since in a python list all that needs to be moved are pointers to the actual python objects -- only ever a small fraction of the data volume.
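To put numbers on the tradeoff between growth factors, here is a small sketch that counts reallocations and total element copies for geometric growth (assuming, pessimistically, that every grow copies the whole used portion):

def grow_cost(n, factor, start=128):
    """ count (reallocations, elements copied) for geometric buffer growth """
    cap, grows, copied = start, 0, 0
    for length in xrange(n):  # data length before each append
        if length >= cap:     # buffer full: grow geometrically
            cap = int(cap * factor) + 1
            grows += 1
            copied += length  # a grow copies everything held so far
    return grows, copied

for factor in (1.0625, 1.25, 2.0): # array.array-ish, the prototype's default, python-list-ish
    print factor, grow_cost(10**6, factor)

The copy total scales roughly as n*factor/(factor-1): about 2n for 2x growth, 5n for 1.25x, and 17n for 1.0625x.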
I should profile this -- if you can call resize() with every new item, and it's not too slow, then it may not be worth writing this class at all (or I could make it simpler, maybe even an ndarray subclass instead).
Keep in mind the need for sensible handling of slices, since the underlying array will probably move on every resize. I think there's a need for this code.
If you are looking for performance improvements, I'd have a look at the `PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It seems to me that the zero-initialization of added memory can be skipped, allowing for more performance from the `resize()` method (especially for large size increments).
I suppose so, but I doubt that's causing any of my performance issues. Another thing to profile.
Probably worth profiling, yes - I wouldn't worry about the time taken writing zeros, but that does mean you have to touch all the allocated memory, which can't be too great for the cache. Anne
Christopher Barker wrote:
What do folks think? Is this useful? What would you change, etc?
Chris - I really like this and find it useful. I would change the name to something like "growable" or "ArrayList" - accumulator seems like an object for cumulative summation. I think the right amount to grow is 2x - this gives amortized O(1) appends, with only O(log n) reallocations. If the array doesn't have to grow, the cost of an append is 1 - no copies - whereas if it does have to grow, the cost is n copies. Is 2x optimal? Perhaps the configurable grow ratio is a good thing, although giving a knob means people are going to set it wrong. I would also vote "+1" for an ND version of this (growing only a single dimension). Keeping 2x for each of n dimensions, while conceivable, would be 2**n extra memory, and hence probably too costly. Cheers, Tom K.
Tom K. wrote:
Chris - I really like this and find it useful. I would change the name to something like "growable" or "ArrayList"
hmm. I think I like "growable" or maybe "growarray".
I think the right amount to grow is 2x -
I think that may be too much. One of the key advantages of this over python lists is that there should be a memory-use advantage -- when you are pushing memory bounds, using twice what you need is a bit much.
Perhaps the configurable grow ratio is a good thing, although giving a knob means people are going to set it wrong.
maybe, but most folk will use the default anyway. I'm certainly going to keep it configurable while under development -- the better to benchmark with.
I would also vote "+1" for an ND version of this (growing only a single dimension).
Yes, I think that is a good idea, and would certainly be useful for a common case -- growing a table of data, perhaps when reading a file, etc.
Keeping 2x for each of n dimensions, while conceivable, would be 2**n extra memory, and hence probably too costly.
That, and the fact that you'd have to move a bunch of memory around as it grew -- if you only grow the first dimension (for C order, anyway), you can just tack stuff onto the end (which usually necessitates a copy anyway, but it still seems easier). Thanks for the feedback, -Chris
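For what it's worth, here is a minimal sketch of that n-d variant -- a hypothetical growable2d (the name and methods are made up for illustration, not part of the posted prototype) that pre-allocates rows and grows only along the first axis, which for C order just extends the flat buffer:

import numpy as np

class growable2d:
    """ accumulate rows into a 2-d array, growing only along axis 0 """
    def __init__(self, ncols, dtype=np.float, factor=1.25, start=128):
        self.__buffer = np.empty((start, ncols), dtype=dtype)
        self.length = 0
        self.factor = factor

    def appendrow(self, row):
        if self.length >= self.__buffer.shape[0]:
            # grow the first dimension only -- same trick as the 1-d class
            newrows = int(self.__buffer.shape[0] * self.factor) + 1
            self.__buffer.resize((newrows, self.__buffer.shape[1]))
        self.__buffer[self.length] = row
        self.length += 1

    def __array__(self):
        # copy out the filled rows, so no views pin the buffer
        return self.__buffer[:self.length].copy()

# e.g. accumulating a table of (x, y, z) rows while reading a file:
g = growable2d(3)
for i in xrange(1000):
    g.appendrow((i, 2.0*i, 3.0*i))
table = np.asarray(g)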
participants (5)
- Anne Archibald
- Christopher Barker
- Dag Sverre Seljebotn
- Francesc Alted
- Tom K.