(I clicked send too early the last time -- sorry about that!) Hi all, This idea was inspired by a discussion at the SciPy conference, in which we spent a LOT of time during the numpy tutorial talking about how to accumulate values in an array when you don't know how big the array needs to be when you start. The "standard practice" is to accumulate in a python list, then convert the final result into an array. This is a good idea because Python lists are standard, well tested, efficient, etc. However, as was pointed out in that lengthy discussion, if what you are doing is accumulating is a whole bunch of numbers (ints, floats, whatever), or particularly if you need to accumulate a data type that plain python doesn't support, there is a lot of overhead involved: a python float type is pretty heavyweight. If performance or memory use is important, it might create issues. You can use and array.array, but it doesn't support all numpy types, particularly custom dtypes. I talked about this on the cython list (as someone asked how to do accumulate in cython), and a few folks thought it would be useful, so I put together a prototype. What I have in mind is very simple. It would be: - Only 1-d - Support append() and extend() methods - support indexing and slicing - Support any valid numpy dtype - which could even get you pseudo n-d arrays... - maybe it would act like an array in other ways, I'm not so sure. - ufuncs, etc. It could take the place of using python lists/arrays when you really want a numpy array, but don't know how big it will be until you've filled it. The implementation I have now uses a regular numpy array as the "buffer". The buffer is re-sized as needed with ndarray.resize(). I've enclosed the class, a bunch of tests (This is the first time I've ever really done test-driven development, though I wouldn't say that this is a complete test suite). A few notes about this implementation: * the name of the class could be better, and so could some of the method names. * on further thought, I think it could handle n-d arrays, as long as you only accumulated along the first index. * It could use a bunch more methods - deleting part of the array - math - probably anything supported by array.array would be good. * Robert pointed me to the array.array implimentation to see how it expands the buffer as you append. It did tricks to get it to grow fast when the array is very small, then eventually to add about 1/16 of the used array size to the buffer. I imagine that this would gets used because you were likely to have a big array, so I didn't bother and start with a buffer at 128 elements, then add 1/4 each time you need to expand -- these are both tweakable attributes. * I'm keeping the buffer a hidden variable, and slicing and __array__ return copies - this is so that it won't get multiple references, and then not be expandable. * I did a little simple profiling, and discovered that it's slower than a python list by a factor of more than 2 (for accumulating python ints, anyway). With a bit of experimentation, I think that's because of a couple factors: - an extra function call -- the append() method needs to then do an assignment to the buffer - Object conversion -- python lists store python objects, so the python int can just go right in there. with numpy, it needs to be converted to a C int first -- a bit if extra overhead. Though a straight assignment into a pre-allocated array i faster than a list. I think it's still an improvement for memory use. Maybe it would be worth writing in C or Cython to avoid some of this. In particular, it would be nice if you could use it in Cython, and put C types directly it... * This could be pretty useful for things like genfromtxt. What do folks think? is this useful? What would you change, etc? -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
participants (1)
-
Christopher Barker