A numpy accumulator...
OK -- this one I'm intending to send!

Hi all,

This idea was inspired by a discussion at the SciPy conference, in which we spent a LOT of time during the numpy tutorial talking about how to accumulate values in an array when you don't know how big the array needs to be when you start.

The "standard practice" is to accumulate in a python list, then convert the final result into an array. This is a good idea because Python lists are standard, well tested, efficient, etc.

However, as was pointed out in that lengthy discussion, if what you are accumulating is a whole bunch of numbers (ints, floats, whatever), or particularly if you need to accumulate a data type that plain python doesn't support, there is a lot of overhead involved: a python float type is pretty heavyweight. If performance or memory use is important, that can create issues. You can use an array.array, but it doesn't support all numpy types, particularly custom dtypes.

I talked about this on the cython list (someone had asked how to accumulate in cython), and a few folks thought it would be useful, so I put together a prototype.

What I have in mind is very simple. It would be:
- only 1-d
- support append() and extend() methods
- support indexing and slicing
- support any valid numpy dtype -- which could even get you pseudo n-d arrays...
- maybe it would act like an array in other ways -- ufuncs, etc. I'm not so sure.

It could take the place of python lists/arrays when you really want a numpy array, but don't know how big it will be until you've filled it.

The implementation I have now uses a regular numpy array as the "buffer". The buffer is re-sized as needed with ndarray.resize(). I've enclosed the class and a bunch of tests (this is the first time I've ever really done test-driven development, though I wouldn't say that this is a complete test suite).

A few notes about this implementation:

* The name of the class could be better, and so could some of the method names.

* On further thought, I think it could handle n-d arrays, as long as you only accumulated along the first index.

* It could use a bunch more methods: deleting part of the array, math -- probably anything supported by array.array would be good.

* Robert pointed me to the array.array implementation to see how it expands the buffer as you append. It does tricks to make it grow fast while the array is very small, then eventually adds about 1/16 of the used array size to the buffer. I imagine this class would get used when you expect a big array, so I didn't bother: I start with a buffer of 128 elements, then add 1/4 each time it needs to expand -- these are both tweakable attributes.

* I'm keeping the buffer a hidden variable, and slicing and __array__ return copies -- this is so that the buffer won't acquire outside references and then refuse to be expandable.

* I did a little simple profiling, and discovered that it's slower than a python list by a factor of more than 2 (for accumulating python ints, anyway). With a bit of experimentation, I think that's because of a couple of factors:
  - an extra function call -- the append() method then needs to do an assignment into the buffer
  - object conversion -- python lists store python objects, so a python int can go right in there; with numpy, it needs to be converted to a C int first -- a bit of extra overhead. (Though a straight assignment into a pre-allocated array is faster than a list.)

I think it's still an improvement for memory use.

Maybe it would be worth writing in C or Cython to avoid some of this. In particular, it would be nice if you could use it from Cython and put C types directly into it...

* This could be pretty useful for things like genfromtxt.

What do folks think? Is this useful? What would you change, etc?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov


#!/usr/bin/env python

"""
accumulator class

Designed to be used as an expandable numpy array, to accumulate values,
rather than a python list.

Note that slices return copies, rather than views, unlike regular numpy
arrays. This is so that the buffer can be re-allocated without messing
up any views.
"""

import numpy as np

class accumulator:
    # A few parameters
    DEFAULT_BUFFER_SIZE = 128
    BUFFER_EXTEND_SIZE = 1.25 # array.array uses 1+1/16 -- that seems small to me.

    def __init__(self, object=[], dtype=None, length=None):
        """
        proper docs here
        """
        buffer = np.array(object, dtype=dtype, copy=True)
        if buffer.ndim > 1:
            raise ValueError("accumulator only works with 1-d data")
        self.length = buffer.shape[0]
        ## add the padding to the buffer
        buffer.resize( int(max(self.DEFAULT_BUFFER_SIZE,
                               buffer.shape[0]*self.BUFFER_EXTEND_SIZE)) )
        self.__buffer = buffer

    @property
    def dtype(self):
        return self.__buffer.dtype

    @property
    def buffersize(self):
        """ the size of the internal buffer """
        return self.__buffer.size

    def __len__(self):
        return self.length

    def __array__(self, dtype=None):
        """
        a.__array__(|dtype) -> copy of array.

        Always returns a copy, so that the buffer doesn't have any
        outside references to it.
        """
        return np.array(self.__buffer[:self.length], dtype=dtype, copy=True)

    def append(self, item):
        """ add a new item to the end of the array """
        try:
            self.__buffer[self.length] = item
            self.length += 1
        except IndexError: # the buffer is not big enough
            self.resize(self.length*self.BUFFER_EXTEND_SIZE)
            self.append(item)

    def extend(self, items):
        """ add a sequence of new items to the end of the array """
        try:
            self.__buffer[self.length:self.length+len(items)] = items
            self.length += len(items)
        except ValueError: # the buffer is not big enough
            self.resize((self.length+len(items))*self.BUFFER_EXTEND_SIZE)
            self.extend(items)

    def resize(self, newsize):
        """
        resize the internal buffer

        you might want to do this to speed things up if you know you
        want it to be a lot bigger eventually
        """
        if newsize < self.length:
            raise ValueError("accumulator buffer cannot be made smaller than the length of the data")
        self.__buffer.resize(int(newsize)) # int(): resize needs an integral size

    def fitbuffer(self):
        """ re-sizes the buffer so that it fits the data, rather than having extra space """
        self.__buffer.resize(self.length)

    def __getitem__(self, index):
        if index > self.length-1:
            raise IndexError("index out of bounds")
        elif index < 0:
            index += self.length # count negative indices from the end
        return self.__buffer[index]

    def __getslice__(self, i, j):
        """
        a.__getslice__(i, j) <==> a[i:j]

        Use of negative indices is not supported.

        This returns a COPY, not a view, unlike numpy arrays.
        This is required as the data buffer needs to be able to change.
        """
        if j > self.length:
            j = self.length
        return self.__buffer[i:j].copy()

    def __str__(self):
        return self.__buffer[:self.length].__str__()

    def __repr__(self):
        return "accumulator%s"%self.__buffer[:self.length].__repr__()[5:]


#!/usr/bin/env python

"""
tests for the accumulator class

designed to be run with nose
"""

import unittest

import numpy as np
from numpy.testing import assert_array_equal

from accumulator import accumulator

class test_init(unittest.TestCase):
    def test_nd(self):
        self.assertRaises(ValueError, accumulator, ((1,2),(3,4)) )

    def test_empty(self):
        a = accumulator()
        self.assertEqual( len(a), 0 )

    def test_simple(self):
        a = accumulator( (1,2,3) )
        self.assertEqual( len(a), 3 )

    def test_buffer_init(self):
        a = accumulator()
        self.assertEqual( a.buffersize, a.DEFAULT_BUFFER_SIZE )

    def test_dtype(self):
        a = accumulator(dtype=np.uint8)
        self.assertEqual( a.dtype, np.uint8 )

class test_indexing(unittest.TestCase):
    def test_simple_index(self):
        a = accumulator( (1,2,3,4,5) )
        self.assertEqual(a[1], 2)

    def test_neg_index(self):
        a = accumulator( (1,2,3,4,5) )
        self.assertEqual(a[-1], 5)

    def test_index_too_big(self):
        a = accumulator( (1,2,3,4,5) )
        # I can't figure out how to use assertRaises for a non-callable
        try:
            a[6]
        except IndexError:
            pass
        else:
            raise Exception("This test didn't raise an IndexError")

    def test_append_then_index(self):
        a = accumulator( () )
        for i in range(20):
            a.append(i)
        self.assertEqual(a[15], 15)

    def test_index_then_resize(self):
        """ this is here to see if having a view on the buffer causes problems """
        a = accumulator( (1,2,3,4,5) )
        b = a[4]
        a.resize(1000)

class test_slicing(unittest.TestCase):
    def test_simple_slice(self):
        a = accumulator( (1,2,3,4,5) )
        assert_array_equal(a[1:3], np.array([2, 3]))

    def test_too_big_slice(self):
        b = np.array( (1.0, 3, 4, 5, 6) )
        a = accumulator( b )
        assert_array_equal(a[1:10], b[1:10])

    def test_full_slice(self):
        b = np.array( (1.0, 3, 4, 5, 6) )
        a = accumulator( b )
        assert_array_equal(a[:], b[:])

    def test_slice_then_resize(self):
        """ this is here to see if having a view on the buffer causes problems """
        a = accumulator( (1,2,3,4,5) )
        b = a[2:4]
        a.resize(1000)

class test_append(unittest.TestCase):
    def test_append_length(self):
        a = accumulator( (1,2,3) )
        a.append(4)
        self.assertEqual(len(a), 4)

class test_extend(unittest.TestCase):
    def test_extend_length(self):
        a = accumulator( (1,2,3) )
        a.extend( (4, 5, 6) )
        self.assertEqual(len(a), 6)

    def test_extend_long(self):
        a = accumulator( range(100) )
        a.extend( range(100) )
        self.assertEqual(a.length, 200)

class test_resize(unittest.TestCase):
    def test_resize_longer(self):
        a = accumulator( (1,2,3) )
        a.resize(1000)
        self.assertEqual(a.buffersize, 1000)

    def test_resize_too_short(self):
        a = accumulator( (1,2,3,4,5,6,7,8) )
        self.assertRaises(ValueError, a.resize, 5)

    def test_fitbuffer(self):
        a = accumulator( (1,2,3) )
        a.fitbuffer()
        self.assertEqual(a.buffersize, 3)

class test__array__(unittest.TestCase):
    def test_asarray(self):
        a = accumulator( (1,2,3) )
        b = np.asarray(a)
        assert_array_equal(a[:], b)

    def test_asarray_dtype(self):
        a = accumulator( (1,2,3), dtype=np.uint32 )
        b = np.asarray(a, dtype=np.float)
        self.assertEqual(b.dtype, np.float)

class test_strings(unittest.TestCase):
    def test_str(self):
        a = accumulator( (1,2,3), dtype=np.float )
        self.assertEqual(str(a), '[ 1.  2.  3.]')

    def test_repr(self):
        a = accumulator( (1,2,3), dtype=np.float )
        self.assertEqual(repr(a), 'accumulator([ 1.,  2.,  3.])')

class test_dtypes(unittest.TestCase):
    def test_simple_dtype(self):
        a = accumulator( ('this', 'that', 'the', 'other'), dtype='S10' )
        self.assertEqual(a[3], 'other')

    def test_custom_dtype(self):
        dt = np.dtype([('x','f4'), ('y','f4'), ('i', 'i2'), ('name', 'S32')])
        a = accumulator( dtype=dt )
        a.append( (3.5, 6.125, 4, 'something') )
        a.append( (0.5, 7, 5, 'nothing') )
        a.append( (7.1, 6.0, 7, 'what?') )
        self.assertEqual(tuple(a[1]), (0.5, 7.0, 5, 'nothing'))


#!/usr/bin/env python

"""
some code to help profile the accumulator class

designed to be run with ipython "timeit"
"""

import numpy as np
import accumulator
reload(accumulator)
from accumulator import accumulator

import array

def list1():
    """ using a list to accumulate integers, then make an array out of it """
    l = []
    for i in xrange(1000):
        l.append(i)
    return np.array(l)

def array1():
    """ using an array.array to accumulate integers, then make an array out of it """
    l = array.array('i')
    for i in xrange(1000):
        l.append(i)
    return np.array(l)

def accum1():
    """ using an accumulator to accumulate integers, then make an array out of it """
    l = accumulator((), dtype=np.int)
    for i in xrange(1000):
        l.append(i)
    return np.array(l)

def ndarray():
    """ using an ndarray, pre-allocated """
    l = np.empty((1000,), dtype=np.int)
    for i in xrange(1000):
        l[i] = i
    return np.array(l)
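For quick reference, basic use of the prototype looks like this (a minimal sketch, assuming the accumulator module above is importable):

import numpy as np
from accumulator import accumulator

a = accumulator(dtype=np.float)
for x in (1.0, 2.0, 3.0):
    a.append(x)           # grows the hidden buffer as needed
a.extend((4.0, 5.0))

arr = np.asarray(a)       # copies the filled portion out as a plain ndarray
print arr                 # -> [ 1.  2.  3.  4.  5.]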
Christopher Barker wrote:
[...] What do folks think? Is this useful? What would you change, etc?
I'd drop the __getslice__, as it is deprecated (in Python 3 it is removed entirely). Slices will be passed as slice objects to __getitem__ if you don't provide __getslice__. One could support myaccumulator[[1,2,3]] as well in __getitem__, although I guess it gets a little hairy, as you must seek through the array-like object passed and see to it that no values are too large. -- Dag Sverre
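Something like the following sketch of a combined __getitem__ would cover scalars, slices, and sequence ("fancy") indices. It is not part of the posted prototype -- just an illustration of Dag's suggestion, written against the class above (self.__buffer, self.length):

    def __getitem__(self, index):
        """ support a[i], a[i:j:k], and a[sequence] -- slices and fancy indexing return copies """
        if isinstance(index, slice):
            # clamp the slice to the filled portion, then copy so the
            # caller never holds a view into the re-sizable buffer
            start, stop, step = index.indices(self.length)
            return self.__buffer[start:stop:step].copy()
        elif np.isscalar(index):
            if index < 0:
                index += self.length
            if not 0 <= index < self.length:
                raise IndexError("index out of bounds")
            return self.__buffer[index]
        else:
            # a sequence of indices: bounds-check against the filled
            # length, not the (larger) internal buffer
            index = np.asarray(index)
            if ((index >= self.length) | (index < -self.length)).any():
                raise IndexError("index out of bounds")
            return self.__buffer[:self.length][index].copy()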
On Saturday 03 October 2009 10:06:12, Christopher Barker wrote:
[...] The implementation I have now uses a regular numpy array as the "buffer". The buffer is re-sized as needed with ndarray.resize(). [...]
That's interesting. I'd normally use the `resize()` method for what you want, but indeed your approach is much easier to use. If you are looking for performance improvements, I'd have a look at the `PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It seems to me that the zero-initialization of added memory can be skipped, allowing for more performance from the `resize()` method (especially for large size increments). A new parameter (say, ``zero_init=True``) could be added to `resize()` to specify that you don't want the memory initialized. -- Francesc Alted
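As a rough, Python-level proxy for what skipping the zero-fill could buy, one can compare allocating with np.empty (uninitialized) against np.zeros (zero-filled). This is only suggestive -- the OS may hand back lazily-zeroed pages anyway -- but the gap approximates the memset cost a hypothetical zero_init=False would avoid:

import timeit

setup = "import numpy as np; n = 10**7"
for stmt in ("np.empty(n, dtype=np.float64)",
             "np.zeros(n, dtype=np.float64)"):
    # average seconds per allocation over 100 runs
    t = timeit.timeit(stmt, setup=setup, number=100)
    print "%-32s %.5f s" % (stmt, t/100)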
Francesc Alted wrote:
On Saturday 03 October 2009 10:06:12, Christopher Barker wrote:
[...]
That's interesting. I'd normally use the `resize()` method for what you want, but indeed your approach is much easier to use.
Of course, this is using resize() under the hood, but giving it an easier interface -- and more importantly, it's adding the pre-allocation for you, and the code to deal with that. I suppose I should benchmark it, but I think calling resize() with every append would be a lot slower (though maybe not -- might the compiler/OS be pre-allocating some extra memory anyway?). I should profile this -- if you can call resize() with every new item, and it's not too slow, then it may not be worth writing this class at all (or I could make it simpler, maybe even an ndarray subclass instead).
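A quick sketch of that profiling, alongside the list1()/accum1() functions in the profiling script above -- growing the array by exactly one element per append with ndarray.resize(). (This assumes nothing else holds a reference to the array; resize() refuses to reallocate otherwise.)

import numpy as np

def resize1(n=1000):
    """ grow an ndarray one element at a time with resize() """
    a = np.empty((0,), dtype=np.int)
    for i in xrange(n):
        a.resize((i+1,)) # may or may not reallocate, at malloc's whim
        a[i] = i
    return a

Timing resize1 against list1 and accum1 with ipython's %timeit would show whether malloc's own over-allocation makes the per-item resize() competitive.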
If you are looking for performance improvements, I'd have a look at the `PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It seems to me that the zero-initialization of added memory can be skipped, allowing for more performance from the `resize()` method (especially for large size increments).
I suppose so, but I doubt that's causing any of my performance issues. Another thing to profile.
A new parameter (say, ``zero_init=True``) could be added to `resize()` to specify that you don't want the memory initialized.
That does seem like a good idea, but maybe over my head to implement. Now I need some time to work on this some more... -Chris
2009/10/5 Christopher Barker
Francesc Alted wrote:
[...]
That's interesting. I'd normally use the `resize()` method for what you want, but indeed your approach is much easier to use.
Of course, this is using resize() under the hood, but giving it an easier interface -- and more importantly, it's adding the pre-allocation for you, and the code to deal with that. I suppose I should benchmark it, but I think calling resize() with every append would be a lot slower (though maybe not -- might the compiler/OS be pre-allocating some extra memory anyway?)
I looked into this at some point, and under Linux, malloc doesn't allocate substantial extra memory until you get big enough that it's allocating complete memory pages, at which point you get until the end of the page. At that point it's possible that adding more memory onto the end of the malloced region (and maybe even moving the array around in memory) becomes really cheap, since it's just requesting more memory from the OS.

Also, a friend who's a bare-metal programming wizard pointed out to me that modern malloc implementations really hate realloc, since it tends to put memory blocks in arenas intended for different sizes. I think that's only really an issue for shrinking blocks, since they probably just allocate a new block when growing (unless they're in the pages-at-a-time regime).

In short, I think it's better to have a python-list-like growing scheme. In fact it's maybe more important for arrays than for python lists, since in a python list all that needs to be moved are pointers to the actual python objects -- only ever a small fraction of the data volume.
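To put numbers on the tradeoff between growth factors, here is a small sketch that counts reallocations and total element copies for geometric growth (assuming, pessimistically, that every grow copies the whole used portion):

def grow_cost(n, factor, start=128):
    """ count (reallocations, elements copied) for geometric buffer growth """
    cap, grows, copied = start, 0, 0
    for length in xrange(n):  # data length before each append
        if length >= cap:     # buffer full: grow geometrically
            cap = int(cap * factor) + 1
            grows += 1
            copied += length  # a grow copies everything held so far
    return grows, copied

for factor in (1.0625, 1.25, 2.0): # array.array-ish, the prototype's default, python-list-ish
    print factor, grow_cost(10**6, factor)

The copy total scales roughly as n*factor/(factor-1): about 2n for 2x growth, 5n for 1.25x, and 17n for 1.0625x.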
I should profile this -- if you can call resize() with every new item, and it's not too slow, then it may not be worth writing this class at all (or I could make it simpler, maybe even an ndarray subclass instead).
Keep in mind the need for sensible handling of slices, since the underlying array will probably move on every resize. I think there's a need for this code.
If you are looking for performance improvements, I'd have a look at the `PyArray_Resize()` function in 'core/src/multiarray/shape.c' (trunk). It seems to me that the zero-initialization of added memory can be skipped, allowing for more performance from the `resize()` method (especially for large size increments).
I suppose so, but I doubt that's causing any of my performance issues. Another thing to profile.
Probably worth profiling, yes - I wouldn't worry about the time taken writing zeros, but that does mean you have to touch all the allocated memory, which can't be too great for the cache. Anne
Christopher Barker wrote:
What do folks think? Is this useful? What would you change, etc?
Chris - I really like this and find it useful. I would change the name to something like "growable" or "ArrayList" - accumulator seems like an object for cumulative summation. I think the right amount to grow is 2x - this gives amortized O(1) appends, with only O(log n) reallocations. If the array doesn't have to grow, the cost of an append is 1 - no copies - whereas if it does have to grow, the cost is n copies. Is 2x optimal? Perhaps the configurable grow ratio is a good thing, although giving a knob means people are going to set it wrong. I would also vote "+1" for an ND version of this (growing only a single dimension). Keeping 2x for each of n dimensions, while conceivable, would be 2**n extra memory, and hence probably too costly. Cheers, Tom K.
Tom K. wrote:
Chris - I really like this and find it useful. I would change the name to something like "growable" or "ArrayList"
hmm. I think I like "growable" or maybe "growarray".
I think the right amount to grow is 2x -
I think that may be too much. One of the key advantages of this over python lists is that there should be a memory-use advantage -- when you are pushing memory bounds, using twice what you need is a bit much.
Perhaps the configurable grow ratio is a good thing, although giving a knob means people are going to set it wrong.
maybe, but most folk will use the default anyway. I'm certainly going to keep it configurable while under development -- the better to benchmark with.
I would also vote "+1" for an ND version of this (growing only a single dimension).
Yes, I think that is a good idea, and would certainly be useful for a common case -- growing a table of data, perhaps when reading a file, etc.
Keeping 2x for each of n dimensions, while conceivable, would be 2**n extra memory, and hence probably too costly.
That, and the fact that you'd have to move a bunch of memory around as it grew -- if you only grow the first dimension (for C order, anyway), you can just tack stuff onto the end (which usually necessitates a copy anyway, but it still seems easier). Thanks for the feedback, -Chris
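For what it's worth, here is a minimal sketch of that n-d variant -- a hypothetical growable2d (the name and methods are made up for illustration, not part of the posted prototype) that pre-allocates rows and grows only along the first axis, which for C order just extends the flat buffer:

import numpy as np

class growable2d:
    """ accumulate rows into a 2-d array, growing only along axis 0 """
    def __init__(self, ncols, dtype=np.float, factor=1.25, start=128):
        self.__buffer = np.empty((start, ncols), dtype=dtype)
        self.length = 0
        self.factor = factor

    def appendrow(self, row):
        if self.length >= self.__buffer.shape[0]:
            # grow the first dimension only -- same trick as the 1-d class
            newrows = int(self.__buffer.shape[0] * self.factor) + 1
            self.__buffer.resize((newrows, self.__buffer.shape[1]))
        self.__buffer[self.length] = row
        self.length += 1

    def __array__(self):
        # copy out the filled rows, so no views pin the buffer
        return self.__buffer[:self.length].copy()

# e.g. accumulating a table of (x, y, z) rows while reading a file:
g = growable2d(3)
for i in xrange(1000):
    g.appendrow((i, 2.0*i, 3.0*i))
table = np.asarray(g)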
participants (5)
- Anne Archibald
- Christopher Barker
- Dag Sverre Seljebotn
- Francesc Alted
- Tom K.