[Numpy-discussion] vectorizing

Fri Jun 5 21:09:14 EDT 2009

On Fri, Jun 5, 2009 at 5:59 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> On Fri, Jun 5, 2009 at 2:01 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
>> On Fri, Jun 5, 2009 at 1:22 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>>> On Fri, Jun 5, 2009 at 1:05 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
>>>> On Fri, Jun 5, 2009 at 1:01 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>>> On Fri, Jun 5, 2009 at 12:53 PM,  <josef.pktd at gmail.com> wrote:
>>>>>> On Fri, Jun 5, 2009 at 2:07 PM, Brian Blais <bblais at bryant.edu> wrote:
>>>>>>> Hello,
>>>>>>> I have a vectorizing problem that I don't see an obvious way to solve.  What
>>>>>>> I have is a vector like:
>>>>>>> obs=array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>>> and a matrix
>>>>>>> T=zeros((6,6))
>>>>>>> and what I want in T is a count of all of the transitions in obs, e.g.
>>>>>>> T[1,2]=3 because the sequence 1-2 happens 3 times,  T[3,4]=1 because the
>>>>>>> sequence 3-4 only happens once, etc...  I can do it unvectorized like:
>>>>>>> for o1,o2 in zip(obs[:-1],obs[1:]):
>>>>>>>     T[o1,o2]+=1
>>>>>>>
>>>>>>> which gives the correct answer from above, which is:
>>>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>>>        [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>>>        [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>>>        [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>>>        [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>>
>>>>>>>
>>>>>>> but I thought there would be a better way.  I tried:
>>>>>>> o1=obs[:-1]
>>>>>>> o2=obs[1:]
>>>>>>> T[o1,o2]+=1
>>>>>>> but this doesn't give a count, it just yields 1's at the transition points,
>>>>>>> like:
>>>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>>>        [ 0.,  0.,  1.,  0.,  0.,  1.],
>>>>>>>        [ 0.,  1.,  0.,  1.,  0.,  0.],
>>>>>>>        [ 0.,  0.,  1.,  0.,  1.,  0.],
>>>>>>>        [ 0.,  0.,  0.,  1.,  0.,  0.],
>>>>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>>
>>>>>>> Is there a clever way to do this?  I could write a quick Cython solution,
>>>>>>> but I wanted to keep this as an all-numpy implementation if I can.
>>>>>>>
>>>>>>
>>>>>> histogram2d or its imitation, there was a discussion on histogram2d a
>>>>>> short time ago
>>>>>>
>>>>>>>>> obs=np.array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>>>>> obs2 = obs - 1
>>>>>>>>> trans = np.hstack((0,np.bincount(obs2[:-1]*6+6+obs2[1:]),0)).reshape(6,6)
>>>>>>>>> re = np.array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>> ...         [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>> ...         [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>> ...         [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>> ...         [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>> ...         [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>>>> np.all(re == trans)
>>>>>> True
>>>>>>
>>>>>>>>> trans
>>>>>> array([[0, 0, 0, 0, 0, 0],
>>>>>>       [0, 0, 3, 0, 0, 1],
>>>>>>       [0, 3, 0, 1, 0, 0],
>>>>>>       [0, 0, 2, 0, 1, 0],
>>>>>>       [0, 0, 0, 2, 0, 0],
>>>>>>       [0, 0, 0, 0, 1, 0]])
>>>>>>
>>>>>>
>>>>>> or
>>>>>>
>>>>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>>>>> re
>>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>>>> h
>>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>
>>>>>>>>> np.all(re == h)
>>>>>> True
>>>>>
>>>>> There's no way my list method can beat that. But by adding
>>>>>
>>>>> import psyco
>>>>> psyco.full()
>>>>>
>>>>> I get a total speed up of a factor of 15 when obs is length 10000.
>>>>
>>>> Actually, it is faster:
>>>>
>>>> histogram:
>>>>
>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>> timeit h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>> 100 loops, best of 3: 4.14 ms per loop
>>>>
>>>> lists:
>>>>
>>>>>> timeit test(obs3, T3)
>>>> 1000 loops, best of 3: 1.32 ms per loop
>>>> _______________________________________________
>>>> Numpy-discussion mailing list
>>>> Numpy-discussion at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>
>>>
>>>
>>> here's a go:
>>>
>>> import numpy as np
>>> import random
>>> from itertools import groupby
>>>
>>> def test1(obs, T):
>>>   for o1,o2 in zip(obs[:-1],obs[1:]):
>>>       T[o1][o2] += 1
>>>   return T
>>>
>>>
>>> def test2(obs, T):
>>>    s = zip(obs[:-1], obs[1:])
>>>    for idx, g in groupby(sorted(s)):
>>>        T[idx] = len(list(g))
>>>    return T
>>>
>>> obs = [random.randint(0, 5) for z in range(10000)]
>>>
>>> print test2(obs, np.zeros((6, 6)))
>>> print test1(obs, np.zeros((6, 6)))
>>>
>>>
>>> ##############
>>>
>>> In [10]: timeit test1(obs, np.zeros((6, 6)))
>>> 100 loops, best of 3: 18.8 ms per loop
>>>
>>> In [11]: timeit test2(obs, np.zeros((6, 6)))
>>> 100 loops, best of 3: 6.91 ms per loop
>>
>> Wait, you tested the list method with an array. Try
>>
>> timeit test1(obs, np.zeros((6, 6)).tolist())
>>
>> Probably best to move the array/list creation out of the timeit loop.
>> Then my method won't have to pay the cost of converting to a list :)
>>
>
> ah right, your test1 is faster than test2 that way.
> i'm just stoked to find out about ndimage.historgram.
>
> using this:
>>>> histogram((obs[:-1])*6 +obs[1:], 0, 36, 36).reshape(6,6)
>
> it gives the exact result as test1/test2.
>
> -b
>

the cleaned up bincount version looks competitive, this time with
tests to avoid index errors

Josef

import numpy as np
from scipy import ndimage

def countndi(obs):
    return ndimage.histogram((obs[:-1])*6 +obs[1:], 0, 36, 36 ).reshape(6,6)

def countbc(obs):
    tr = np.zeros(len(obs)+1,int)
    tr[2:] = (obs[:-1])*6 + obs[1:]
    tr[1] = 35
    trans = np.bincount(tr)
    trans[0] -= 1
    trans[-1] -= 1
    return trans.reshape(6,6)


obs=np.array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])

re = np.array([[ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  3.,  0.,  0.,  1.],
        [ 0.,  3.,  0.,  1.,  0.,  0.],
        [ 0.,  0.,  2.,  0.,  1.,  0.],
        [ 0.,  0.,  0.,  2.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.,  0.]])

trans = countndi(obs)
print np.all(re == trans)
trans = countbc(obs)
print np.all(re == trans)

obs2=np.array([1,1,3,4,3,2,1,2,1,2,1,5,4,3,2])
transnd = countndi(obs2)
transbc = countbc(obs2)
print np.all(transnd == transbc)

obs3=np.array([1,2,3,4,3,2,1,2,1,2,1,5,5,3,2])
transnd = countndi(obs3)
transbc = countbc(obs3)
print np.all(transnd == transbc)


import time

nit = 20000
t = time.time()
for i in xrange(nit):
    countndi(obs)
print time.time() - t

t2 = time.time()
for i in xrange(nit):
    countbc(obs)
print time.time() - t2