[Numpy-discussion] vectorizing

Fri Jun 5 17:01:39 EDT 2009

On Fri, Jun 5, 2009 at 1:22 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> On Fri, Jun 5, 2009 at 1:05 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
>> On Fri, Jun 5, 2009 at 1:01 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>> On Fri, Jun 5, 2009 at 12:53 PM,  <josef.pktd at gmail.com> wrote:
>>>> On Fri, Jun 5, 2009 at 2:07 PM, Brian Blais <bblais at bryant.edu> wrote:
>>>>> Hello,
>>>>> I have a vectorizing problem that I don't see an obvious way to solve.  What
>>>>> I have is a vector like:
>>>>> obs=array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>> and a matrix
>>>>> T=zeros((6,6))
>>>>> and what I want in T is a count of all of the transitions in obs, e.g.
>>>>> T[1,2]=3 because the sequence 1-2 happens 3 times,  T[3,4]=1 because the
>>>>> sequence 3-4 only happens once, etc...  I can do it unvectorized like:
>>>>> for o1,o2 in zip(obs[:-1],obs[1:]):
>>>>>     T[o1,o2]+=1
>>>>>
>>>>> which gives the correct answer from above, which is:
>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>        [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>        [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>        [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>        [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>
>>>>>
>>>>> but I thought there would be a better way.  I tried:
>>>>> o1=obs[:-1]
>>>>> o2=obs[1:]
>>>>> T[o1,o2]+=1
>>>>> but this doesn't give a count, it just yields 1's at the transition points,
>>>>> like:
>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>        [ 0.,  0.,  1.,  0.,  0.,  1.],
>>>>>        [ 0.,  1.,  0.,  1.,  0.,  0.],
>>>>>        [ 0.,  0.,  1.,  0.,  1.,  0.],
>>>>>        [ 0.,  0.,  0.,  1.,  0.,  0.],
>>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>
>>>>> Is there a clever way to do this?  I could write a quick Cython solution,
>>>>> but I wanted to keep this as an all-numpy implementation if I can.
>>>>>
>>>>
>>>> histogram2d or its imitation, there was a discussion on histogram2d a
>>>> short time ago
>>>>
>>>>>>> obs=np.array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>>> obs2 = obs - 1
>>>>>>> trans = np.hstack((0,np.bincount(obs2[:-1]*6+6+obs2[1:]),0)).reshape(6,6)
>>>>>>> re = np.array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>> ...         [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>> ...         [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>> ...         [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>> ...         [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>> ...         [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>> np.all(re == trans)
>>>> True
>>>>
>>>>>>> trans
>>>> array([[0, 0, 0, 0, 0, 0],
>>>>       [0, 0, 3, 0, 0, 1],
>>>>       [0, 3, 0, 1, 0, 0],
>>>>       [0, 0, 2, 0, 1, 0],
>>>>       [0, 0, 0, 2, 0, 0],
>>>>       [0, 0, 0, 0, 1, 0]])
>>>>
>>>>
>>>> or
>>>>
>>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>>> re
>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>> h
>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>
>>>>>>> np.all(re == h)
>>>> True
>>>
>>> There's no way my list method can beat that. But by adding
>>>
>>> import psyco
>>> psyco.full()
>>>
>>> I get a total speed up of a factor of 15 when obs is length 10000.
>>
>> Actually, it is faster:
>>
>> histogram:
>>
>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>> timeit h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>> 100 loops, best of 3: 4.14 ms per loop
>>
>> lists:
>>
>>>> timeit test(obs3, T3)
>> 1000 loops, best of 3: 1.32 ms per loop
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>
> here's a go:
>
> import numpy as np
> import random
> from itertools import groupby
>
> def test1(obs, T):
>   for o1,o2 in zip(obs[:-1],obs[1:]):
>       T[o1][o2] += 1
>   return T
>
>
> def test2(obs, T):
>    s = zip(obs[:-1], obs[1:])
>    for idx, g in groupby(sorted(s)):
>        T[idx] = len(list(g))
>    return T
>
> obs = [random.randint(0, 5) for z in range(10000)]
>
> print test2(obs, np.zeros((6, 6)))
> print test1(obs, np.zeros((6, 6)))
>
>
> ##############
>
> In [10]: timeit test1(obs, np.zeros((6, 6)))
> 100 loops, best of 3: 18.8 ms per loop
>
> In [11]: timeit test2(obs, np.zeros((6, 6)))
> 100 loops, best of 3: 6.91 ms per loop

Wait, you tested the list method with an array. Try

timeit test1(obs, np.zeros((6, 6)).tolist())

Probably best to move the array/list creation out of the timeit loop.
Then my method won't have to pay the cost of converting to a list :)