[Numpy-discussion] vectorizing

Fri Jun 5 17:59:09 EDT 2009

On Fri, Jun 5, 2009 at 2:01 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
> On Fri, Jun 5, 2009 at 1:22 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>> On Fri, Jun 5, 2009 at 1:05 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
>>> On Fri, Jun 5, 2009 at 1:01 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> On Fri, Jun 5, 2009 at 12:53 PM,  <josef.pktd at gmail.com> wrote:
>>>>> On Fri, Jun 5, 2009 at 2:07 PM, Brian Blais <bblais at bryant.edu> wrote:
>>>>>> Hello,
>>>>>> I have a vectorizing problem that I don't see an obvious way to solve.  What
>>>>>> I have is a vector like:
>>>>>> obs=array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>> and a matrix
>>>>>> T=zeros((6,6))
>>>>>> and what I want in T is a count of all of the transitions in obs, e.g.
>>>>>> T[1,2]=3 because the sequence 1-2 happens 3 times,  T[3,4]=1 because the
>>>>>> sequence 3-4 only happens once, etc...  I can do it unvectorized like:
>>>>>> for o1,o2 in zip(obs[:-1],obs[1:]):
>>>>>>     T[o1,o2]+=1
>>>>>>
>>>>>> which gives the correct answer from above, which is:
>>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>>        [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>>        [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>>        [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>>        [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>
>>>>>>
>>>>>> but I thought there would be a better way.  I tried:
>>>>>> o1=obs[:-1]
>>>>>> o2=obs[1:]
>>>>>> T[o1,o2]+=1
>>>>>> but this doesn't give a count, it just yields 1's at the transition points,
>>>>>> like:
>>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>>        [ 0.,  0.,  1.,  0.,  0.,  1.],
>>>>>>        [ 0.,  1.,  0.,  1.,  0.,  0.],
>>>>>>        [ 0.,  0.,  1.,  0.,  1.,  0.],
>>>>>>        [ 0.,  0.,  0.,  1.,  0.,  0.],
>>>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>
>>>>>> Is there a clever way to do this?  I could write a quick Cython solution,
>>>>>> but I wanted to keep this as an all-numpy implementation if I can.
>>>>>>
>>>>>
>>>>> histogram2d or its imitation, there was a discussion on histogram2d a
>>>>> short time ago
>>>>>
>>>>>>>> obs=np.array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>>>> obs2 = obs - 1
>>>>>>>> trans = np.hstack((0,np.bincount(obs2[:-1]*6+6+obs2[1:]),0)).reshape(6,6)
>>>>>>>> re = np.array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>> ...         [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>> ...         [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>> ...         [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>> ...         [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>> ...         [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>>> np.all(re == trans)
>>>>> True
>>>>>
>>>>>>>> trans
>>>>> array([[0, 0, 0, 0, 0, 0],
>>>>>       [0, 0, 3, 0, 0, 1],
>>>>>       [0, 3, 0, 1, 0, 0],
>>>>>       [0, 0, 2, 0, 1, 0],
>>>>>       [0, 0, 0, 2, 0, 0],
>>>>>       [0, 0, 0, 0, 1, 0]])
>>>>>
>>>>>
>>>>> or
>>>>>
>>>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>>>> re
>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>>>> h
>>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>
>>>>>>>> np.all(re == h)
>>>>> True
>>>>
>>>> There's no way my list method can beat that. But by adding
>>>>
>>>> import psyco
>>>> psyco.full()
>>>>
>>>> I get a total speed up of a factor of 15 when obs is length 10000.
>>>
>>> Actually, it is faster:
>>>
>>> histogram:
>>>
>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>> timeit h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>> 100 loops, best of 3: 4.14 ms per loop
>>>
>>> lists:
>>>
>>>>> timeit test(obs3, T3)
>>> 1000 loops, best of 3: 1.32 ms per loop
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>
>>
>> here's a go:
>>
>> import numpy as np
>> import random
>> from itertools import groupby
>>
>> def test1(obs, T):
>>   for o1,o2 in zip(obs[:-1],obs[1:]):
>>       T[o1][o2] += 1
>>   return T
>>
>>
>> def test2(obs, T):
>>    s = zip(obs[:-1], obs[1:])
>>    for idx, g in groupby(sorted(s)):
>>        T[idx] = len(list(g))
>>    return T
>>
>> obs = [random.randint(0, 5) for z in range(10000)]
>>
>> print test2(obs, np.zeros((6, 6)))
>> print test1(obs, np.zeros((6, 6)))
>>
>>
>> ##############
>>
>> In [10]: timeit test1(obs, np.zeros((6, 6)))
>> 100 loops, best of 3: 18.8 ms per loop
>>
>> In [11]: timeit test2(obs, np.zeros((6, 6)))
>> 100 loops, best of 3: 6.91 ms per loop
>
> Wait, you tested the list method with an array. Try
>
> timeit test1(obs, np.zeros((6, 6)).tolist())
>
> Probably best to move the array/list creation out of the timeit loop.
> Then my method won't have to pay the cost of converting to a list :)
>

ah right, your test1 is faster than test2 that way.
i'm just stoked to find out about ndimage.historgram.

using this:
>>> histogram((obs[:-1])*6 +obs[1:], 0, 36, 36).reshape(6,6)

it gives the exact result as test1/test2.

-b

 _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>