[Numpy-discussion] vectorizing

Fri Jun 5 16:22:14 EDT 2009

On Fri, Jun 5, 2009 at 1:05 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
> On Fri, Jun 5, 2009 at 1:01 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Fri, Jun 5, 2009 at 12:53 PM,  <josef.pktd at gmail.com> wrote:
>>> On Fri, Jun 5, 2009 at 2:07 PM, Brian Blais <bblais at bryant.edu> wrote:
>>>> Hello,
>>>> I have a vectorizing problem that I don't see an obvious way to solve.  What
>>>> I have is a vector like:
>>>> obs=array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>> and a matrix
>>>> T=zeros((6,6))
>>>> and what I want in T is a count of all of the transitions in obs, e.g.
>>>> T[1,2]=3 because the sequence 1-2 happens 3 times,  T[3,4]=1 because the
>>>> sequence 3-4 only happens once, etc...  I can do it unvectorized like:
>>>> for o1,o2 in zip(obs[:-1],obs[1:]):
>>>>     T[o1,o2]+=1
>>>>
>>>> which gives the correct answer from above, which is:
>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>        [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>>        [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>>        [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>>        [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>
>>>>
>>>> but I thought there would be a better way.  I tried:
>>>> o1=obs[:-1]
>>>> o2=obs[1:]
>>>> T[o1,o2]+=1
>>>> but this doesn't give a count, it just yields 1's at the transition points,
>>>> like:
>>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>>        [ 0.,  0.,  1.,  0.,  0.,  1.],
>>>>        [ 0.,  1.,  0.,  1.,  0.,  0.],
>>>>        [ 0.,  0.,  1.,  0.,  1.,  0.],
>>>>        [ 0.,  0.,  0.,  1.,  0.,  0.],
>>>>        [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>
>>>> Is there a clever way to do this?  I could write a quick Cython solution,
>>>> but I wanted to keep this as an all-numpy implementation if I can.
>>>>
>>>
>>> histogram2d or its imitation, there was a discussion on histogram2d a
>>> short time ago
>>>
>>>>>> obs=np.array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>> obs2 = obs - 1
>>>>>> trans = np.hstack((0,np.bincount(obs2[:-1]*6+6+obs2[1:]),0)).reshape(6,6)
>>>>>> re = np.array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>> ...         [ 0.,  0.,  3.,  0.,  0.,  1.],
>>> ...         [ 0.,  3.,  0.,  1.,  0.,  0.],
>>> ...         [ 0.,  0.,  2.,  0.,  1.,  0.],
>>> ...         [ 0.,  0.,  0.,  2.,  0.,  0.],
>>> ...         [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>> np.all(re == trans)
>>> True
>>>
>>>>>> trans
>>> array([[0, 0, 0, 0, 0, 0],
>>>       [0, 0, 3, 0, 0, 1],
>>>       [0, 3, 0, 1, 0, 0],
>>>       [0, 0, 2, 0, 1, 0],
>>>       [0, 0, 0, 2, 0, 0],
>>>       [0, 0, 0, 0, 1, 0]])
>>>
>>>
>>> or
>>>
>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>> re
>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>>>> h
>>> array([[ 0.,  0.,  0.,  0.,  0.,  0.],
>>>       [ 0.,  0.,  3.,  0.,  0.,  1.],
>>>       [ 0.,  3.,  0.,  1.,  0.,  0.],
>>>       [ 0.,  0.,  2.,  0.,  1.,  0.],
>>>       [ 0.,  0.,  0.,  2.,  0.,  0.],
>>>       [ 0.,  0.,  0.,  0.,  1.,  0.]])
>>>
>>>>>> np.all(re == h)
>>> True
>>
>> There's no way my list method can beat that. But by adding
>>
>> import psyco
>> psyco.full()
>>
>> I get a total speed up of a factor of 15 when obs is length 10000.
>
> Actually, it is faster:
>
> histogram:
>
>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>> timeit h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
> 100 loops, best of 3: 4.14 ms per loop
>
> lists:
>
>>> timeit test(obs3, T3)
> 1000 loops, best of 3: 1.32 ms per loop
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>


here's a go:

import numpy as np
import random
from itertools import groupby

def test1(obs, T):
   for o1,o2 in zip(obs[:-1],obs[1:]):
       T[o1][o2] += 1
   return T


def test2(obs, T):
    s = zip(obs[:-1], obs[1:])
    for idx, g in groupby(sorted(s)):
        T[idx] = len(list(g))
    return T

obs = [random.randint(0, 5) for z in range(10000)]

print test2(obs, np.zeros((6, 6)))
print test1(obs, np.zeros((6, 6)))


##############

In [10]: timeit test1(obs, np.zeros((6, 6)))
100 loops, best of 3: 18.8 ms per loop

In [11]: timeit test2(obs, np.zeros((6, 6)))
100 loops, best of 3: 6.91 ms per loop