[Numpy-discussion] array slicing questions

Tue Jul 31 14:12:44 EDT 2012

2012/7/31 eat <e.antero.tammi at gmail.com>:
> Hi,
>
> On Tue, Jul 31, 2012 at 7:20 PM, Vlastimil Brom <vlastimil.brom at gmail.com>
> wrote:
>>
>> 2012/7/31 eat <e.antero.tammi at gmail.com>:
>> > Hi,
>> >
>> > On Tue, Jul 31, 2012 at 5:01 PM, Vlastimil Brom
>> > <vlastimil.brom at gmail.com>
>> > wrote:
>> >>
>> >> 2012/7/31 eat <e.antero.tammi at gmail.com>:
>> >> > Hi,
>> >> >
>> >> > On Tue, Jul 31, 2012 at 10:23 AM, Vlastimil Brom
>> >> > <vlastimil.brom at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> 2012/7/30 eat <e.antero.tammi at gmail.com>:
>> >> >> > Hi,
>> >> >> >
>> >> >> > A partial answer to your questions:
>> >> >> >
>> >> >> > On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom
>> >> >> > <vlastimil.brom at gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Hi all,
>> >> >> >> I'd like to ask for some hints or advice regarding the usage of
>> >> >> >> numpy.array and especially  slicing.
>> >> >> >>
>> >> >> >> I only recently tried numpy and was impressed by the speedup in
>> >> >> >> some
>> >> >> >> parts of the code, hence I suspect, that I might miss some other
>> >> >> >> oportunities in this area.
>> >> >> >>
>> >> >> >> I currently use the following code for a simple visualisation of
>> >> >> >> the
>> >> >> >> search matches within the text, the arrays are generally much
>> >> >> >> larger
>> >> >> >> than the sample - the texts size is generally hundreds of
>> >> >> >> kilobytes
>> >> >> >> up
>> >> >> >> to a few MB - with an index position for each character.
>> >> >> >> First there is a list of spans(obtained form the regex match
>> >> >> >> objects),
>> >> >> >> the respective character indices in between these slices should
>> >> >> >> be
>> >> >> >> set
>> >> >> >> to 1:
>> >> >> >>
>> >> >> >> >>> import numpy
>> >> >> >> >>> characters_matches = numpy.zeros(10)
>> >> >> >> >>> matches_spans = numpy.array([[2,4], [5,9]])
>> >> >> >> >>> for start, stop in matches_spans:
>> >> >> >> ...     characters_matches[start:stop] = 1
>> >> >> >> ...
>> >> >> >> >>> characters_matches
>> >> >> >> array([ 0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.])
>> >> >> >>
>> >> >> >> Is there maybe a way tu achieve this in a numpy-only way -
>> >> >> >> without
>> >> >> >> the
>> >> >> >> python loop?
>> >> >> >> (I got the impression, the powerful slicing capabilities could
>> >> >> >> make
>> >> >> >> it
>> >> >> >> possible, bud haven't found this kind of solution.)
>> >> >> >>
>> >> >> >>
>> >> >> >> In the next piece of code all the character positions are
>> >> >> >> evaluated
>> >> >> >> with their "neighbourhood" and a kind of running proportions of
>> >> >> >> the
>> >> >> >> matched text parts are computed (the checks_distance could be
>> >> >> >> generally up to the order of the half the text length, usually
>> >> >> >> less
>> >> >> >> :
>> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> check_distance = 1
>> >> >> >> >>> floating_checks_proportions = []
>> >> >> >> >>> for i in numpy.arange(len(characters_matches)):
>> >> >> >> ...     lo = i - check_distance
>> >> >> >> ...     if lo < 0:
>> >> >> >> ...         lo = None
>> >> >> >> ...     hi = i + check_distance + 1
>> >> >> >> ...     checked_sublist = characters_matches[lo:hi]
>> >> >> >> ...     proportion = (checked_sublist.sum() / (check_distance * 2
>> >> >> >> +
>> >> >> >> 1.0))
>> >> >> >> ...     floating_checks_proportions.append(proportion)
>> >> >> >> ...
>> >> >> >> >>> floating_checks_proportions
>> >> >> >> [0.0, 0.33333333333333331, 0.66666666666666663,
>> >> >> >> 0.66666666666666663,
>> >> >> >> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0,
>> >> >> >> 0.66666666666666663, 0.33333333333333331]
>> >> >> >> >>>
>> >> >> >
>> >> >> > Define a function for proportions:
>> >> >> >
>> >> >> > from numpy import r_
>> >> >> >
>> >> >> > from numpy.lib.stride_tricks import as_strided as ast
>> >> >> >
>> >> >> > def proportions(matches, distance= 1):
>> >> >> >
>> >> >> >     cd, cd2p1, s= distance, 2* distance+ 1, matches.strides[0]
>> >> >> >
>> >> >> >     # pad
>> >> >> >
>> >> >> >     m= r_[[0.]* cd, matches, [0.]* cd]
>> >> >> >
>> >> >> >     # create a suitable view
>> >> >> >
>> >> >> >     m= ast(m, shape= (m.shape[0], cd2p1), strides= (s, s))
>> >> >> >
>> >> >> >     # average
>> >> >> >
>> >> >> >     return m[:-2* cd].sum(1)/ cd2p1
>> >> >> > and use it like:
>> >> >> > In []: matches
>> >> >> > Out[]: array([ 0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.])
>> >> >> >
>> >> >> > In []: proportions(matches).round(2)
>> >> >> > Out[]: array([ 0.  ,  0.33,  0.67,  0.67,  0.67,  0.67,  1.  ,  1.
>> >> >> > ,
>> >> >> > 0.67,
>> >> >> > 0.33])
>> >> >> > In []: proportions(matches, 5).round(2)
>> >> >> > Out[]: array([ 0.27,  0.36,  0.45,  0.55,  0.55,  0.55,  0.55,
>> >> >> > 0.55,
>> >> >> > 0.45,
>> >> >> > 0.36])
>> >> >> >>
>> >> >> >>
>> >> >> >> I'd like to ask about the possible better approaches, as it
>> >> >> >> doesn't
>> >> >> >> look very elegant to me, and I obviously don't know the
>> >> >> >> implications
>> >> >> >> or possible drawbacks of numpy arrays in some scenarios.
>> >> >> >>
>> >> >> >> the pattern
>> >> >> >> for i in range(len(...)): is usually considered inadequate in
>> >> >> >> python,
>> >> >> >> but what should be used in this case as the indices are primarily
>> >> >> >> needed?
>> >> >> >> is something to be gained or lost using (x)range or np.arange as
>> >> >> >> the
>> >> >> >> python loop is (probably?) inevitable anyway?
>> >> >> >
>> >> >> > Here np.arange(.) will create a new array and potentially wasting
>> >> >> > memory
>> >> >> > if
>> >> >> > it's not otherwise used. IMO nothing wrong looping with xrange(.)
>> >> >> > (if
>> >> >> > you
>> >> >> > really need to loop ;).
>> >> >> >>
>> >> >> >> Is there some mor elegant way to check for the "underflowing"
>> >> >> >> lower
>> >> >> >> bound "lo" to replace with None?
>> >> >> >>
>> >> >> >> Is it significant, which container is used to collect the results
>> >> >> >> of
>> >> >> >> the computation in the python loop - i.e. python list or a numpy
>> >> >> >> array?
>> >> >> >> (Could possibly matplotlib cooperate better with either
>> >> >> >> container?)
>> >> >> >>
>> >> >> >> And of course, are there maybe other things, which should be made
>> >> >> >> better/differently?
>> >> >> >>
>> >> >> >> (using Numpy 1.6.2, python 2.7.3, win XP)
>> >> >> >
>> >> >> >
>> >> >> > My 2 cents,
>> >> >> > -eat
>> >> >> >>
>> >> >> >> Thanks in advance for any hints or suggestions,
>> >> >> >>    regards,
>> >> >> >>   Vlastimil Brom
>> >> >> >> _______________________________________________
>> >> >> >> NumPy-Discussion mailing list
>> >> >> >> NumPy-Discussion at scipy.org
>> >> >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >> >> >
>> >> >> Hi,
>> >> >> thank you very much for your suggestions!
>> >> >>
>> >> >> do I understand it correctly, that I have to special-case the
>> >> >> function
>> >> >> for distance = 0 (which should return the matches themselves without
>> >> >> recalculation)?
>> >> >
>> >> > Yes.
>> >> >>
>> >> >>
>> >> >> However, more importantly, I am getting a ValueError for some
>> >> >> larger,
>> >> >> (but not completely unreasonable) "distance"
>> >> >>
>> >> >> >>> proportions(matches, distance= 8190)
>> >> >> Traceback (most recent call last):
>> >> >>   File "<input>", line 1, in <module>
>> >> >>   File "<input>", line 11, in proportions
>> >> >>   File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py",
>> >> >> line 28, in as_strided
>> >> >>     return np.asarray(DummyArray(interface, base=x))
>> >> >>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line
>> >> >> 235, in asarray
>> >> >>     return array(a, dtype, copy=False, order=order)
>> >> >> ValueError: array is too big.
>> >> >> >>>
>> >> >>
>> >> >> the distance= 8189 was the largest which worked in this snippet,
>> >> >> however, it might be data-dependent, as I got this error as well
>> >> >> e.g.
>> >> >> for distance=4529 for a 20k text.
>> >> >>
>> >> >> Is this implementation-limited, or could it be solved in some
>> >> >> alternative way which wouldn't have such limits (up to the order of,
>> >> >> say, millions)?
>> >> >
>> >> > Apparently ast(.) does not return a view of the original matches
>> >> > rather
>> >> > a
>> >> > copy of size (n* (2* distance+ 1)), thus you may run out of memory.
>> >> >
>> >> > Surely it can be solved up to millions of matches, but perhaps much
>> >> > slower
>> >> > speed.
>> >> >
>> >> >
>> >> > Regards,
>> >> > -eat
>> >> >>
>> >> >>
>> >> >> Thanks again
>> >> >>   regards
>> >> >>     vbr
>> >> >>
>> >> >> _______________________________________________
>> >> >> NumPy-Discussion mailing list
>> >> >> NumPy-Discussion at scipy.org
>> >> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >> >
>> >>
>> >> Thank you for the confirmation,
>> >> I'll wait and see, whether the current speed isn't actually already
>> >> acceptable for the most cases...
>> >> I could already gain a speedup by using the array.sum()  and other
>> >> features, maybe I will find yet other possibilities.
>> >
>> > I just cooked up some pure pyhton and running sum based solution, which
>> > actually may be faster and it scales quite well up to millions of
>> > matches:
>> >
>> > def proportions_p(matches, distance= 1):
>> >
>> >     cd, cd2p1= distance, 2* distance+ 1
>> >
>> >     m, r= [0]* cd+ matches+ [0]* (cd+ 1), [0]* len(matches)
>> >
>> >     s= sum(m[:cd2p1])
>> >
>> >     for k in xrange(len(matches)):
>> >
>> >         r[k]= s/ cd2p1
>> >
>> >         s-= m[k]
>> >
>> >         s+= m[cd2p1+ k]
>> >
>> >     return r
>> >
>> >
>> > Some verification and timings:
>> > In []: a= arange(1, 100000, dtype= float)
>> > In []: allclose(proportions(a, 1000), proportions_p(a.tolist(), 1000))
>> > Out[]: True
>> >
>> > In []: %timeit proportions(a, 1000)
>> > 1 loops, best of 3: 288 ms per loop
>> > In []: %timeit proportions_p(a.tolist(), 1000)
>> > 10 loops, best of 3: 66.2 ms per loop
>> >
>> > In []: a= arange(1, 1000000, dtype= float)
>> > In []: %timeit proportions(a, 10000)
>> > ------------------------------------------------------------
>> > Traceback (most recent call last):
>> > [snip]
>> > ValueError: array is too big.
>> >
>> > In []: %timeit proportions_p(a.tolist(), 10000)
>> > 1 loops, best of 3: 680 ms per loop
>> >
>> >
>> > Regards,
>> > -eat
>> >>
>> >>
>> >> regards,
>> >>
>> >>     vbr
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion at scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>> Thanks for further assistance;
>> I hope, I am not misunderstanding something in the code, but this
>> calculation of proportions is supposed to be run over an array of
>> either 0 or 1, rather than a range;
>> a test data would be something like:
>
> It really shouldn't matter what the test case numbers are. We are just
> calculating averages over of certain 'window', like:
> In []: proportions(arange(1, 4, dtype= float))
> Out[]: array([ 1.        ,  2.        ,  1.66666667])
> In []: (0.+ 1.+ 2.)/ 3
> Out[]: 1.0
> In []: (1.+ 2.+ 3.)/ 3
> Out[]: 2.0
> In []: (2.+ 3.+ 0.)/ 3
> Out[]: 1.6666666666666667
>
>
> Regards,
> -eat
>>
>>
>> import random
>> test_lst = [0,1]*500
>> random.shuffle(test_lst)
>>
>> For this data, I am not getting the same results like with my
>> previously posted python function.
>>
>> Thanks and regards
>>  vbr
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>
Thanks,
and sorry for the stupid mistake,
I somehow forgot dtype=float in my testdata, which apparently caused
the errors and different output of the functions in question.
I hoped, it is fixed now.
 Regards,
   vbr