[Numpy-discussion] array slicing questions

Tue Jul 31 03:23:19 EDT 2012

2012/7/30 eat <e.antero.tammi at gmail.com>:
> Hi,
>
> A partial answer to your questions:
>
> On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom <vlastimil.brom at gmail.com>
> wrote:
>>
>> Hi all,
>> I'd like to ask for some hints or advice regarding the usage of
>> numpy.array and especially slicing.
>>
>> I only recently tried numpy and was impressed by the speedup in some
>> parts of the code, so I suspect that I might be missing some other
>> opportunities in this area.
>>
>> I currently use the following code for a simple visualisation of the
>> search matches within a text. The arrays are generally much larger
>> than in the sample: the text size is typically hundreds of kilobytes up
>> to a few MB, with an index position for each character.
>> First there is a list of spans (obtained from the regex match objects);
>> the character indices within each span should be set
>> to 1:
>>
>> >>> import numpy
>> >>> characters_matches = numpy.zeros(10)
>> >>> matches_spans = numpy.array([[2,4], [5,9]])
>> >>> for start, stop in matches_spans:
>> ...     characters_matches[start:stop] = 1
>> ...
>> >>> characters_matches
>> array([ 0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.])
>>
>> Is there maybe a way to achieve this in a numpy-only way, without the
>> Python loop?
>> (I got the impression that the powerful slicing capabilities could make it
>> possible, but I haven't found this kind of solution.)
>>
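One loop-free possibility is a difference array: add 1 at every span start, subtract 1 at every span stop, and take the cumulative sum. A minimal sketch, assuming the spans are half-open [start, stop) pairs that lie within the text length; mark_spans is a hypothetical name, not a numpy function:

```python
import numpy as np

def mark_spans(length, spans):
    """Return a float array of `length` with 1.0 inside each [start, stop) span."""
    spans = np.asarray(spans, dtype=int).reshape(-1, 2)
    # +1 where a span opens, -1 where it closes (stop may equal `length`)
    delta = (np.bincount(spans[:, 0], minlength=length + 1)
             - np.bincount(spans[:, 1], minlength=length + 1))
    # running count of open spans; > 0 means the position is covered
    return (np.cumsum(delta[:length]) > 0).astype(float)

characters_matches = mark_spans(10, [[2, 4], [5, 9]])
print(characters_matches)
```

Testing the running count with > 0 rather than == 1 keeps the result correct even if some spans overlap.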
>>
>> In the next piece of code all the character positions are evaluated
>> together with their "neighbourhood", and a kind of running proportion of
>> the matched text is computed (check_distance could generally be
>> up to the order of half the text length, but is usually less):
>>
>> >>>
>> >>> check_distance = 1
>> >>> floating_checks_proportions = []
>> >>> for i in numpy.arange(len(characters_matches)):
>> ...     lo = i - check_distance
>> ...     if lo < 0:
>> ...         lo = None
>> ...     hi = i + check_distance + 1
>> ...     checked_sublist = characters_matches[lo:hi]
>> ...     proportion = (checked_sublist.sum() / (check_distance * 2 + 1.0))
>> ...     floating_checks_proportions.append(proportion)
>> ...
>> >>> floating_checks_proportions
>> [0.0, 0.33333333333333331, 0.66666666666666663, 0.66666666666666663,
>> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0,
>> 0.66666666666666663, 0.33333333333333331]
>> >>>
>
> Define a function for proportions:
>
> from numpy import r_
> from numpy.lib.stride_tricks import as_strided as ast
>
> def proportions(matches, distance= 1):
>     cd, cd2p1, s= distance, 2* distance+ 1, matches.strides[0]
>     # pad
>     m= r_[[0.]* cd, matches, [0.]* cd]
>     # create a suitable view
>     m= ast(m, shape= (m.shape[0], cd2p1), strides= (s, s))
>     # average
>     return m[:-2* cd].sum(1)/ cd2p1
>
> and use it like:
> In []: matches
> Out[]: array([ 0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.])
>
> In []: proportions(matches).round(2)
> Out[]: array([ 0.  ,  0.33,  0.67,  0.67,  0.67,  0.67,  1.  ,  1.  ,  0.67,
> 0.33])
> In []: proportions(matches, 5).round(2)
> Out[]: array([ 0.27,  0.36,  0.45,  0.55,  0.55,  0.55,  0.55,  0.55,  0.45,
> 0.36])
>>
>>
>> I'd like to ask about the possible better approaches, as it doesn't
>> look very elegant to me, and I obviously don't know the implications
>> or possible drawbacks of numpy arrays in some scenarios.
>>
>> The pattern
>> for i in range(len(...)): is usually discouraged in Python,
>> but what should be used in this case, where the indices themselves are
>> needed?
>> Is there something to be gained or lost by using (x)range or np.arange,
>> given that the Python loop is (probably?) inevitable anyway?
>
> Here np.arange(.) will create a new array, potentially wasting memory if
> it's not otherwise used. IMO there is nothing wrong with looping over
> xrange(.) (if you really need to loop ;).
>>
>> Is there some more elegant way to check for the "underflowing" lower
>> bound "lo" and replace it with None?
>>
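On the lower bound and the result container: clamping with max() avoids the None special case, and the results can be written straight into a preallocated array. A sketch of the same loop under those two changes, with window_proportions_loop as a hypothetical name:

```python
import numpy as np

def window_proportions_loop(characters_matches, check_distance=1):
    """Loop version: proportion of matched characters around each position."""
    window = 2 * check_distance + 1.0
    n = len(characters_matches)
    result = np.empty(n)                     # preallocated instead of list.append
    for i in range(n):
        lo = max(i - check_distance, 0)      # clamp instead of replacing with None
        hi = i + check_distance + 1          # slicing past the end is harmless
        result[i] = characters_matches[lo:hi].sum() / window
    return result

sample = np.array([0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
print(window_proportions_loop(sample).round(2))
```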
>> Does it matter which container is used to collect the results of
>> the computation in the Python loop, i.e. a Python list or a numpy
>> array?
>> (Could matplotlib possibly cooperate better with either container?)
>>
>> And of course, are there other things that should be done
>> better or differently?
>>
>> (using Numpy 1.6.2, python 2.7.3, win XP)
>
>
> My 2 cents,
> -eat
>>
>> Thanks in advance for any hints or suggestions,
>>    regards,
>>   Vlastimil Brom
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
Hi,
thank you very much for your suggestions!

Do I understand correctly that I have to special-case the function
for distance = 0 (which should return the matches themselves without
recalculation)?

More importantly, however, I am getting a ValueError for some larger
(but not completely unreasonable) values of "distance":

>>> proportions(matches, distance= 8190)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 11, in proportions
  File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py", line 28, in as_strided
    return np.asarray(DummyArray(interface, base=x))
  File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: array is too big.
>>>

distance=8189 was the largest value that worked in this snippet;
however, the limit seems to depend on the data, as I also got this error
e.g. for distance=4529 with a 20k text.
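
The 8189/8190 threshold looks consistent with a size limit on the requested view rather than with anything in the data itself. A rough check, assuming 8-byte float64 elements and a 32-bit NumPy build (neither is confirmed here): as_strided is asked for a view of shape (len(matches) + 2*distance, 2*distance + 1), and for the 10-element sample its nominal size crosses 2**31 bytes exactly between those two values. The helper name nominal_view_bytes is made up for illustration:

```python
def nominal_view_bytes(n, distance, itemsize=8):
    """Nominal size in bytes of the strided view requested in proportions().

    Assumes float64 elements (itemsize=8); nothing is actually allocated here.
    """
    rows = n + 2 * distance        # length of the padded array
    cols = 2 * distance + 1        # window width
    return rows * cols * itemsize

limit = 2 ** 31                    # assumed 32-bit signed size limit
print(nominal_view_bytes(10, 8189) < limit)    # True
print(nominal_view_bytes(10, 8190) < limit)    # False
```

Since the row count grows with the text length, a longer text would hit the same limit at a smaller distance.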

Is this implementation-limited, or could it be solved in some
alternative way which wouldn't have such limits (up to the order of,
say, millions)?

Thanks again
  regards
    vbr
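
One alternative that avoids the 2-D view entirely is a cumulative sum: the sum over any window is the difference of two prefix sums, so memory use stays linear in the text length whatever the distance. A minimal sketch, assuming the same edge behaviour as the original loop (positions outside the text count as 0); proportions_cumsum is a hypothetical name:

```python
import numpy as np

def proportions_cumsum(matches, distance=1):
    """Running proportion of matches in a window of 2*distance + 1 positions."""
    matches = np.asarray(matches, dtype=float)
    n = matches.shape[0]
    window = 2 * distance + 1
    # csum[k] holds the sum of matches[:k], so csum[hi] - csum[lo] == matches[lo:hi].sum()
    csum = np.concatenate(([0.0], np.cumsum(matches)))
    idx = np.arange(n)
    lo = np.clip(idx - distance, 0, n)          # clamp window start to the text
    hi = np.clip(idx + distance + 1, 0, n)      # clamp window end to the text
    return (csum[hi] - csum[lo]) / window

matches = np.array([0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
print(proportions_cumsum(matches).round(2))
print(proportions_cumsum(matches, 8190))
```

With distance=0 the window covers a single position, so the function returns the matches unchanged and no special case is needed.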