[Numpy-discussion] array slicing questions

Tue Jul 31 09:23:34 EDT 2012

Hi,

On Tue, Jul 31, 2012 at 10:23 AM, Vlastimil Brom
<vlastimil.brom at gmail.com> wrote:

> 2012/7/30 eat <e.antero.tammi at gmail.com>:
> > Hi,
> >
> > A partial answer to your questions:
> >
> > On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom <
> vlastimil.brom at gmail.com>
> > wrote:
> >>
> >> Hi all,
> >> I'd like to ask for some hints or advice regarding the usage of
> >> numpy.array and especially  slicing.
> >>
> >> I only recently tried numpy and was impressed by the speedup in some
> >> parts of the code, so I suspect that I might be missing some other
> >> opportunities in this area.
> >>
> >> I currently use the following code for a simple visualisation of the
> >> search matches within the text, the arrays are generally much larger
> >> than the sample - the text size is generally hundreds of kilobytes up
> >> to a few MB - with an index position for each character.
> >> First there is a list of spans (obtained from the regex match objects);
> >> the character indices within each span should be set
> >> to 1:
> >>
> >> >>> import numpy
> >> >>> characters_matches = numpy.zeros(10)
> >> >>> matches_spans = numpy.array([[2,4], [5,9]])
> >> >>> for start, stop in matches_spans:
> >> ...     characters_matches[start:stop] = 1
> >> ...
> >> >>> characters_matches
> >> array([ 0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.])
> >>
> >> Is there maybe a way to achieve this in a numpy-only way - without the
> >> python loop?
> >> (I got the impression that the powerful slicing capabilities could make it
> >> possible, but I haven't found this kind of solution.)
> >>
> >>
> >> In the next piece of code all the character positions are evaluated
> >> with their "neighbourhood" and a kind of running proportion of the
> >> matched text parts is computed (check_distance could generally be
> >> up to the order of half the text length, usually less):
> >>
> >> >>>
> >> >>> check_distance = 1
> >> >>> floating_checks_proportions = []
> >> >>> for i in numpy.arange(len(characters_matches)):
> >> ...     lo = i - check_distance
> >> ...     if lo < 0:
> >> ...         lo = None
> >> ...     hi = i + check_distance + 1
> >> ...     checked_sublist = characters_matches[lo:hi]
> >> ...     proportion = (checked_sublist.sum() / (check_distance * 2 + 1.0))
> >> ...     floating_checks_proportions.append(proportion)
> >> ...
> >> >>> floating_checks_proportions
> >> [0.0, 0.33333333333333331, 0.66666666666666663, 0.66666666666666663,
> >> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0,
> >> 0.66666666666666663, 0.33333333333333331]
> >> >>>
> >
> > Define a function for proportions:
> >
> > from numpy import r_
> > from numpy.lib.stride_tricks import as_strided as ast
> >
> > def proportions(matches, distance=1):
> >     cd, cd2p1, s = distance, 2 * distance + 1, matches.strides[0]
> >     # pad
> >     m = r_[[0.] * cd, matches, [0.] * cd]
> >     # create a suitable view
> >     m = ast(m, shape=(m.shape[0], cd2p1), strides=(s, s))
> >     # average
> >     return m[:-2 * cd].sum(1) / cd2p1
> >
> > and use it like:
> > In []: matches
> > Out[]: array([ 0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.])
> >
> > In []: proportions(matches).round(2)
> > Out[]: array([ 0.  ,  0.33,  0.67,  0.67,  0.67,  0.67,  1.  ,  1.  ,  0.67,  0.33])
> >
> > In []: proportions(matches, 5).round(2)
> > Out[]: array([ 0.27,  0.36,  0.45,  0.55,  0.55,  0.55,  0.55,  0.55,  0.45,  0.36])
> >>
> >>
> >> I'd like to ask about the possible better approaches, as it doesn't
> >> look very elegant to me, and I obviously don't know the implications
> >> or possible drawbacks of numpy arrays in some scenarios.
> >>
> >> The pattern
> >> for i in range(len(...)): is usually considered inadequate in python,
> >> but what should be used in this case, as the indices are primarily
> >> needed?
> >> Is there something to be gained or lost using (x)range or np.arange, as
> >> the python loop is (probably?) inevitable anyway?
> >
> > Here np.arange(.) creates a new array, potentially wasting memory if
> > it's not otherwise used. IMO there is nothing wrong with looping over
> > xrange(.) (if you really need to loop ;).
> >>
> >> Is there some more elegant way to check for the "underflowing" lower
> >> bound "lo" to replace with None?
> >>
> >> Is it significant, which container is used to collect the results of
> >> the computation in the python loop - i.e. python list or a numpy
> >> array?
> >> (Could possibly matplotlib cooperate better with either container?)
> >>
> >> And of course, are there maybe other things, which should be made
> >> better/differently?
> >>
> >> (using Numpy 1.6.2, python 2.7.3, win XP)
> >
> >
> > My 2 cents,
> > -eat
> >>
> >> Thanks in advance for any hints or suggestions,
> >>    regards,
> >>   Vlastimil Brom
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion at scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> Hi,
> thank you very much for your suggestions!
>
> do I understand it correctly, that I have to special-case the function
> for distance = 0 (which should return the matches themselves without
> recalculation)?
>
Yes.
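
With the function as posted, m[:-2* cd] becomes m[:-0] for distance = 0,
which is an empty slice. A minimal sketch of a variant that avoids the
special case by creating the view with exactly len(matches) rows, so no
trailing slice is needed (the name proportions_any_distance is only
illustrative, and I have not tried it on NumPy 1.6.2):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def proportions_any_distance(matches, distance=1):
    # Hypothetical variant of proportions(); also works for distance = 0.
    matches = np.asarray(matches, dtype=float)
    width = 2 * distance + 1
    # Pad with `distance` zeros on both sides (no padding when distance = 0).
    padded = np.concatenate([np.zeros(distance), matches, np.zeros(distance)])
    s = padded.strides[0]
    # One row per original element; row i views padded[i:i + width],
    # so every row stays inside the padded buffer.
    windows = as_strided(padded, shape=(matches.shape[0], width), strides=(s, s))
    return windows.sum(axis=1) / width
```

For distance = 0 each window is a single element, so the result is simply
the matches themselves.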

>
> However, more importantly, I am getting a ValueError for some larger,
> (but not completely unreasonable) "distance"
>
> >>> proportions(matches, distance= 8190)
> Traceback (most recent call last):
>   File "<input>", line 1, in <module>
>   File "<input>", line 11, in proportions
>   File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py",
> line 28, in as_strided
>     return np.asarray(DummyArray(interface, base=x))
>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line
> 235, in asarray
>     return array(a, dtype, copy=False, order=order)
> ValueError: array is too big.
> >>>
>
> the distance= 8189 was the largest which worked in this snippet,
> however, it might be data-dependent, as I got this error as well e.g.
> for distance=4529 for a 20k text.
>
> Is this implementation-limited, or could it be solved in some
> alternative way which wouldn't have such limits (up to the order of,
> say, millions)?
>
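ast(.) does return a view rather than a copy, but NumPy still checks the
nominal size of the requested shape, here (n+ 2* distance)* (2* distance+ 1)
elements of 8 bytes each, and on a 32-bit build (which I assume you have)
that must stay below 2**31 bytes. For your 10-element sample, distance= 8190
asks for 16390* 16381* 8 = 2147876720 bytes, just over the limit, while 8189
stays just under it. So the threshold depends on the text length as well.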

Surely it can be solved for up to millions of matches, but perhaps at a much
slower speed.
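
One alternative that never builds the (n, 2* distance+ 1) shape is to take
differences of a cumulative sum. A rough sketch, assuming matches is a 1-D
array of 0/1 values (proportions_cumsum is just a name I made up here):

```python
import numpy as np


def proportions_cumsum(matches, distance=1):
    # Hypothetical helper: windowed proportions via a cumulative sum.
    matches = np.asarray(matches, dtype=float)
    n = matches.shape[0]
    # csum[k] == matches[:k].sum(), with csum[0] == 0
    csum = np.concatenate([[0.0], np.cumsum(matches)])
    idx = np.arange(n)
    # Window bounds, clipped to the valid range instead of padding.
    lo = np.clip(idx - distance, 0, n)
    hi = np.clip(idx + distance + 1, 0, n)
    # Sum over matches[lo:hi] for every position at once.
    return (csum[hi] - csum[lo]) / (2 * distance + 1.0)
```

Memory use is a few arrays of length n regardless of distance, so a
distance in the millions should be fine, and distance = 0 needs no special
case either.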

Regards,
-eat
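
P.S. On your first question (setting the spans without the python loop),
one possibility is to count span starts and stops per index and take a
running sum. A small sketch, assuming non-negative (start, stop) pairs with
stop <= length; mark_spans is only an illustrative name:

```python
import numpy as np


def mark_spans(length, spans):
    # Hypothetical helper: 1.0 inside any [start, stop) span, else 0.0.
    spans = np.asarray(spans, dtype=int).reshape(-1, 2)
    # +1 at every start, -1 at every stop; bincount handles repeated indices.
    starts = np.bincount(spans[:, 0], minlength=length + 1)
    stops = np.bincount(spans[:, 1], minlength=length + 1)
    delta = starts - stops
    # The running sum is > 0 exactly at positions covered by a span.
    return (np.cumsum(delta[:length]) > 0).astype(float)
```

Used as mark_spans(10, [[2, 4], [5, 9]]), this should give the same array
as your loop, and overlapping spans are handled as well.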

>
> Thanks again
>   regards
>     vbr