"in" operator for strings

Alex Martelli aleaxit at yahoo.com
Fri Feb 2 12:01:54 EST 2001


"Magnus Lie Hetland" <mlh at idi.ntnu.no> wrote in message
news:95ebcv$bn$1 at tyfon.itea.ntnu.no...
    [snip]
> > asked for (though the original poster may not have thought
> > of this 'obvious' generalization, specialcasing string would
> > surely not be warranted).  Unfortunately, for general cases
> > it doesn't scale well -- i.e., now:
>
> Well... Then "in" would test for both membership and
> subsequenceness... Quite a strong case of ambiguity if you
> ask me. In my opinion a really *bad* idea. (And would it

Right.

> only allow contiguous subsequences, or any subsequence?
> If we ever get built-in sets contiguity would be meaningless,
> as would it be if we get "in" tests for dictionaries.)

Dictionaries aren't sequences, nor will sets be, so the
issue doesn't apply to either.  Still, you're right that,
*for sequences*, there are TWO definitions of subsequence
that are meaningful in different context -- contiguous
(aka compact) and generalized.

>>> print [1,2,4] in subseqOf(range(7), contiguous=1)
0
>>> print [1,2,4] in subseqOf(range(7), contiguous=0)
1

Implementation of class subseqOf is left as an
exercise to the reader:-).  Actually, a very
simple-minded one IS simple to implement -- it's
actually simpler having subseqOf be a factory
function, and:

def subseqOf(seq, contiguous=None):
    if contiguous: return contSubs(seq)
    else: return genrSubs(seq)

class genrSubs:
    def __init__(self,seq):
        self.seq=seq
    def __contains__(self,subs):
        start = 0
        for x in subs:
            try: where=self.seq[start:].index(x)
            except ValueError: return 0
            # where is relative to the slice, so advance
            # past the absolute position start+where
            start = start+where+1
        return 1

class contSubs:
    def __init__(self,seq):
        self.seq=seq
        self.len=len(seq)
    def __contains__(self,subs):
        slen = len(subs)
        # +1 so the window ending at the last element
        # is also checked
        for i in range(self.len-slen+1):
            if self.seq[i:i+slen]==subs:
                return 1
        return 0

or thereabouts (warning, untested code!).
(One CAN, of course, do better than this!-).

Implementing __getitem__, to enumerate all
subsequences (contiguous and non) is also
of some interest -- a typical combinatorics
problem.  Generalized subsequences are a
picnic -- thanks to the natural isomorphism
with binary numbers of self.len bits (I've
posted snippets using that to get "all
combinations" pretty recently).  Contiguous
subsequences are more interesting -- we
presumably still want to see the empty one
just once, so we'll single that out as
subsequence number 0 of any sequence, then
get to the real business of NON-empty subs.
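That bit-pattern isomorphism for the generalized case can be sketched in a few lines (the function name is mine, not anything standard):

```python
def nth_gen_subsequence(seq, n):
    # Use the bits of n (0 <= n < 2**len(seq)) as a membership
    # mask: bit i set means seq[i] is kept in the result.
    return [x for i, x in enumerate(seq) if n >> i & 1]
```

so iterating n over range(2**len(seq)) enumerates every generalized subsequence exactly once.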

Here, we want of course to count in a
_triangular_ way -- for a sequence of len
N, we'll have N subsequences that start
at 0, N-1 that start at 1, ..., N-x that
start at x, etc.  So, given a subsequence
index X, iterative computation of subseq
start and length is easy (again, untested
code...):
    if X==0: #special-case the empty subsequence
        return self.seq[0:0]
    X -= 1
    start = 0
    maxlen = N
    while start < N:
        if X < maxlen:
            # X==0 must give a length-1 subsequence, hence +1
            return self.seq[start:start+X+1]
        X -= maxlen
        start += 1
        maxlen -= 1
    raise IndexError
but, that's lazy -- surely some closed form
exists (but there is no space in the margins
of this post for me to write it down...).
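For what it's worth, one candidate closed form inverts the triangular count with an integer square root (names are mine; the adjustment loop guards against the floor-sqrt landing one start too high):

```python
import math

def nth_contiguous_subsequence(seq, X):
    # Number 0 is the empty subsequence; numbers 1 through
    # N*(N+1)/2 are the non-empty contiguous ones, in the
    # triangular order described above (by start, then length).
    N = len(seq)
    if X == 0:
        return seq[0:0]
    X -= 1
    if X >= N * (N + 1) // 2:
        raise IndexError
    # Starts before s contribute C(s) = s*(2N+1-s)/2
    # subsequences; invert that via the quadratic formula.
    r = math.isqrt((2 * N + 1) ** 2 - 8 * X)
    s = (2 * N + 1 - r) // 2
    while s * (2 * N + 1 - s) // 2 > X:  # floor-sqrt fixup
        s -= 1
    offset = X - s * (2 * N + 1 - s) // 2
    return seq[s:s + offset + 1]
```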

> > Also, of course, this would throw any parallel between
> > "x in y" and "for x in y" out of the windows unless the
> > latter starts looping on all *subsequences* -- eeep!-)
>
> Actually, that sounds very interesting ;-)
>
> How about having a power-sequence-function (like the standard
> power-set function in mathematics)? Then pow(seq) or seq.pow()
> would return a (perhaps lazy) sequence containing all
> subsequences... And one could have seq.pow(contiguous=1) if
> one only wanted contiguous subsequences... Then one would
> have:
>
>   >>> "Waldo" in "Ralph Waldo Emerson".pow(contiguous=1)
>   1
>
> An ingenious implementation of the laziness would be needed
> I guess... Or maybe not.

Not for the generalized case, unless you count the fact
of knowing that subsets of an N-element set, and binary
numbers of N bits, have a natural isomorphism, as in
some way "ingenious":-).  Of course one could cheat in
the enumeration for the more interesting contiguous case,
since the index being asked for only goes up by 1 at a
time in a for loop; but finding a closed-form expression
of "N-th contiguous subsequence" sounds much more fun,
and it might in fact have some use (e.g., "get a random
contiguous subsequence with all contiguous subsequences
having equal probability" -- if you can enumerate such
subsequences in closed form then that's a solved problem
too).
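That random-sampling application can be sketched without the closed form, just by decoding a uniform index with the triangular walk (function name mine):

```python
import random

def random_contiguous_subsequence(seq):
    # Draw a uniform index over all N*(N+1)/2 + 1 contiguous
    # subsequences (empty included, counted once), then decode
    # it into a (start, length) pair.
    n = len(seq)
    x = random.randrange(n * (n + 1) // 2 + 1)
    if x == 0:
        return seq[0:0]
    x -= 1
    start, maxlen = 0, n
    while x >= maxlen:
        x -= maxlen
        start += 1
        maxlen -= 1
    return seq[start:start + x + 1]
```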

> A simple string-matching algorithm
> would be needed for the contiguous case, and a O(n*m)
> exhaustive search for the non-contiguous case. (Or perhaps
> something better would be possible by putting the elements
> in a hash table... O(n+m)? Might have quite some overhead,
> perhaps... Oh, well)

I suspect a modified Boyer-Moore could be very fast and
effective for contiguous-subsequence checks -- sublinear
just like "real" Boyer-Moore!-)
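A Horspool-style simplification of Boyer-Moore, sketched here for arbitrary sequences of hashable elements (it keeps only the bad-character shift, so "modified" in the loosest sense; the name is mine):

```python
def horspool_contains(seq, subs):
    # Return 1 if subs occurs as a contiguous subsequence
    # of seq, else 0, skipping ahead by a bad-character shift.
    m, n = len(subs), len(seq)
    if m == 0:
        return 1
    # For each element of subs except its last position:
    # distance from its last occurrence to the pattern's end.
    shift = {}
    for i in range(m - 1):
        shift[subs[i]] = m - 1 - i
    i = 0
    while i <= n - m:
        if seq[i:i + m] == subs:
            return 1
        # Jump by the shift of the window's last element;
        # elements absent from subs allow a full-length skip.
        i += shift.get(seq[i + m - 1], m)
    return 0
```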

For the non-contiguous case, you could have a map
    'alphabet' element -> places where that element is
which is just O(N).  Now, you walk down the subsequence
you're checking, keeping only a counter of "farthest
left we could possibly be in the big sequence so far",
which you successively increase to the smallest item
in "places" that is larger than the current counter --
fail if you ever find an item in the subsequence that
is not in the alphabet, or if the restricted 'places'
is ever empty -- succeed if you get to the end.  This
looks a bit worse than O(M) in time, but not by much
unless there is a LOT of repetitiousness in the big
sequence; with bisection (binary search) over the
'places' sequences, O(M log N) appears like an easy
upperbound, and maybe it can be cut down further.
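That walk, sketched with the standard bisect module (function name mine):

```python
from bisect import bisect_right

def is_gen_subsequence(subs, seq):
    # O(N) preprocessing: map each 'alphabet' element of seq
    # to the sorted list of positions where it occurs.
    places = {}
    for pos, x in enumerate(seq):
        places.setdefault(x, []).append(pos)
    cursor = -1  # farthest-left position matched so far
    for x in subs:
        if x not in places:            # not in the alphabet
            return 0
        occ = places[x]
        i = bisect_right(occ, cursor)  # first occurrence past cursor
        if i == len(occ):              # restricted 'places' is empty
            return 0
        cursor = occ[i]
    return 1
```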

Of course, this only makes any sense for REALLY BIG
sequences.  The factory function could check for that,
returning a simpleminded wrapper class for sequences
up to some threshold length, and a sophisticated one
for longer sequences...!-)


> > >    "Waldo" in split("Ralph Waldo Emerson")
> > >
> > > It might be old-fashioned, but... So what :-)
> >
> > So it doesn't work unless you "from string import *" (horrid
> > idea), "from string import split" (doubtful), or rewrite it
> > using an explicit string.split (probably best,
>
> Why? This would in my opinion clearly depend on the size of
> the script... If you don't expect another split function to
> appear, I think it's quite OK to use
>
>   from string import split

In a small module, maybe.  Remember the status of local
import statements is in doubt... they may be disallowed
in Python 2.1 (hopefully they won't be, but, who knows),
so it's not clear that you could do this inside a function.
"doubtful" seems a fair assessment to me -- there ARE
doubts, depending on context; it's not necessarily "quite
OK" -- it may or may not be.

Why not just use string.split and play it safe...?

Or, just as safe, the string-method...?

> I mean -- I'm not against string methods. They're nice. I
> just get a bit woozy seeing them called on literals. It

I think we'll all get used to that in time.

> just gives me flashes of weird stuff like
>
>   1.plus(2)
>
> And... In my mind methods seem to be for modifying an
> object's state first of all. Oh, well. Here we go -- another
> pointless discussion. Sorry :)

Why would mutators be any more worthy of syntax-sugar
support than accessors?  You've lost me here...


Alex





