testing if a list contains a sublist

Tue Aug 16 08:53:58 EDT 2011

In article <8739h18rzj.fsf at dpt-info.u-strasbg.fr>,
 Alain Ketterlin <alain at dpt-info.u-strasbg.fr> wrote:

> Roy Smith <roy at panix.com> writes:
> 
> >> what is the best way to check if a given list (lets call it l1) is
> >> totally contained in a second list (l2)?
> 
> [...]
> > import re
> >
> > def sublist(l1, l2):
> >     s1 = ''.join(map(str, l1))
> >     s2 = ''.join(map(str, l2))
> >     return re.search(s1, s2)
> 
> This is complete nonsense (try it on [12] and [1,2]).

No, it's not complete nonsense, it works just fine for the limited data 
set the OP presented :-)  Of course, you are correct that it only works 
for single-digit integers, hence my "running and ducking" comment.

> The original problem is "string searching" (where strings here are
> sequences of numbers instead of characters). See
> 
> http://en.wikipedia.org/wiki/String_searching_algorithm
> 
> for any algorithm (Rabin-Karp seems appropriate to me).

Yes, of course this is fundamentally a string searching problem.  The 
problem is that while Python comes with an excellent tool for doing 
these kinds of string searches, its domain is limited to, as you pointed 
out, character strings.  However, for the limited data the OP presented, 
there is a trivial way to map the original data domain (single digit 
integers) into the domain the re library knows about (character 
strings).  My answer may be a sucky general-purpose solution, but it's 
also a neat hack, and sometimes a neat hack is what you need.

It also occurs to me that this hack could be expanded a bit to handle a 
much larger input domain.  If you had sequences of arbitrary (but 
hashable) objects, you could map this into a unicode string where each 
"character" is the 32-bit hash of of the corresponding input object, 
then run the regex search on that.

In theory, it should work.  In practice, I can see a few potential 
problems:

1) You might have to carefully craft your hash function to avoid magic 
unicode values like null or BOM.  No biggie there.

2) Hash collisions can lead to false positives, so you need to filter 
those out with a subsequent linear comparison step.  Of course, this is 
true of any kind of hash lookup.  No biggie there either.

3) While the re module is advertised to work on unicode data, I have no 
idea how efficient it is on arbitrary values.  I wouldn't be too 
surprised if it's tuned for "mostly ascii utf-8" and performance suffers 
on completely random strings.  If its first step is to transcode to 
utf-32 internally, I expect it would work just fine, but it would take 
some experimentation (or reading of the source code) to validate this 
assumption.