Match beginning of two strings

Jeff Epler jepler at unpythonic.net
Sat Aug 2 23:23:12 EDT 2003


This is a naive implementation of the 'extract' function.
    def extract(a, b):
	m = min(len(a), len(b))
	for i in range(m):
	    if a[i] != b[i]:
		return a[:i]
	return a[:m]

Here's one that uses the new zip() function:
    def extract2(a, b):
	m = min(len(a), len(b))
	for i, ai, bi in (range(m), a, b):
	    if ai != bi: return a[:i]
	return a[:m]
.. unfortunately, it seems to be slower than the first method.  On my
machine (800MHz PIII):
$ python timeit.py -s 'import ravi' \
    'ravi.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
10000 loops, best of 3: 32.7 usec per loop

If your goal is actually something slightly different---for instance,
find the string from a list with the largest shared prefix to a given
string---then you probably need to research an efficient algorithm.*
Otherwise, you may be stuck with the above.  If you have 200GB, and each
line is 80 chars, you have about 2.7 billion lines.  If you call extract()
once per line, and it takes 32 usec, you're looking at 24 hours to run.

Writing extract as a C extension may be wise, you're likely to be able to
cut those 32 usec down to little more than Python function call overhead,
or <2usec per loop.  That makes your 2.7x10^9 calls only take 70 minutes.

Jeff
* The efficient algorithm for this problem might involve arranging the
  list of strings as a tree, which makes finding the right prefix for
  a given string against the list take only about lg(len(l)) steps,
  instead of len(l) steps.  This isn't so much a Python problem as a
  computer science problem, though.





More information about the Python-list mailing list