Match beginning of two strings

Mon Aug 4 15:27:25 EDT 2003

On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli <aleax at aleax.it> wrote:

>Ravi wrote:
>
>> Hi,
>> 
>> I have about 200GB of data that I need to go through and extract the
>> common first part of a line. Something like this.
>> 
>>  >>>a = "abcdefghijklmnopqrstuvwxyz"
>>  >>>b = "abcdefghijklmnopBHLHT"
>>  >>>c = extract(a,b)
>>  >>>print c
>> "abcdefghijklmnop"
>> 
>> Here I want to extract the common string "abcdefghijklmnop". Basically I
>> need a fast way to do that for any two given strings. For my situation,
>> the common string will always be at the beginning of both strings. I can
>
>Here's my latest study on this:
>
>*** pexa.py:
>
[...]

JFTHOI, if you have the inclination, I'm curious how this slightly
different 2.3-dependent version (it needs enumerate) would fare in
your harness, on your system, alongside the rest:

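def commonprefix(s1, s2): # very little tested!
    try:
        for i, c in enumerate(s1):
            if c != s2[i]:
                return s1[:i]  # first mismatch ends the common prefix
    except IndexError:
        # s2 ran out first, so all of s2 (== s1[:i]) is the common prefix
        return s1[:i]
    return s1  # s1 is empty or is itself a prefix of s2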
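
Since the function above is flagged as very little tested, here is a
small sanity check, offered only as a sketch. It compares the result
with os.path.commonprefix from the standard library, which also works
character by character. The input pairs are made up for illustration
(only the first comes from the original question), and the function is
repeated so the snippet runs on its own.

```python
import os.path

def commonprefix(s1, s2):
    # Same logic as the function above, repeated so this runs standalone.
    try:
        for i, c in enumerate(s1):
            if c != s2[i]:
                return s1[:i]
    except IndexError:
        return s1[:i]
    return s1

# Illustrative inputs: a mismatch in the middle, either string running
# out first, identical strings, empty strings, and nothing in common.
cases = [
    ("abcdefghijklmnopqrstuvwxyz", "abcdefghijklmnopBHLHT"),
    ("abc", "abcdef"),
    ("abcdef", "abc"),
    ("same", "same"),
    ("", "abc"),
    ("abc", ""),
    ("xyz", "abc"),
]

for s1, s2 in cases:
    # os.path.commonprefix takes a list of strings and returns their
    # longest common leading substring.
    expected = os.path.commonprefix([s1, s2])
    assert commonprefix(s1, s2) == expected, (s1, s2)
```

This only checks correctness on a handful of cases; it says nothing
about how the function would fare in the timing harness.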

[...]

>
>and my measurements give me:
>
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 2.39 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 2.14 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract2("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>10000 loops, best of 3: 30.2 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 9.59 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract_pyrex("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>10000 loops, best of 3: 21.8 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 1.88 usec per loop
>[alex@lancelot exi]$
>
Interesting, but I think I will have to write a filter so I can
see a little more easily what your timeit.py outputs say ;-)
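
Something along these lines might do as a first cut at such a filter.
It is only a sketch: it assumes each timed call appears on its own
line as pexa.<name>(...) and is followed by a result line of the form
"N loops, best of M: T usec per loop", as in the quoted transcript.
The names summarize, CALL_RE and RESULT_RE are made up for
illustration.

```python
import re

# The call being timed, e.g. 'pexa.extract_c("...", "...")'.
CALL_RE = re.compile(r"pexa\.(\w+)\(")
# A timeit result line, e.g. "100000 loops, best of 3: 1.88 usec per loop".
RESULT_RE = re.compile(r"best of \d+: ([\d.]+) (\w+) per loop")

def summarize(lines):
    """Return (function_name, time, unit) tuples from a timeit transcript."""
    rows = []
    name = None
    for line in lines:
        # Drop mail-quote and shell-continuation markers at the left edge.
        line = line.lstrip("> ")
        m = CALL_RE.search(line)
        if m:
            name = m.group(1)
            continue
        m = RESULT_RE.search(line)
        if m and name is not None:
            rows.append((name, float(m.group(1)), m.group(2)))
            name = None
    return rows

# Two of the quoted measurements, as they appear in the transcript.
sample = """\
>'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 9.59 usec per loop
>'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 1.88 usec per loop
"""

# Sorting by the number alone assumes every row uses the same unit,
# which holds for this transcript (all usec).
for name, t, unit in sorted(summarize(sample.splitlines()), key=lambda r: r[1]):
    print("%-15s %6.2f %s" % (name, t, unit))
```

Runs that time the same function twice come out as separate rows, so
nothing is averaged or thrown away.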

Regards,
Bengt Richter