Match beginning of two strings

Mon Aug 4 07:56:04 EDT 2003

Ravi wrote:

> Hi,
> 
> I have about 200GB of data that I need to go through and extract the
> common first part of a line. Something like this.
> 
>  >>>a = "abcdefghijklmnopqrstuvwxyz"
>  >>>b = "abcdefghijklmnopBHLHT"
>  >>>c = extract(a,b)
>  >>>print c
> "abcdefghijklmnop"
> 
> Here I want to extract the common string "abcdefghijklmnop". Basically I
> need a fast way to do that for any two given strings. For my situation,
> the common string will always be at the beginning of both strings. I can

Here's my latest study on this:

*** pexa.py:

import sys

import psyco
psyco.full()

import cexa
import exa

def extract(a, b):
    m = min(len(a), len(b))
    for i in range(m):
        if a[i] != b[i]:
            return a[:i]
    return a[:m]

def extract2(a, b):
    for i, ai, bi in zip(xrange(len(a)), a, b):
        if ai != bi: return a[:i]
    return a[:m]

def extract3(a, b):
    for i, ai in enumerate(a):
        if ai != b[i:i+1]:
            return a[:i]
    return a

extract_pyrex = exa.exa

extract_c = cexa.cexa

*** exa.pyx:

def exa(a, b):
    cdef int la
    cdef int lb
    la = len(a)
    lb = len(b)
    cdef int lmin
    lmin = min(la, lb)
    cdef int i
    i = 0
    while i < lmin:
        if a[i] != b[i]:
            return a[:i]
        i = i + 1
    if lmin == la:
        return a
    else:
        return b

*** cexa.c:

#include <Python.h>

static PyObject*
cexa(PyObject* self, PyObject* args)
{
    char *a, *b;
    int la, lb;
    int lmin, i;
    if(!PyArg_ParseTuple(args, "s#s#", &a, &la, &b, &lb))
        return 0;

    lmin = la;
    if(lmin<lb) lmin = lb;

    for(i=0; i<lmin; i++)
        if(a[i] != b[i])
            break;

    return Py_BuildValue("s#", a, i);
}

static PyMethodDef cexaMethods[] = {
    {"cexa", cexa, METH_VARARGS, "Extract common prefix"},
    {0}
};

void
initcexa(void)
{
    Py_InitModule("cexa", cexaMethods);
}

I've built the pyrex-coded extension with:

from distutils.core import setup
from distutils.extension import Extension
from Pyrex.Distutils import build_ext

setup(
  name = "exa",
  ext_modules=[ 
    Extension("exa", ["exa.pyx"])
    ],
  cmdclass = {'build_ext': build_ext}
)

and the C-coded one with:

from distutils.core import setup
from distutils.extension import Extension

setup(
  name = "cexa",
  ext_modules=[ 
    Extension("cexa", ["cexa.c"])
    ],
)

and my measurements give me:

[alex at lancelot exi]$ python -O timeit.py -s 'import pexa' \
> 'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 2.39 usec per loop
[alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 2.14 usec per loop
[alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract2("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
10000 loops, best of 3: 30.2 usec per loop
[alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 9.59 usec per loop
[alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract_pyrex("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
10000 loops, best of 3: 21.8 usec per loop
[alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 1.88 usec per loop
[alex at lancelot exi]$

So, it seems you can still get a tiny drop of extra speed
with a C-coded extension, though it's doubtful whether it's
worth the bother wrt the pyro-optimized simple Python code
in function 'extract'.  I'm not sure where I went wrong in
the Pyrex coding (it doesn't seem to be performing anywhere
as well as I thought it might) and I'll be happy for real
Pyrex expert to show me the way.

Of course, as others have pointed out, it's unclear from
your problem description that doing such operations pairwise
on a lot of pairs of strings is actually what you need.  It
IS quite possible that what you're doing could often be
better modeled, e.g., by repeated "prefix extractions"
between ONE fixed string and several other candidate strings;
or "prefix extraction" between a set of more than two strings.
In each case, it's likely that you can get much better
performance by more sophisticated algorithms.  However,
which algorithms those might be is unclear unless you can
provide mode details on what you're doing.

Alex