[Python-Dev] Re: string find(substring) vs. substring in string

Wed Feb 16 22:19:20 CET 2005

A.M. Kuchling wrote:

>> time. But I would also hope that it would be smart enough to know that it
>> doesn't need to look past the 2nd character in 'not the xyz' when it is
>> searching for 'not there' (due to the lengths of the sequences).
>
> Assuming stringobject.c:string_contains is the right function, the
> code looks like this:
>
> size = PyString_GET_SIZE(el);
> rhs = PyString_AS_STRING(el);
> lhs = PyString_AS_STRING(a);
>
> /* optimize for a single character */
> if (size == 1)
> return memchr(lhs, *rhs, PyString_GET_SIZE(a)) != NULL;
>
> end = lhs + (PyString_GET_SIZE(a) - size);
> while (lhs <= end) {
> if (memcmp(lhs++, rhs, size) == 0)
> return 1;
> }
>
> So it's doing a zillion memcmp()s.  I don't think there's a more
> efficient way to do this with ANSI C; memmem() is a GNU extension that
> searches for blocks of memory.

oops.  so whoever implemented contains didn't even bother to look at the
find implementation... (which uses the same brute-force algorithm, but a better
implementation...)

> Perhaps saving some memcmps by writing
>
> if ((*lhs  == *rhs) && memcmp(lhs++, rhs, size) == 0)
>
> would help.

memcmp still compiles to REP CMPB on many x86 compilers, and the setup
overhead for memcmp sucks on modern x86 hardware; it's usually better to
write your own bytewise comparision...

(and the fact that we're still brute-force search algorithms in "find" is a bit
embarrassing -- note that RE outperforms "in" by a factor of five....  guess
it's time to finish the split/replace parts of stringlib and produce a patch... ;-)

</F>