[Python-Dev] Re: string find(substring) vs. substring in string
Fredrik Lundh
fredrik at pythonware.com
Wed Feb 16 22:19:20 CET 2005
A.M. Kuchling wrote:
>> time. But I would also hope that it would be smart enough to know that it
>> doesn't need to look past the 2nd character in 'not the xyz' when it is
>> searching for 'not there' (due to the lengths of the sequences).
>
> Assuming stringobject.c:string_contains is the right function, the
> code looks like this:
>
> size = PyString_GET_SIZE(el);
> rhs = PyString_AS_STRING(el);
> lhs = PyString_AS_STRING(a);
>
> /* optimize for a single character */
> if (size == 1)
> return memchr(lhs, *rhs, PyString_GET_SIZE(a)) != NULL;
>
> end = lhs + (PyString_GET_SIZE(a) - size);
> while (lhs <= end) {
> if (memcmp(lhs++, rhs, size) == 0)
> return 1;
> }
>
> So it's doing a zillion memcmp()s. I don't think there's a more
> efficient way to do this with ANSI C; memmem() is a GNU extension that
> searches for blocks of memory.
oops. so whoever implemented contains didn't even bother to look at the
find implementation... (which uses the same brute-force algorithm, but a better
implementation...)
> Perhaps saving some memcmps by writing
>
> if ((*lhs == *rhs) && memcmp(lhs++, rhs, size) == 0)
>
> would help.
memcmp still compiles to REP CMPB on many x86 compilers, and the setup
overhead for memcmp sucks on modern x86 hardware; it's usually better to
write your own bytewise comparision...
(and the fact that we're still brute-force search algorithms in "find" is a bit
embarrassing -- note that RE outperforms "in" by a factor of five.... guess
it's time to finish the split/replace parts of stringlib and produce a patch... ;-)
</F>
More information about the Python-Dev
mailing list