[Python-Dev] string find(substring) vs. substring in string

Wed Feb 16 22:03:10 CET 2005

> Assuming stringobject.c:string_contains is the right function, the
> code looks like this:
> 
>         size = PyString_GET_SIZE(el);
>         rhs = PyString_AS_STRING(el);
>         lhs = PyString_AS_STRING(a);
> 
>         /* optimize for a single character */
>         if (size == 1)
>                 return memchr(lhs, *rhs, PyString_GET_SIZE(a)) != NULL;
> 
>         end = lhs + (PyString_GET_SIZE(a) - size);
>         while (lhs <= end) {
>                 if (memcmp(lhs++, rhs, size) == 0)
>                         return 1;
>         }
> 
> So it's doing a zillion memcmp()s.  I don't think there's a more
> efficient way to do this with ANSI C; memmem() is a GNU extension that
> searches for blocks of memory.  Perhaps saving some memcmps by writing
> 
>          if ((*lhs  == *rhs) && memcmp(lhs++, rhs, size) == 0)
> 
> would help.

Which is exactly how s.find() wins this race. (I guess it loses when
it's found by having to do the "find" lookup.) Maybe string_contains
should just call string_find_internal()?

And then there's the question of how the re module gets to be faster
still; I suppose it doesn't bother with memcmp() at all.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)