python/dist/src/Objects stringobject.c, 2.227, 2.228
Update of /cvsroot/python/python/dist/src/Objects In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21408/Objects Modified Files: stringobject.c Log Message: * Beef-up testing of str.__contains__() and str.find(). * Speed-up "x in y" where x has more than one character. The existing code made excessive calls to the expensive memcmp() function. The new code uses memchr() to rapidly find a start point for memcmp(). In addition to knowing that the first character is a match, the new code also checks that the last character is a match. This significantly reduces the incidence of false starts (saving memcmp() calls and making quadratic behavior less likely). Improves the timings on: python -m timeit -r7 -s"x='a'*1000" "'ab' in x" python -m timeit -r7 -s"x='a'*1000" "'bc' in x" Once this code has proven itself, then string_find_internal() should refer to it rather than running its own version. Also, something similar may apply to unicode objects. Index: stringobject.c =================================================================== RCS file: /cvsroot/python/python/dist/src/Objects/stringobject.c,v retrieving revision 2.227 retrieving revision 2.228 diff -u -d -r2.227 -r2.228 --- stringobject.c 31 Jan 2005 17:09:25 -0000 2.227 +++ stringobject.c 20 Feb 2005 04:07:04 -0000 2.228 @@ -1002,8 +1002,12 @@ static int string_contains(PyObject *a, PyObject *el) { - const char *lhs, *rhs, *end; - int size; + char *s = PyString_AS_STRING(a); + const char *sub = PyString_AS_STRING(el); + char *last; + int len_sub = PyString_GET_SIZE(el); + int shortsub; + char firstchar, lastchar; if (!PyString_CheckExact(el)) { #ifdef Py_USING_UNICODE @@ -1016,20 +1020,29 @@ return -1; } } - size = PyString_GET_SIZE(el); - rhs = PyString_AS_STRING(el); - lhs = PyString_AS_STRING(a); - - /* optimize for a single character */ - if (size == 1) - return memchr(lhs, *rhs, PyString_GET_SIZE(a)) != NULL; - end = lhs + (PyString_GET_SIZE(a) - size); - while (lhs <= end) { - if (memcmp(lhs++, rhs, size) == 0) + if (len_sub == 0) + return 1; + /* last points to one char beyond the start of the rightmost + substring. When s<last, there is still room for a possible match + and s[0] through s[len_sub-1] will be in bounds. + shortsub is len_sub minus the last character which is checked + separately just before the memcmp(). That check helps prevent + false starts and saves the setup time for memcmp(). + */ + firstchar = sub[0]; + shortsub = len_sub - 1; + lastchar = sub[shortsub]; + last = s + PyString_GET_SIZE(a) - len_sub + 1; + while (s < last) { + s = memchr(s, firstchar, last-s); + if (s == NULL) + return 0; + assert(s < last); + if (s[shortsub] == lastchar && memcmp(s, sub, shortsub) == 0) return 1; + s++; } - return 0; }
participants (1)
-
rhettingerīŧ users.sourceforge.net