union in Python

Mon Aug 18 21:02:11 EDT 2003

On Mon, 18 Aug 2003 18:54:20 -0300, Gustavo Niemeyer <niemeyer at conectiva.com> wrote:

>> When looking for a particular string, rather than a pattern, you are
>> probably better off to use the find or index method of that string
>> rather than re.
>> 
>> >>> 'abcdefg'.find('d')
>> 3
>
>Just out of curiosity, the preference for the find() method rather than
>using SRE might not be so obvious. SRE has an internal optimization
>which handles literals in a different way, avoiding some of the overhead
>from the engine. Additionally, it includes an overlap algorithm which
>speeds up the linear searching in many cases, specially with large data.
>
I wonder why find doesn't take advantage of the same thing, whatever it is
(Boyer-Moore? or some running fifo window hash check?). Probably the usual ...
people who can do it are too busy to do it in free time, and no-one's throwing money ;-)

>Here is a cooked example which highlights that fact:
>
>-----------------------------
>import time
>import re
>
>s1 = "x"*30+"a"
>s2 = "x"*1000000+"a"
>p1 = re.compile(s1)
>
>start = time.time()
>for i in range(1000):
>    p1.search(s2)
>    print time.time()-start
>
>start = time.time()
>for i in range(1000):
>    s2.find(s1)
>    print time.time()-start
>-----------------------------
>
>And here is a sample output:
>
>% python test.py
>45.2645740509
>233.810258031
>
That looks worth improving on. Looking quickly at stringobject.c,
this looks like it might be the forward search guts:
...
	if (dir > 0) {
		if (n == 0 && i <= last)
			return (long)i;
		last -= n;
		for (; i <= last; ++i)
			if (s[i] == sub[0] && memcmp(&s[i], sub, n) == 0)
				return (long)i;

...
If so, you can see a reason for your results.
But I don't see any reason why the performances couldn't be made to match.

Regards,
Bengt Richter