Fredrik Lundh wrote:
mal wrote:
Just for compares: would you mind running the search routines in mxTextTools on the same machine ?
searching for "spam" in a string padded with "spaz" (1000 bytes on each side of the target):
string.find 0.112 ms
texttools.find 0.080 ms
sre8.search 0.059 pre.search 0.122
unicode.find 0.130 sre16.search 0.065
same test, without any false matches (padded with "-"):
string.find 0.035 ms
texttools.find 0.083 ms
sre8.search 0.050 pre.search 0.116
unicode.find 0.031 sre16.search 0.055
Those results are probably due to the fact that string.find does a brute force search. If it would do a last match char first search or even Boyer-Moore (this only pays off for long search targets) then it should be a lot faster than [s|p]re.
does the TextTools algorithm work with arbitrary character set sizes, btw?
The find function creates a Boyer-Moore search object for the search string (on every call). It compares 1-1 or using a translation table which is applied to the searched text prior to comparing it to the search string (this enables things like case insensitive search and character sets, but is about 45% slower). Real-life usage would be to create the search objects once per process and then reuse them. The Boyer-Moore table calcuation takes some time... But to answer your question: mxTextTools is 8-bit throughout. A Unicode aware version will follow by the end of this year. Thanks for checking, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/