[Python-Dev] SRE 0.9.8 benchmarks

M.-A. Lemburg mal@lemburg.com
Thu, 03 Aug 2000 15:31:34 +0200


Fredrik Lundh wrote:
> 
> mal wrote:
> 
> > Just for compares: would you mind running the search
> > routines in mxTextTools on the same machine ?
> 
> > > searching for "spam" in a string padded with "spaz" (1000 bytes on
> > > each side of the target):
> > >
> > > string.find     0.112 ms
> 
> texttools.find    0.080 ms
> 
> > > sre8.search     0.059
> > > pre.search      0.122
> > >
> > > unicode.find    0.130
> > > sre16.search    0.065
> > >
> > > same test, without any false matches (padded with "-"):
> > >
> > > string.find     0.035 ms
> 
> texttools.find    0.083 ms
> 
> > > sre8.search     0.050
> > > pre.search      0.116
> > >
> > > unicode.find    0.031
> > > sre16.search    0.055
> >
> > Those results are probably due to the fact that string.find
> > does a brute force search. If it would do a last match char
> > first search or even Boyer-Moore (this only pays off for long
> > search targets) then it should be a lot faster than [s|p]re.
> 
> does the TextTools algorithm work with arbitrary character
> set sizes, btw?

The find function creates a Boyer-Moore search object
for the search string (on every call). It compares 1-1
or using a translation table which is applied
to the searched text prior to comparing it to the search
string (this enables things like case insensitive
search and character sets, but is about 45% slower). Real-life
usage would be to create the search objects once per process
and then reuse them. The Boyer-Moore table calcuation takes
some time...

But to answer your question: mxTextTools is 8-bit throughout.
A Unicode aware version will follow by the end of this year.

Thanks for checking,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/