optimization question
Tim Peters
tim.one at comcast.net
Sun Aug 11 14:09:30 EDT 2002
[Andrew Koenig]
> ...
> The application I'm thinking of will do lots of comparisons of the form
>
> s[i:j] == t
Oops! My apologies -- I read "==" as "=" in your original post.
> where s is very large (many megabytes), i and j are arbitrary (and
> might be very far apart), and len(t) will almost always be small. I do
> not want to create a lot of large substrings, only to find out that
> none of them can possibly be equal to t because they're too big. And
> if I don't do something about it in advance, it will be a real
> nuisance to find all of the places where the code needs to change.
Your fears were right the first time: s[i:j] creates a new string object
each time.
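A quick sketch of why this matters (modern Python, using `sys.getsizeof` to show the slice owns its own copy of the data -- the names here are illustrative, not from the original post):

```python
import sys

s = "x" * 1_000_000          # a large string
sub = s[0:500_000]           # slicing allocates and copies a new object

# The slice is a distinct object carrying roughly half a megabyte of
# its own character data, not a view into s.
assert sub is not s
assert sys.getsizeof(sub) >= 500_000
```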
> So, for example, in this case, I believe it is worth writing a function
> that might look like this:
>
> def eqsub(s, i, j, t):
>     return s[i:j] == t
>
> and call this function everywhere instead of doing direct comparisons.
> Then the optimization becomes trivial:
>
> def eqsub(s, i, j, t):
>     return (len(t) == j-i) and s[i:j] == t
>
> which avoids building the substrings unless necessary.
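For concreteness, the optimized helper quoted above behaves like this (a minimal, runnable restatement; the test values are mine, not from the post):

```python
def eqsub(s, i, j, t):
    # Compare lengths first; Python's `and` short-circuits, so the
    # (potentially huge) substring is built only when the lengths match.
    return len(t) == j - i and s[i:j] == t

# Length matches and contents match:
assert eqsub("abcdef", 1, 3, "bc")
# Length mismatch: returns False without ever slicing:
assert not eqsub("abcdef", 1, 3, "bcd")
```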
A bad alternative is to use the under-documented and doomed-to-deprecation
buffer object, a la
buffer(s, i, j-i) == buffer(t)
Creating a buffer is constant-time, and buffer comparison then takes at
worst min(j-i, len(t)) byte comparisons. So if t is almost always small,
this will almost always run in worst-case O(len(t)) time. It might be worth
lobbying to add rich comparison to the proposed "bytes" object (see
Python-Dev over the last month; I don't have canned pointers), although
conflating text strings with bytes is something Python is struggling to get
away from.
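The `buffer` object was indeed removed in Python 3; its closest modern analogue is `memoryview`, which likewise wraps existing data without copying and supports rich comparison. A sketch of the same trick in today's Python (bytes rather than text, since `memoryview` requires a buffer-providing object; the sample data is mine):

```python
s = b"a" * 1_000_000 + b"needle" + b"a" * 1_000_000
t = b"needle"
i, j = 1_000_000, 1_000_006

# Constructing a memoryview copies no bytes, and slicing a view is
# also copy-free -- both are cheap regardless of j - i.
assert memoryview(s)[i:j] == t

# A length mismatch is detected from the view's shape alone, before
# any byte-by-byte comparison.
assert memoryview(s)[0:5] != t
```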