optimization question

Tim Peters tim.one at comcast.net
Sun Aug 11 14:09:30 EDT 2002


[Andrew Koenig]
> ...
> The application I'm thinking of will do lots of comparisons of the form
>
>         s[i:j] == t

Oops!  My apologies -- I read "==" as "=" in your original post.

> where s is very large (many megabytes), i and j are arbitrary (and
> might be very far apart), and len(t) will almost always be small.  I do
> not want to create a lot of large substrings, only to find out that
> none of them can possibly be equal to t because they're too big.  And
> if I don't do something about it in advance, it will be a real
> nuisance to find all of the places where the code needs to change.

Your fears were right the first time:  s[i:j] creates a new string object
each time.
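
For illustration (sizes made up), the slice below copies a million
characters just to answer a comparison that a length check could have
settled instantly:

    s = "x" * 10000000      # a large string, ~10 MB
    t = "hello"
    # s[0:1000000] allocates and copies a fresh million-character string;
    # the copy is thrown away as soon as the comparison finishes.
    print s[0:1000000] == t     # False, after a 1 MB copy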

> So, for example, in this case, I believe it is worth writing a function
> that might look like this:
>
>         def eqsub(s, i, j, t):
>                 return s[i:j] == t
>
> and call this function everywhere instead of doing direct comparisons.
> Then the optimization becomes trivial:
>
>         def eqsub(s, i, j, t):
>                 return (len(t) == j-i) and s[i:j] == t
>
> which avoids building the substrings unless necessary.

A bad alternative is to use the under-documented and doomed-to-deprecation
buffer object, a la

    buffer(s, i, j-i) == buffer(t)

Creating a buffer is constant-time, and then buffer comparison takes at
worst min(j-i, len(t)) byte comparisons.  So if t is almost always small,
this will almost always take worst-case O(len(t)) time.  It might be worth
lobbying to add rich comparison to the proposed "bytes" object (see
Python-Dev over the last month; I don't have canned pointers), although
conflating text strings with bytes is something Python is struggling to get
away from.
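
For concreteness, a sketch of eqsub redone via buffers (untested, and
remember that buffer may not be long for this world):

    def eqsub(s, i, j, t):
        # No substring is built:  creating a buffer is constant-time, and
        # comparing buffers stops after at most min(j-i, len(t)) byte
        # comparisons.
        return buffer(s, i, j-i) == buffer(t)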




