[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

Thu Jul 8 00:32:19 CEST 2010

On 7/7/2010 4:11 PM, Tres Seaver wrote:

> Antoine Pitrou wrote:
>> On Wed, 7 Jul 2010 19:44:31 +0200
>> Eli Bendersky<eliben at gmail.com>  wrote:
>>> For what it's worth, my benchmarking showed that modifying the
>>> heuristic to only kick in when there are more than 100 kinds of
>>> elements (Terry's option A) didn't affect the runtime of matching
>>> whatsoever, even when the heuristic *does* kick in. All it adds,
>>> really, is the overhead of a single 'if' statement. So it wouldn't be
>>> right to assume that somehow modifying the heuristic or allowing to
>>> turn it off will negatively affect performance in the special case Tim
>>> originally optimized for.
>>
>> Just because it doesn't affect performance in your tests doesn't mean it
>> won't do so in the general case. Consider a case where Tim's junk
>> optimization kicked in and helped improve performance a lot, but where
>> there are still less than 100 alphabet symbols. The new heuristic will
>> ruin this use case.
>
> That would describe pretty much every C program ever written, for
> instance, and nearly as high a percentage of all Python modules /
> scripts ever written:  the 'string.printable' set, less formfeed and
> vertical tab, is 98 characters long.

In the primary use case, programs are compared by line, not characters, 
and there are more than 100 different lines in any sensible program of 
at least 200 lines.

-- 
Terry Jan Reedy