[Python-Dev] Why not using the hash when comparing strings?

Fri Oct 19 15:09:05 CEST 2012

On 19 October 2012 11:02, Duncan Booth
<duncan.booth at suttoncourtenay.org.uk> wrote:
> Hrvoje Niksic <hrvoje.niksic at avl.com> wrote:
>
>> On 10/19/2012 03:22 AM, Benjamin Peterson wrote:
>>> It would be interesting to see how common it is for strings which have
>>> their hash computed to be compared.
>>
>> Since all identifier-like strings mentioned in Python are interned, and
>> therefore have had their hash computed, I would imagine comparing them
>> to be fairly common. After all, strings are often used as makeshift
>> enums in Python.
>>
>> On the flip side, those strings are typically small, so a measurable
>> overall speed improvement brought by such a change seems unlikely.
>
> I'm pretty sure it would result in a small slowdown.
>
> Many (most?) of the comparisons against interned identifiers will be done
> by dictionary lookups and the dictionary lookup code only tries the string
> comparison after it has determined that the hashes match. The only time
> dictionary key strings contents are actually compared is when the hash
> matches but the pointers are different; it is already the case that if the
> hashes don't match the strings are never compared.

My understanding is that:

The first part of string comparison is an identity check (pointer
comparison) which catches the case of comparing two equal interned
strings. What is not checked for is comparing two unequal interned
strings (check both strings are interned and then conclude that
unequal pointers means unequal strings). The hashes are also not
checked during a standard string comparison. The lengths are also not
checked since the unicode_compare function is a generic cmp() function
for a rich comparison rather than a specific function for
equal/unequal comparison:
http://hg.python.org/cpython/file/321414874b26/Objects/unicodeobject.c#l6114

Most string comparisons occur during dict lookup in which case the
hash is (implicitly?) checked before the strings are compared. A
second hash comparison in this case would often be redundant.

My *opinion* is that:

On average character by character string comparison stops very quickly
after comparing 1 or 2 characters. If this is assumed to be the case
then it might be wasteful to check whether the strings have hashes and
then compare the hashes (likewise for checking whether they are
interned).

This was the subject of lengthy debate in the recent python-list
thread "comparing strings from the back":
http://mail.python.org/pipermail/python-list/2012-September/629866.html

Consider the case where the strings are not interned, non-empty, and
differ in the first character:

    def current(s1, s2):
        if s1 is s2:
            return True
        elif s1.len > 0  and s2.len > 0 and s1[0] != s2[0]:
            return False
        # Most of the time we don't reach this point
        ...

    def proposed_hashcheck(s1, s2):
        if s1 is s2:
            return True
        elif s1.hash is not None and s2.hash is not None and s1.hash != s2.hash:
            return False
        # I think we often reach this point when not comparing within a dict
        elif s1.len > 0  and s2.len > 0 and s1[0] != s2[0]:
            return False
        ...

When comparing within a dict the hash check (modulo dict size) is
already implicitly likely-affirmative and I assume (I'm not sure) that
the full hashes are explicitly checked before comparing the strings.
Either way I think that the additional hash check within the string
comparison will normally be redundant.

When not comparing within a dict, it's hard to say how often the
strings have hashes but if they usually don't then the extra
hash-check will normally be redundant in that case as well. If the
character by character comparison fails at the first character it may
be quicker to optimise for that case than checking if both hashes are
present and then comparing the hashes.

Other potential optimisations include comparing length before
comparing characters:

    def proposed_lencheck(s1, s2):
        if s1 is s2:
            return True
        elif s1.len != s2.len:
            return False
        elif s1.len > 0  and s2.len > 0 and s1[0] != s2[0]:
            return False
        ...

In the above I'm trying to represent the fast path when comparing
strings that differ in the first character. One case that isn't helped
by either of length or hash checking is comparing equal strings: you
always need to check all the characters (unless the identity check
succeeds).

Depending on how often unequal interned strings are compared another
method could be:

    def proposed_interncheck(s1, s2):
        if s1 is s2:
            return True
        elif s1.interned and s2.interned:
            return False
        elif s1.len > 0  and s2.len > 0 and s1[0] != s2[0]:
            return False
        ...

All of these solutions add complexity over the current method and
require the string comparison functions to be refactored from the
current generic rich compare method. None of them (I think) would help
much when comparing strings whose first character differs (the common
case). The one case where they may help significantly is that
discussed in the other thread where you are comparing many strings
that for whatever reason have a long common prefix:

    '/usr/share/lib/prog/file1' == '/usr/share/lib/prog/file2'

Oscar