On 19 October 2012 11:02, Duncan Booth email@example.com wrote:
Hrvoje Niksic firstname.lastname@example.org wrote:
On 10/19/2012 03:22 AM, Benjamin Peterson wrote:
It would be interesting to see how common it is for strings which have their hash computed to be compared.
Since all identifier-like strings mentioned in Python are interned, and therefore have had their hash computed, I would imagine comparing them to be fairly common. After all, strings are often used as makeshift enums in Python.
On the flip side, those strings are typically small, so a measurable overall speed improvement brought by such a change seems unlikely.
I'm pretty sure it would result in a small slowdown.
Many (most?) of the comparisons against interned identifiers will be done by dictionary lookups and the dictionary lookup code only tries the string comparison after it has determined that the hashes match. The only time dictionary key strings contents are actually compared is when the hash matches but the pointers are different; it is already the case that if the hashes don't match the strings are never compared.
My understanding is that:
The first part of string comparison is an identity check (pointer comparison) which catches the case of comparing two equal interned strings. What is not checked for is comparing two unequal interned strings (check both strings are interned and then conclude that unequal pointers means unequal strings). The hashes are also not checked during a standard string comparison. The lengths are also not checked since the unicode_compare function is a generic cmp() function for a rich comparison rather than a specific function for equal/unequal comparison: http://hg.python.org/cpython/file/321414874b26/Objects/unicodeobject.c#l6114
Most string comparisons occur during dict lookup in which case the hash is (implicitly?) checked before the strings are compared. A second hash comparison in this case would often be redundant.
My *opinion* is that:
On average character by character string comparison stops very quickly after comparing 1 or 2 characters. If this is assumed to be the case then it might be wasteful to check whether the strings have hashes and then compare the hashes (likewise for checking whether they are interned).
This was the subject of lengthy debate in the recent python-list thread "comparing strings from the back": http://mail.python.org/pipermail/python-list/2012-September/629866.html
Consider the case where the strings are not interned, non-empty, and differ in the first character:
def current(s1, s2): if s1 is s2: return True elif s1.len > 0 and s2.len > 0 and s1 != s2: return False # Most of the time we don't reach this point ...
def proposed_hashcheck(s1, s2): if s1 is s2: return True elif s1.hash is not None and s2.hash is not None and s1.hash != s2.hash: return False # I think we often reach this point when not comparing within a dict elif s1.len > 0 and s2.len > 0 and s1 != s2: return False ...
When comparing within a dict the hash check (modulo dict size) is already implicitly likely-affirmative and I assume (I'm not sure) that the full hashes are explicitly checked before comparing the strings. Either way I think that the additional hash check within the string comparison will normally be redundant.
When not comparing within a dict, it's hard to say how often the strings have hashes but if they usually don't then the extra hash-check will normally be redundant in that case as well. If the character by character comparison fails at the first character it may be quicker to optimise for that case than checking if both hashes are present and then comparing the hashes.
Other potential optimisations include comparing length before comparing characters:
def proposed_lencheck(s1, s2): if s1 is s2: return True elif s1.len != s2.len: return False elif s1.len > 0 and s2.len > 0 and s1 != s2: return False ...
In the above I'm trying to represent the fast path when comparing strings that differ in the first character. One case that isn't helped by either of length or hash checking is comparing equal strings: you always need to check all the characters (unless the identity check succeeds).
Depending on how often unequal interned strings are compared another method could be:
def proposed_interncheck(s1, s2): if s1 is s2: return True elif s1.interned and s2.interned: return False elif s1.len > 0 and s2.len > 0 and s1 != s2: return False ...
All of these solutions add complexity over the current method and require the string comparison functions to be refactored from the current generic rich compare method. None of them (I think) would help much when comparing strings whose first character differs (the common case). The one case where they may help significantly is that discussed in the other thread where you are comparing many strings that for whatever reason have a long common prefix:
'/usr/share/lib/prog/file1' == '/usr/share/lib/prog/file2'