hash() yields different results for different platforms
Grant Edwards
grante at visi.com
Tue Jul 11 20:07:52 EDT 2006
On 2006-07-11, Qiangning Hong <hongqn at gmail.com> wrote:
> I'm writing a spider. I have millions of urls in a table (mysql) to
> check if a url has already been fetched. To check fast, I am
> considering adding a "hash" column to the table, making it a unique key,
> and use the following sql statement:
> insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> to add new url.
>
> I believe this will be faster than making the "url" column a unique key
> and doing string comparison. Right?
I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).
> However, when I came to Python's builtin hash() function, I
> found it produces different values on my two computers! On a
> Pentium 4, hash('a') -> -468864544; on an amd64, hash('a') ->
> 12416037344. Does the hash function depend on the machine's word
> length?
Apparently. :)
The low 32 bits match, so perhaps you should just use that
portion of the returned hash?
>>> hex(12416037344)
'0x2E40DB1E0L'
>>> hex(-468864544 & 0xffffffffffffffff)
'0xFFFFFFFFE40DB1E0L'
>>> hex(12416037344 & 0xffffffff)
'0xE40DB1E0L'
>>> hex(-468864544 & 0xffffffff)
'0xE40DB1E0L'
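
A minimal sketch of the masking idea, written for Python 3 rather than the Python 2 session above. The helper names (low32, url_hash_crc32, url_hash_md5_64) are made up for illustration and are not from the original post. The first helper only reproduces the masking shown above. The other two are a swapped-in alternative, not what was suggested here: they use zlib.crc32 and hashlib.md5, which give the same value on every platform and in every process, whereas recent Python 3 versions randomize str hashes per process by default, so builtin hash() values should not be stored in a database.

```python
import hashlib
import zlib


def low32(h):
    """Keep only the low 32 bits of an integer hash value."""
    return h & 0xFFFFFFFF


def url_hash_crc32(url):
    """Hypothetical helper: platform-independent 32-bit checksum of a URL."""
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF


def url_hash_md5_64(url):
    """Hypothetical helper: first 8 bytes of the MD5 digest as an unsigned 64-bit int."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# The two values reported in the thread agree once masked to 32 bits.
assert low32(12416037344) == low32(-468864544) == 0xE40DB1E0
```

Note that any 32-bit value will produce some collisions across millions of urls, so a unique key on a 32-bit hash combined with "insert ignore" would silently drop distinct urls that happen to collide. A wider value such as the 64-bit one above makes that far less likely, but does not rule it out.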
--
Grant Edwards, grante at visi.com
Yow! Uh-oh!! I forgot to submit to COMPULSORY URINALYSIS!