[Python-porting] Control of hash randomization

Aaron Meurer asmeurer at gmail.com
Sun May 27 10:07:55 CEST 2012


On Sun, May 27, 2012 at 12:22 AM,  <martin at v.loewis.de> wrote:
>> I still couldn't find how to actually get that seed, though.
>
>
> In C, you can look at _Py_HashSecret. In Python, you need to write
> an extension module, or use ctypes on the Python interpreter itself.
>
> However, this is not the seed: when an RNG is used, there is no seed,
> instead, the OS directly provides the hash secret. So your extension
> would also have to support setting the secret, which is tricky because
> the secret is already used by the time the extension gets loaded.
>
> So you would have to change the interpreter to support such a feature.
>
> If the crashes/test failures are frequent enough, I rather recommend
> testing with PYTHONHASHSEED set to random integers.

I see.  This would require spawning a new Python process, so it's not
ideal, but I guess it's the only solution.  I'll give ctypes a try
too.  I don't particularly feel like dabbling in C extensions just to
make our tests a little more helpful (maybe someone else more
courageous will give it a go).
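
For the record, the spawn-a-new-process approach can be sketched
roughly as follows.  This is only a minimal sketch: the helper name
run_with_hash_seed is made up for illustration, and it assumes an
interpreter that honors the PYTHONHASHSEED environment variable.  The
point is that the parent picks the seed itself, so it can report the
seed whenever the child fails.

```python
import os
import random
import subprocess
import sys


def run_with_hash_seed(args, seed=None):
    """Run a child Python under a known PYTHONHASHSEED.

    `args` are the arguments passed to the child interpreter (hypothetical
    helper, not part of any library).  Returns (seed, returncode).
    """
    if seed is None:
        # PYTHONHASHSEED accepts a decimal integer in [0, 4294967295];
        # 0 disables randomization, so start from 1 here.
        seed = random.randint(1, 2**32 - 1)
    # Copy the current environment and override only the hash seed.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    returncode = subprocess.call([sys.executable] + list(args), env=env)
    return seed, returncode
```

A test runner could call run_with_hash_seed(["path/to/run_tests.py"])
(the path is a placeholder), print the seed when the return code is
nonzero, and later pass that same seed back in to replay the failure.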

>
> Also: if a test fails due to hash randomization, it should normally
> be possible to find the root cause by just reviewing the code (long
> enough). It may not be possible to reproduce the failure, but it
> should be obvious if a certain piece of code would fail under hash
> randomization.
>
> Regards,
> Martin

Ha!  Well, that's easy enough to say, but if all you have to work with
is an assertion that failed, and a very large code base, it might not
be so straightforward.  Furthermore, such bugs are very often
not obvious (or else the author probably would not have written them
in the first place).

I do grant that this is possible in principle, but pragmatically
speaking, if it's possible to consistently reproduce a bug, it's 100
times easier to fix it.

It doesn't help that quite a few Python programmers don't understand
just what is and is not guaranteed about hash-dependent objects.  For
example, I've seen this mistake made several times:

a = set(whatever) # or dict
b = list(a)
c = list(a)
assert b == c

The assertion does NOT have to hold, and I've seen situations where it doesn't.
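
If code really does need b and c to agree, one way to get there
without leaning on iteration order at all is sketched below.  The
sample values are arbitrary stand-ins for "whatever", and this is just
one possible pattern, not a claim about how any particular code base
handles it.

```python
a = set(["x", "y", "z"])  # arbitrary stand-in for set(whatever)

# Take one snapshot of the set and derive everything else from it,
# rather than listing the set twice.
b = list(a)
c = list(b)
assert b == c

# Or compare in a way that ignores order entirely.
assert set(b) == a
assert sorted(b) == sorted(a)
```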

That issue is pretty subtle.  The more common case is iterating
through a set or dict (or a tuple that was sorted by hash, which is
the most common case for SymPy), where some subtle fact about
the loop makes the result differ depending on the order of
iteration.  Quite often, the result is still "correct" (in SymPy, this
generally means the answer is still mathematically correct), just not
the same as what the test expected.
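
A toy version of that situation, as a sketch only: format_sum and
format_sum_canonical are made-up names for illustration, not SymPy
functions.  The first builds its output in whatever order the set
happens to iterate; the second imposes an explicit order.

```python
def format_sum(terms):
    """Join terms in iteration order; the output depends on hash order."""
    return " + ".join(terms)


def format_sum_canonical(terms):
    """Join terms in sorted order; the output does not depend on hash order."""
    return " + ".join(sorted(terms))


terms = set(["x", "y", "z"])
result = format_sum(terms)

# Fragile: `assert result == "x + y + z"` may pass or fail depending on
# the order in which the set iterates.

# Robust: the same terms are present, whatever the order.
assert set(result.split(" + ")) == terms

# Also robust: canonicalize the order before comparing.
assert format_sum_canonical(terms) == "x + y + z"
```

Every ordering that format_sum can produce is a "correct" sum of the
same terms, which is exactly why a test that compares against one
fixed string fails only some of the time.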

Aaron Meurer

