[Python-Dev] Counting collisions for the win

Fri Jan 20 19:49:55 CET 2012

On Fri, Jan 20, 2012 at 13:15, Guido van Rossum <guido at python.org> wrote:

> On Fri, Jan 20, 2012 at 5:10 AM, Barry Warsaw <barry at python.org> wrote:
>
>> On Jan 20, 2012, at 01:50 PM, Victor Stinner wrote:
>>
>> >Counting collisions doesn't solve this case, but it doesn't make the
>> >situation worse than before. Quickly raising an exception is better
>> >than stalling for minutes, even if I agree that it is not the best
>> >behaviour.
>>
>> ISTM that adding the possibility of raising a new exception on dictionary
>> insertion is *more* backward incompatible than changing dictionary order,
>> which for a very long time has been known to not be guaranteed.  You're
>> running some application, you upgrade Python because you apply all security
>> fixes, and suddenly you're starting to get exceptions in places you can't
>> really do anything about.  Yet those exceptions are now part of the
>> documented public API for dictionaries.  This is asking for trouble.  Bugs
>> will suddenly start appearing in that application's tracker and they will
>> seem to the application developer like Python just added a new public API
>> in a security release.
>>
>
> Dict insertion can already raise an exception: MemoryError. I think we
> should be safe if the new exception also derives from BaseException. We
> should actually seriously consider just raising MemoryError, since
> introducing a new built-in exception in a bugfix release is also very
> questionable: code explicitly catching or raising it would not work on
> previous bugfix releases of the same feature release.
>
>> OTOH, if you change dictionary order and *that* breaks the application, then
>> the bugs submitted to the application's tracker will be legitimate bugs that
>> have to be fixed even if nothing else changed.
>>
>
> There are lots of things that are undefined according to the language spec
> (and quite possibly known to vary between versions or platforms or
> implementations like PyPy or Jython) but which we would never change in a
> bugfix release.
>
>> So I still think we should ditch the paranoia about dictionary order
>> changing, and fix this without counting.  A little bit of paranoia could
>> creep back in by disabling the hash fix by default in stable releases, but
>> I think it would be fine to make that a compile-time option.
>
>
> I'm sorry, but I don't want to break a user's app with a bugfix release
> and say "haha your code was already broken you just didn't know it".
>
> Sure, the dict order already varies across Python implementations,
> possibly across 32/64 bits or operating systems. But many organizations (I
> know a few :-) have a very large installed software base, created over many
> years by many people with varying skills, that is kept working in part by
> very carefully keeping the environment as constant as possible. This means
> that the target environment is much more predictable than it is for the
> typical piece of open source software.
>
> Sure, a good Python developer doesn't write apps or tests that depend on
> dict order. But time and again we see that not everybody writes perfect
> code every time. Especially users writing "in-house" apps (as opposed to
> frameworks shared as open source) are less likely to always use the most
> robust, portable algorithms in existence, because they may know with much
> more certainty that their code will never be used on certain combinations
> of platforms. For example, I rarely think about whether code I write might
> not work on IronPython or Jython, or even CPython on Windows. And if
> something I wrote suddenly needs to be ported to one of those, well, that's
> considered a port and I'll just accept that it might mean changing a few
> things.
>
> The time to break a dependency on dict order is not with a bugfix release
> but with a feature release: those are more likely to break other things as
> well anyway, and users are well aware that they have to test everything and
> anticipate having to fix some fraction of their code for each feature
> release. OTOH we have established a long and successful track record of
> conservative bugfix releases that don't break anything. (I am aware of
> exactly one thing that was broken by a bugfix release in application code I
> am familiar with.)
>

Why can't we have our cake and eat it too?

Can we do hash randomization in 3.3 and use the hash count solution for
bugfix releases? That way we get a basic fix into the bugfix releases that
won't break people's code (hopefully) but we go with a more thorough (and
IMO correct) solution of hash randomization starting with 3.3 and moving
forward. We aren't breaking compatibility in any way by doing this since
it's a feature release anyway where we change tactics. And it can't be that
much work since we seem to have patches for both solutions. At worst it
will complicate merging commits for the files affected by the patches, but
that will most likely be isolated and not a common collision (and less of an
issue once 3.3 is released later this year).

I understand the desire to keep backwards-compatibility, but collision
counting could cause an error in some random input that someone didn't
expect to cause issues whether they were under a DoS attack or just had
some unfortunate input from private data. The hash randomization, though,
is only weak if someone is attacked, not if they are just using Python with
their own private data.
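
To make the difference concrete, here is a rough sketch, not either of the
actual patches. CountingTable, TooManyCollisions and make_seeded_hash are
names made up for this illustration, the limit of 1000 is an arbitrary
choice, and the keyed BLAKE2 hash only stands in for "a hash with a
per-process secret"; it is not what CPython does or would do. The toy table
uses linear probing and never resizes, so it is nothing like the real dict.

```python
import hashlib


class TooManyCollisions(Exception):
    """Hypothetical exception for this sketch; not a real built-in."""


class CountingTable:
    """Toy fixed-size table that gives up on an insert after too many probes."""

    def __init__(self, size=2048, limit=1000, hash_func=hash):
        if limit >= size:
            raise ValueError("limit must be smaller than size")
        self.slots = [None] * size
        self.limit = limit
        self.hash_func = hash_func

    def insert(self, key, value):
        size = len(self.slots)
        index = self.hash_func(key) % size
        collisions = 0
        # Walk forward until we find a free slot or the same key.
        while self.slots[index] is not None and self.slots[index][0] != key:
            collisions += 1
            if collisions > self.limit:
                raise TooManyCollisions(key)
            index = (index + 1) % size
        self.slots[index] = (key, value)


def make_seeded_hash(seed):
    """Return a hash function keyed by `seed` (bytes, at most 64 long)."""
    def seeded(key):
        digest = hashlib.blake2b(repr(key).encode(), key=seed, digest_size=8)
        return int.from_bytes(digest.digest(), "little")
    return seeded


# "Unfortunate" but innocent data: small ints hash to themselves, so every
# multiple of the table size lands in slot 0 with the unseeded hash.
keys = [i * 2048 for i in range(1002)]

counting = CountingTable()
try:
    for k in keys:
        counting.insert(k, None)
except TooManyCollisions as exc:
    print("counting gave up on key", exc.args[0])

# The same data with a secret seed spreads out and inserts without trouble.
randomized = CountingTable(hash_func=make_seeded_hash(b"per-process secret"))
for k in keys:
    randomized.insert(k, None)
```

With counting, the same input fails every time no matter who supplied it;
with a seed, an attacker would first have to learn the secret.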