[Python-Dev] Counting collisions for the win

Fri Jan 20 20:04:21 CET 2012

Even if a MemoryError is raised, I believe that is still a fundamental change in the documented contract of the dictionary API. I don't believe there is a way to fix this without breaking someone's application. The major difference I see between the two solutions is that counting will break the applications of people who are otherwise following the documented API contract of dictionaries, while randomization will break the applications of people who are violating it.
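
To make the counting concern concrete, here is a minimal sketch of a toy open-addressing table with a collision limit. It is not CPython's dict implementation and not the proposed patch; the names `CountingTable` and `TooManyCollisions`, the table size, and the limit of 8 are all made up for illustration.

```python
class TooManyCollisions(Exception):
    """Hypothetical exception for this sketch; not part of Python."""


class CountingTable:
    """Toy open-addressing hash table that gives up after too many probes."""

    def __init__(self, size=64, max_collisions=8):
        self.slots = [None] * size
        self.max_collisions = max_collisions

    def insert(self, key, value):
        """Store the pair and return how many occupied slots were skipped."""
        size = len(self.slots)
        index = hash(key) % size
        for collisions in range(size):
            slot = self.slots[index]
            if slot is None or slot[0] == key:
                self.slots[index] = (key, value)
                return collisions
            if collisions >= self.max_collisions:
                # The caller did nothing wrong by the dict contract,
                # yet the insert fails anyway.
                raise TooManyCollisions(repr(key))
            index = (index + 1) % size  # linear probing
        raise TooManyCollisions("table is full")


if __name__ == "__main__":
    table = CountingTable()
    # Record ids that happen to be multiples of the table size all
    # start probing at slot 0, with no attacker involved.
    for record_id in range(0, 64 * 20, 64):
        try:
            table.insert(record_id, "row")
        except TooManyCollisions as exc:
            print("insert failed for key", exc)
            break
```

With these made-up parameters the keys 0, 64, 128, ... all land on the same starting slot, so an insert fails once the probe limit is reached, even though the input is ordinary data and the calling code uses only documented behaviour.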

Personally I feel that the lesser of two evils is to reward those who followed the documentation, and not reward those who didn't.

So +1 for randomization as the only option in 3.3, and off by default with a flag or environment variable in bugfix releases. I think it's the only way to proceed that won't hurt people who have followed the documented behavior.
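
As a rough illustration of what "following the documented behavior" means here, compare two hypothetical helpers. The function names and the sample data are invented for this sketch, and a set stands in for any container whose iteration order comes from hash values rather than from a documented guarantee.

```python
def render_unstable(tags):
    """Joins in iteration order, which is not a documented guarantee."""
    return ",".join(tags)


def render_stable(tags):
    """Sorts explicitly, so the result does not depend on hash values."""
    return ",".join(sorted(tags))


if __name__ == "__main__":
    tags = {"spam", "eggs", "ham"}
    # May differ from run to run once string hashes are randomized.
    print(render_unstable(tags))
    # Same result on every run, platform, and implementation.
    print(render_stable(tags))
```

Code written like `render_stable` is unaffected by randomization. Code written like `render_unstable`, or a test that compares its output against a fixed string, is the kind that would break.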

On Friday, January 20, 2012 at 1:49 PM, Brett Cannon wrote:

> 
> 
> On Fri, Jan 20, 2012 at 13:15, Guido van Rossum <guido at python.org> wrote:
> > On Fri, Jan 20, 2012 at 5:10 AM, Barry Warsaw <barry at python.org> wrote:
> > > On Jan 20, 2012, at 01:50 PM, Victor Stinner wrote:
> > > 
> > > >Counting collisions doesn't solve this case, but it doesn't make the
> > > >situation worse than before. Raising an exception quickly is better
> > > >than stalling for minutes, even if I agree that it is not the best
> > > >behaviour.
> > > 
> > > ISTM that adding the possibility of raising a new exception on dictionary
> > > insertion is *more* backward incompatible than changing dictionary order,
> > > which for a very long time has been known to not be guaranteed.  You're
> > > running some application, you upgrade Python because you apply all security
> > > fixes, and suddenly you're starting to get exceptions in places you can't
> > > really do anything about.  Yet those exceptions are now part of the documented
> > > public API for dictionaries.  This is asking for trouble.  Bugs will suddenly
> > > start appearing in that application's tracker and they will seem to the
> > > application developer like Python just added a new public API in a security
> > > release.
> > 
> > Dict insertion can already raise an exception: MemoryError. I think we should be safe if the new exception also derives from BaseException. We should actually seriously consider just raising MemoryError, since introducing a new built-in exception in a bugfix release is also very questionable: code explicitly catching or raising it would not work on previous bugfix releases of the same feature release.
> > 
> > > OTOH, if you change dictionary order and *that* breaks the application, then
> > > the bugs submitted to the application's tracker will be legitimate bugs that
> > > have to be fixed even if nothing else changed.
> > 
> > There are lots of things that are undefined according to the language spec (and quite possibly known to vary between versions or platforms or implementations like PyPy or Jython) but which we would never change in a bugfix release.
> > 
> > > So I still think we should ditch the paranoia about dictionary order changing,
> > > and fix this without counting.  A little bit of paranoia could creep back in
> > > by disabling the hash fix by default in stable releases, but I think it would
> > > be fine to make that a compile-time option.
> > 
> > I'm sorry, but I don't want to break a user's app with a bugfix release and say "haha your code was already broken you just didn't know it".
> > 
> > Sure, the dict order already varies across Python implementations, possibly across 32/64 bits or operating systems. But many organizations (I know a few :-) have a very large installed software base, created over many years by many people with varying skills, that is kept working in part by very carefully keeping the environment as constant as possible. This means that the target environment is much more predictable than it is for the typical piece of open source software.
> > 
> > Sure, a good Python developer doesn't write apps or tests that depend on dict order. But time and again we see that not everybody writes perfect code every time. Especially users writing "in-house" apps (as opposed to frameworks shared as open source) are less likely to always use the most robust, portable algorithms in existence, because they may know with much more certainty that their code will never be used on certain combinations of platforms. For example, I rarely think about whether code I write might not work on IronPython or Jython, or even CPython on Windows. And if something I wrote suddenly needs to be ported to one of those, well, that's considered a port and I'll just accept that it might mean changing a few things.
> > 
> > The time to break a dependency on dict order is not with a bugfix release but with a feature release: those are more likely to break other things as well anyway, and users are well aware that they have to test everything and anticipate having to fix some fraction of their code for each feature release. OTOH we have established a long and successful track record of conservative bugfix releases that don't break anything. (I am aware of exactly one thing that was broken by a bugfix release in application code I am familiar with.)
> 
> Why can't we have our cake and eat it too?
> 
> Can we do hash randomization in 3.3 and use the hash count solution for bugfix releases? That way we get a basic fix into the bugfix releases that won't break people's code (hopefully) but we go with a more thorough (and IMO correct) solution of hash randomization starting with 3.3 and moving forward. We aren't breaking compatibility in any way by doing this since it's a feature release anyway where we change tactics. And it can't be that much work since we seem to have patches for both solutions. At worst it will make merging commits harder for those files affected by the patches, but that will most likely be isolated and not a common collision (and less of an issue once 3.3 is released later this year).
> 
> I understand the desire to keep backwards-compatibility, but collision counting could cause an error in some random input that someone didn't expect to cause issues whether they were under a DoS attack or just had some unfortunate input from private data. The hash randomization, though, is only weak if someone is attacked, not if they are just using Python with their own private data. 
