When convert two sets with the same elements to lists, are the lists always going to be the same?
Terry Reedy
tjreedy at udel.edu
Sat May 5 00:28:40 EDT 2012
Peng, I actually am thinking about it.
Underlying problem: while unordered means conceptually unordered as far
as the collection goes, the items in the collection, if homogenous
enough, may have a natural order, which users find hard to ignore. Even
if not comparable, an implementation such as CPython that uses linear
sequential memory will impose some order. Even if the implementation
uses unordered (holographic?) memory, order will be imposed to iterate,
as when creating a serialized representation of the collection. Abstract
objects, concrete objects, and serialized representations are three
different things, but people tend to conflate them.
Order consistency issues: if the unordered collection is iterated, when
can one expect the order to be the same? Theoretically, essentially
never, except that iterating dicts by keys, values, or key-value pairs
is guaranteed to be consistent, which means that re-iterating has to be
consistent. I actually think the same might as well be true for sets,
although there is no doc that says so.
If two collections are equal, should the iteration order be the same? It
has always been true that if hash values collide, insertion order
matters. However, a good hash function avoids hash collisions as much as
possible in practical use cases. Without doing something artificial, as
I did with the example, collisions should be especially rare on 64-bit
builds. If one collection has a series of additions and deletions so
that the underlying hash table has a different size than an equal
collection build just from insertions, then order will also be different.
If the same collection is built by insertion in the same order, but in
different runs, bugfix versions, or language versions, will iteration
order by the same? Historically, it has been for CPython for about a
decade, and people has come to depend on that constancy, in spite of
warning not to. Even core developers have not been immune, as the
CPython test suite has a few set or dict iteration order dependencies
until it was recently randomized.
Late last year, it became obvious that this constancy is a practical
denial-of-service security hole. The option to randomize hashing for
each run was added to 2.6, 2.7, 3.1, and 3.2. Randomized hashing by run
is part of 3.3. So some of the discussion above is obsolete. The example
I gave only works for that one run, as hash('a') changes each run. So
iteration order now changes with each run in fact as well as in theory.
For the doc, the problem is what to say and where without being
repetitous (and to get multiple people to agree ;-).
--
Terry Jan Reedy
More information about the Python-list
mailing list