Tim, thank you for the reference.

(If anyone has tips for quoting emails that were sent before I joined the mailing list, I'd appreciate them; I will try to include the relevant context in mine.)


I will try to address some of the points from that discussion, explain why they aren't satisfying to me, and argue why Python should (still) strive for reliably ordered sets.

[Tim Peters <tim.peters@gmail.com>]
> "The problem" is that deletions leave _two_ kinds of holes: one in the hash table, and another in the contiguous vector.
> The latter holes cannot be filled with subsequent new hash+key+value records because that would break insertion order.
...
> Algorithms doing a lot of mixing of adds and deletes seem a lot more common for sets than for dicts, and so the ordered dict's basic implementation _approach_ is a lot less suitable for sets.
...
> Note: the holes left by deletions _wouldn't_ be "a problem" _except_ for maintaining insertion order.
> If we were only after the memory savings, then on deletion "the last" record in the contiguous array could be moved into the hole at once, leaving the array hole-free again. But that also changes the order. IIRC, that's what the original "compact dict" _did_ do.

When a large destructive operation is performed (e.g. computing some difference of sets), there could be a `set_fill_in_holes()` routine, called whenever `num_entries_including_holes > real_num_entries * FILL_IN_FAC`, that rebuilds the entry array without the holes.

Naively, that rebuild takes O(N) time, but if it is done alongside an operation that is already O(N) (e.g. the difference of two sets of similar cardinality), the cost may quickly disappear (especially since no 'expensive' operations like recomputing hashes are required) and not affect performance. Specifically, I'd bet it would have a small constant factor, since all it needs is a simple O(N) loop and perhaps O(N) scratch space; it is just shuffling bytes and rewriting indices.

The problem may come when user code performs repeated additions and removals of individual elements (e.g. the user writes `for elem in B: A.remove(elem)` instead of `A -= set(B)`). This would make the occasional single deletion O(N), which may be unacceptable; however, if the rebuild is triggered at a fixed ratio (say, once the number of entries including holes reaches twice the number of real entries), it would only run every ~N deletions, which amortizes the cost to O(1 + N/N) = O(1) per deletion.
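To make that concrete, here is a toy, pure-Python model of the idea (the class and `FILL_IN_FAC` names are just for illustration, not CPython internals): deletions leave holes in an insertion-ordered entry array, and a compaction pass runs only once holes dominate, so its O(N) cost is amortized over the preceding deletions.

```python
FILL_IN_FAC = 2  # compact once slots (including holes) exceed 2x live entries

class ToyOrderedSet:
    """Toy model only; cannot store None, since None marks a hole."""

    def __init__(self):
        self._entries = []   # insertion-ordered slots; None marks a hole
        self._index = {}     # element -> position in self._entries
        self._live = 0       # number of non-hole slots

    def add(self, x):
        if x not in self._index:
            self._index[x] = len(self._entries)
            self._entries.append(x)
            self._live += 1

    def remove(self, x):
        self._entries[self._index.pop(x)] = None  # leave a hole
        self._live -= 1
        if len(self._entries) > self._live * FILL_IN_FAC:
            self._fill_in_holes()

    def _fill_in_holes(self):
        # O(N) compaction: drop the holes, keep insertion order, rebuild index
        self._entries = [x for x in self._entries if x is not None]
        self._index = {x: i for i, x in enumerate(self._entries)}

    def __iter__(self):
        return (x for x in self._entries if x is not None)
```

With `FILL_IN_FAC = 2`, a burst of individual `remove()` calls only triggers compaction after roughly half the slots have become holes, which is exactly the every-~N-deletions amortization described above.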

> {1, 2} | {3, 4, 5, 6, 7}
> [what should the above] return as ordered sets? Beats me.;
> The imbalance between the operands' cardinalities can be arbitrarily large, and "first make a copy of the larger one, then loop over the smaller one" is the obvious way to implement union to minimize the number of lookups needed.

I must agree that the 'obvious' result (given that we are speaking of ordered-by-insertion sets) is `{1, 2, 3, 4, 5, 6, 7}`, since the operands appear in left-to-right order on the screen (and Python is generally left-to-right-centric: lines start on the left, etc.).
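For what it's worth, that left-to-right, first-occurrence-wins union can already be approximated today with insertion-ordered dict keys; a minimal sketch (`ordered_union` is just an illustrative helper, not a proposed API, and it takes lists because today's sets have no defined order):

```python
def ordered_union(*iterables):
    # dict keys preserve insertion order (Python 3.7+), so duplicates keep
    # the position of their first occurrence, scanning operands left to right
    return list(dict.fromkeys(x for it in iterables for x in it))

print(ordered_union([1, 2], [3, 4, 5, 6, 7]))  # [1, 2, 3, 4, 5, 6, 7]
```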

However, you make good points about performance concerns; big-O notation is not what users (and indeed benchmarks, applications, etc.) care about -- what they care about is `end_time - start_time`. As long as union stays O(N) and membership testing, insertion, and removal stay O(1), Python implementations will work fine and standard algorithms won't take preposterous amounts of time (whereas if membership testing magically became O(N), many algorithms would degrade to O(N^2) or worse, making large problems intractable). Aside from those travesties, which should obviously be avoided, Python must weigh constant-factor performance (i.e. being able to use optimizations like the ones described in your post) against reliable semantics (the well-defined result and iteration order of ordered-by-insertion sets).

This, however, is ultimately an implementation detail, and if other semantics for 'set' make more sense and avoid ambiguity (such as a defined order), I would have to side with removing the ambiguity at the cost of, say, 10% of speed. I know I'm not in the position to make that choice, however, so I think some benchmarks could be useful to show just how different the performance really is.

I just thought of another pro of ordered-by-insertion sets. Consider the common idiom for ordering by value, `sorted(myset)`; if no iteration order is specified, the order may very well be arbitrary (in the general case; I know that integers, for example, sort the same as their hashes and so will likely come out in some rough order). If `myset` were constructed with Python enforcing ordered-by-insertion sets, then special attention could be paid to generating the elements of `myset` in sorted or near-sorted order, which may lead to asymptotically better, or at least constant-factor better, performance from `sorted()` (especially with `timsort`, which performs well when `myset` is generated from runs of sorted values. But I'm sure you already know that ;)).
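As a rough illustration of the timsort point (not a claim about any particular set implementation): sorting data that already consists of long pre-sorted runs is substantially cheaper than sorting the same elements in arbitrary order, so an iteration order that preserves how the elements were generated can only help `sorted()`:

```python
import random
import timeit

run_ordered = list(range(1_000_000))   # one long pre-sorted run
arbitrary = run_ordered[:]
random.shuffle(arbitrary)              # same elements, arbitrary order

# timsort detects the existing run and only has to verify/merge it ...
print(timeit.timeit(lambda: sorted(run_ordered), number=10))
# ... while the shuffled input needs a full O(N log N) sort
print(timeit.timeit(lambda: sorted(arbitrary), number=10))
```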


[Serhiy Storchaka]
> But ordered set implementation is not more compact that the current set implementation (because for dicts the item is 3-word, but for sets it is 2-word).
> Also the current set implementation has some optimizations that the dict implementation never had, which will be lost in the ordered set implementation

A 'compact dictionary' implementation requires `N * 3 * sizeof(intptr_t)` bytes (assuming hashes take up the same size as a pointer) for the (hash, key, val) record per entry, plus `N_buckets * sizeof(idx_t)` bytes for the index table, where `idx_t` is the smallest (signed) integral type that can represent `N` (plus -1, -2 for a few special values), and `N_buckets = N / load()` for a given load factor.

This costs (on most 64-bit systems) `load() * N_buckets * 3 * 8 + 1 * N_buckets` bytes for dicts of length <= 127, `load() * N_buckets * 3 * 8 + 2 * N_buckets` bytes for dicts of length <= SHRT_MAX, and so on. A more naive dictionary would simply take `N_buckets * 3 * 8` bytes.

A similar strategy could be used for sets (which would, as with dicts, preserve the order of the entry array), requiring `load() * N_buckets * 2 * 8 + 1 * N_buckets` bytes (and so on for the other thresholds), versus a naive `N_buckets * 2 * 8` byte requirement.
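As a sanity check on those formulas, here is a quick back-of-the-envelope calculator (the 2/3 load factor, 8-byte word, and index-width thresholds are assumptions for illustration, not CPython's exact parameters):

```python
def index_width(n):
    # smallest signed integer type that can index n entries
    if n <= 127:
        return 1
    if n <= 32767:        # SHRT_MAX
        return 2
    if n <= 2**31 - 1:
        return 4
    return 8

def compact_bytes(n, words_per_entry, load=2/3):
    # dense entry array (hash+key[+value]) plus a small-int index table
    n_buckets = int(n / load)
    return n * words_per_entry * 8 + n_buckets * index_width(n)

def naive_bytes(n, words_per_entry, load=2/3):
    # classic open-addressing table: full-width entries in every bucket
    n_buckets = int(n / load)
    return n_buckets * words_per_entry * 8

for n in (100, 10_000, 1_000_000):
    print(n,
          compact_bytes(n, 3), naive_bytes(n, 3),   # dict: (hash, key, value)
          compact_bytes(n, 2), naive_bytes(n, 2))   # set:  (hash, key)
```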

Therefore, I push back on the idea that ordered-by-insertion sets are not more memory efficient -- they can use the same optimizations that ordered-by-insertion dictionaries currently use (albeit with tighter margins, since only 2 words are saved per empty bucket instead of 3, as with dictionaries). If someone can point out why I'm either being completely ignorant or somehow wrong, I'd love to hear it ;).

Thanks,


On Mon, Aug 24, 2020 at 5:17 PM Tim Peters <tim.peters@gmail.com> wrote:
[Cade Brown <brown.cade@gmail.com>, suggests ordered sets]

FYI, last time this went around was December, with over 100 msgs here:

https://mail.python.org/archives/list/python-dev@python.org/thread/AEKCBGCKX23S4PMA5YNDHFJRHDL4JMKY/#AEKCBGCKX23S4PMA5YNDHFJRHDL4JMKY