Reliable Ordering Of Sets

Hello all, I have a suggestion/idea for the Python standard (and/or a CPython implementation detail, considering its memory impact): having sets behave similarly to dictionaries in that they preserve first-appearance order (i.e. as dictionaries are ordered by the key that was first inserted). As the Python standard says, a 'set' is by definition unordered. Yet, internally the elements must be stored in some order, and when the set is iterated over, converted to a string, etc., some ordering must be used. Right now, that order is unspecified, which can lead to surprising results, as I noticed:
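A minimal illustration (the exact output below is a CPython implementation detail, not anything the language promises, and it can differ between versions, builds, and element types):

```python
# Iteration order follows the internal hash table layout, not insertion order.
s = {2, 1, 8}
print(s)          # on current CPython this typically prints {8, 1, 2}, not {2, 1, 8}
print(list(s))    # [8, 1, 2] -- any code treating the set as an iterable sees this order
```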
While, obviously, things like `A == B` are not affected by this, many algorithms may change their output if sets are treated as iterables (again, the argument could be made, as it was for `dict` as well, that they should not treat the order as relevant at all). Further, I believe that having a specification of the order, with some guarantees, could be useful and important to the cohesiveness of Python as a whole. Despite the obviously unordered nature of sets, I still think it is worth making them behave like `dict` objects (which are also, in some sense, 'unordered' by nature, and yet Python still specifies the order for 3.7+).

I would like to suggest that `set` objects be ordered by insertion, so that:

* Sets have a defined, repeatable order, for maximum reproducibility (assuming the code generates the set in a stable way).
* Tests outside of Python can do string comparisons and expect the same output each time (as with `dict` objects generated in a predictable way); see the illustration after this message.
* On a high-level note, Python specifies more exact behavior. It is my belief (and I'm sure many others share it) that unspecified/implementation-dependent/otherwise-undependable features should be eliminated and replaced with exact semantics that do not surprise users.

Thoughts?

Thanks,
~ Cade Brown
Working on MAGMA <https://icl.cs.utk.edu/magma/>
http://cade.site
http://chemicaldevelopment.us
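As an illustration of the reproducibility bullet above: with the default hash randomization for strings, today's set repr can change from one interpreter run to the next, while a dict built in a fixed order prints the same way every time. A rough way to observe this (the exact counts depend on the platform and on how the hash seeds happen to fall):

```python
import subprocess
import sys

# Compare how many distinct outputs five separate interpreter runs produce for
# a set of strings (order subject to hash randomization) vs. a dict built from
# the same strings (order fixed by insertion since Python 3.7).
cmd_set = [sys.executable, "-c", "print({'red', 'green', 'blue'})"]
cmd_dict = [sys.executable, "-c", "print(dict.fromkeys(['red', 'green', 'blue']))"]

for cmd in (cmd_set, cmd_dict):
    outputs = {subprocess.run(cmd, capture_output=True, text=True).stdout
               for _ in range(5)}
    print(cmd[-1], "->", len(outputs), "distinct output(s) in 5 runs")
```

Typically the set command yields more than one distinct line while the dict command yields exactly one, which is what makes string-level comparisons in external test harnesses fragile for sets today.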

[Cade Brown <brown.cade@gmail.com>, suggests ordered sets] FYI, last time this went around was December, with over 100 msgs here: https://mail.python.org/archives/list/python-dev@python.org/thread/AEKCBGCKX...

Tim, thank you for the reference. (If anyone has any tips for quoting emails which were sent before I joined the mailing list, that would be appreciated; I will try to add the relevant details in mine.) I will try to address some of the points from that discussion, why they aren't satisfying to me, and why Python should (still) strive for reliably ordered sets.

[Tim Peters <tim.peters@gmail.com>]
maintaining insertion order. that's what the original "compact dict" _did_ do.

When a large destructive operation runs (e.g. computing some difference of sets), there could be a `set_fill_in_holes()`, called when `num_entries_including_holes > real_num_entries * FILL_IN_FAC`, that would rebuild the entries and remove the holes. Naively, it takes O(N) time, but if it is done alongside an operation that is already O(N) (e.g. the difference of two sets of similar cardinality), the cost may quickly disappear (especially since no 'expensive' operations like calculating hashes are required) and not affect performance. Specifically, I bet it would have a small constant factor, since only a simple O(N) loop and perhaps O(N) scratch space are required; it would just be shuffling bytes and rewriting indices.

The problem may come when user code performs repeated additions and removals of individual elements (e.g. the user writes `for elem in B: A.remove(elem)` instead of `A -= set(B)`). This would make the occasional single deletion O(N), which may be unacceptable; however, if compaction is triggered at a fixed ratio (say, once the number of entries including holes reaches twice the number of real entries), then it would only run every ~N deletions, which amortizes the cost to O(1 + N/N) = O(1).
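A rough sketch of the compaction idea being described (hypothetical names: `FILL_IN_FAC`, `OrderedSetSketch`, and the use of a plain dict for the hash lookup are assumptions made purely to keep the sketch short; none of this is how CPython's `setobject.c` works):

```python
FILL_IN_FAC = 2  # hypothetical tuning ratio: compact once holes dominate

class OrderedSetSketch:
    """Insertion-ordered set with lazy hole removal (illustration only)."""

    _HOLE = object()  # sentinel marking a deleted entry

    def __init__(self, iterable=()):
        self._entries = []   # insertion-ordered items, holes included
        self._index = {}     # item -> position in self._entries
        self._live = 0       # number of non-hole entries
        for item in iterable:
            self.add(item)

    def add(self, item):
        if item not in self._index:
            self._index[item] = len(self._entries)
            self._entries.append(item)
            self._live += 1

    def discard(self, item):
        pos = self._index.pop(item, None)
        if pos is None:
            return
        self._entries[pos] = self._HOLE
        self._live -= 1
        # Amortized O(1): a full O(N) compaction runs only every ~N removals.
        if len(self._entries) > self._live * FILL_IN_FAC:
            self._compact()

    def _compact(self):
        # One linear pass; no hashes are recomputed, only positions rewritten.
        self._entries = [e for e in self._entries if e is not self._HOLE]
        self._index = {e: i for i, e in enumerate(self._entries)}

    def __contains__(self, item):
        return item in self._index

    def __iter__(self):
        return (e for e in self._entries if e is not self._HOLE)

    def __len__(self):
        return self._live
```

Running `for elem in B: A.discard(elem)` against this sketch triggers a full compaction only after roughly half the entries have become holes, which is the amortization argument above in code form.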
I must agree that the 'obvious' result (given that we are speaking of ordered-by-insertion sets) is `{1, 2, 3, 4, 5, 6, 7}`, since the elements appear in left-to-right order on the screen (and Python is specified as left-to-right-centric: lines start on the left, etc.). However, you make good points about performance concerns; big-O notation is not what users (and indeed benchmarks, applications, etc.) care about; what they care about is `end_time - start_time`. As long as union stays O(N) and testing, insertion, and removal stay O(1), Python implementations will work fine and not cause standard algorithms to take preposterous amounts of time (whereas if membership testing magically became O(N), many algorithms would degrade to O(N^2) or worse, leaving large problems impossible).

Aside from those travesties, which should obviously be avoided, Python must weigh constant-factor performance (i.e. being able to use the optimizations described in your post) against reliable semantics (the well-defined result and iteration order of ordered-by-insertion sets). This, however, is ultimately an implementation detail, and if other semantics for `set` make more sense and avoid ambiguity (such as order), I would have to side with removing ambiguity at the loss of, say, 10% of speed. I know I'm not in the position to make that choice, however, so I think some tests could be useful to show just how different the performance is.

I just thought of another pro of ordered-by-insertion sets. Consider the common idiom to order by value, `sorted(myset)`; if no order is specified, the iteration order may well be arbitrary (in the general case; I know that integers, for example, sort the same as their hashes and so will likely come out in some order). If `myset` were constructed with Python enforcing ordered-by-insertion sets, then special attention could be paid to generating the elements of `myset` in sorted or near-sorted order, which may lead to asymptotically or at least constant-factor better performance under `sorted()` (especially with timsort, which performs well when `myset` is generated from runs of sorted values. But I'm sure you already know that ;)).
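A rough illustration of that last point, using plain lists rather than sets (since today's sets cannot preserve a generation order; the sizes and repetition counts are arbitrary):

```python
import random
import timeit

# timsort exploits existing runs: sorting values that are already in
# (near-)sorted order is much cheaper than sorting the same values shuffled.
# An insertion-ordered set built from near-sorted data would hand sorted()
# the first kind of input; an arbitrary-order set hands it the second.
n = 1_000_000
in_order = list(range(n))
shuffled = random.sample(in_order, n)

print("already ordered:", timeit.timeit(lambda: sorted(in_order), number=10))
print("shuffled:       ", timeit.timeit(lambda: sorted(shuffled), number=10))
```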
[Serhiy Storchaka]

A 'compact dictionary' implementation requires `N * 3 * sizeof(intptr_t)` bytes (assuming hashes take up the same size as a pointer) for the (hash, key, val) triple per entry, plus `N_buckets * sizeof(idx_t)`, where `idx_t` is the smallest (signed) integral type that can represent `N` (and -1, -2 for some special values), and `N_buckets = N / load()`. This costs (on most 64-bit systems) `load() * N_buckets * 3 * 8 + 1 * N_buckets` bytes for dicts of length <= 127, `load() * N_buckets * 3 * 8 + 2 * N_buckets` bytes for dicts of length <= SHRT_MAX, and so on. A more naive dictionary would simply take `N_buckets * 3 * 8` bytes.

A similar strategy could be used for sets (which would, as with dicts, preserve the order of the entries array), requiring `load() * N_buckets * 2 * 8 + 1 * N_buckets` bytes (and so on for the other thresholds), versus a naive requirement of `N_buckets * 2 * 8` bytes. Therefore, I push back on the idea that ordered-by-insertion sets cannot be more memory efficient: they can use the same optimizations that ordered-by-insertion dictionaries are currently using (albeit with tighter margins, since only 2 words are saved per empty bucket instead of 3, as with dictionaries). If someone can show me why I'm either being completely ignorant or somehow wrong, I'd love to hear it ;).

Thanks,
~ Cade Brown
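To make the arithmetic above concrete, here is a small hypothetical calculator for the two layouts being compared (the 2/3 load factor, the exact type thresholds, and the function names are assumptions for illustration; real CPython tables are sized in powers of two and tuned differently):

```python
import math

def index_entry_size(n):
    # Smallest signed integer type that can index n entries (plus sentinels).
    if n <= 127:
        return 1
    if n <= 32767:          # SHRT_MAX
        return 2
    if n <= 2**31 - 1:
        return 4
    return 8

def compact_set_bytes(n, load=2/3, word=8):
    # Dense (hash, key) entries array plus a sparse index table of small ints.
    n_buckets = math.ceil(n / load)
    return n * 2 * word + n_buckets * index_entry_size(n)

def naive_set_bytes(n, load=2/3, word=8):
    # Open-addressing table storing (hash, key) directly in every bucket.
    n_buckets = math.ceil(n / load)
    return n_buckets * 2 * word

for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: compact {compact_set_bytes(n):>12,} B"
          f"  naive {naive_set_bytes(n):>12,} B")
```

Under these assumptions the compact layout comes out smaller at every size, which is the point being made; whether the margin survives CPython's real sizing rules is a separate question.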

The main purpose of sets is FAST membership testing. Everything I've seen in prior discussions convinced me that preserving insertion order would slow that down. If membership testing lost even 10% speed, I would be -1000 on the idea. If someone comes up with actual working code that is no more than 1% slower, I'd move to -0. If speed miraculously got faster in such a rewrite (analogous to dicts, whose ordering was initially a happy side-effect, but I've seen explanations of why that wouldn't happen here), I'd be +1. But in the last case, the entirety of my support would be for the speedup, and exactly none of it would be for the ordering.

When I quoted 10% as a rough figure I was referring more to set operations like union, etc., for which choosing the larger set may give a very real performance advantage. For membership tests that come back false, the compact layout may even be quicker, as less memory is accessed: the buckets are smaller (1, 2, 4, or 8 bytes versus 8 or 16 bytes on 32-bit vs. 64-bit builds). But to compare the entry's hash, an indirection into the entries array is required... The heuristics become less clear when you consider how many collisions there are. They may also change under different tuning parameters (e.g. a lower load factor would reduce the pointer dereferences required to compare items, but at the cost of more memory). It may very well be that the target load factor only needs to decrease a small amount (so that the ordered implementation still uses less memory) while the quick rejections make overall performance better. I don't know how 'objective' we can be here without discussing the specific data sets people may be using.
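For the 'specific data sets' point, a rough, hypothetical harness along these lines could at least anchor the discussion in numbers (the names, sizes, and string keys are arbitrary choices; this only measures today's built-in set):

```python
import random
import timeit

def bench_membership(n=100_000, lookups=1_000, trials=5):
    # Time membership tests that hit and tests that miss against a set of strings.
    elems = [f"key-{i}" for i in range(n)]
    s = set(elems)
    hits = random.sample(elems, lookups)
    misses = [e + "-absent" for e in hits]
    t_hit = min(timeit.repeat(lambda: [e in s for e in hits],
                              number=100, repeat=trials))
    t_miss = min(timeit.repeat(lambda: [e in s for e in misses],
                               number=100, repeat=trials))
    print(f"n={n}: hits {t_hit:.4f}s, misses {t_miss:.4f}s "
          f"({lookups} lookups x 100 reps, best of {trials})")

if __name__ == "__main__":
    bench_membership()
```

Running the same harness against a prototype ordered set would give the kind of before/after comparison being asked for.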

Well, the other element for me is that I would never want an order GUARANTEE to preclude a later, even faster version. I know there are different test scenarios, but I think the first step for this to be remotely reasonable is an ordered implementation that is faster in all reasonable test cases. For my own uses, I typically use sets with hundreds, maybe thousands of elements, often strings. With tens of elements I care about speed less. I myself rarely work with sets of millions of elements.

Why don't you simply use dicts instead of sets if you want to benefit from the insertion-order feature? Note that there were good reasons for having dict come closer to OrderedDict: dicts are used by the interpreter to manage namespaces, and namespace definitions are ordered in Python code (they are read top to bottom, or left to right). It was desirable to have access to this order, to reduce the number of hacks needed to recover this information before Python 3.7. Sets aren't used in such a way. Their main purpose is to make membership tests fast, even when the set content changes frequently.

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Experts (#1, Aug 25 2020)
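For reference, the dict-based idiom being suggested looks roughly like this (a sketch of the workaround, not a full set replacement; operations like intersection and difference need extra code):

```python
# dict preserves insertion order (guaranteed since Python 3.7) and collapses
# duplicate keys, so a dict with ignored values acts as an insertion-ordered set.
ordered = dict.fromkeys([3, 1, 2, 1, 3])
print(list(ordered))       # [3, 1, 2]

ordered[7] = None          # "add"
ordered.pop(1, None)       # "discard"
print(3 in ordered)        # membership test -> True

other = dict.fromkeys([2, 9])
union = {**ordered, **other}   # keeps first-appearance order
print(list(union))         # [3, 2, 7, 9]
```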

participants (5)
- Cade Brown
- David Mertz
- Jim Fasarakis-Hilliard
- M.-A. Lemburg
- Tim Peters