Should set objects maintain insertion order too?
As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?

I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready". The existing set object would work fine here, except that it doesn't maintain insertion order. That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult. I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.

ISTM that all the reasons why dicts should maintain insertion order also apply to sets, and so I think we should probably do this. Your thoughts?

//arry/

p.s. My dim recollection is that the original set object was a hacked-up copy of the dict object, removing the ability to store values. So there's some precedent for dicts and sets behaving similarly. (Note: non-pejorative use of "hacked-up" here, that was a fine way to start.)
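For concreteness, here is roughly what that dict-with-None-values workaround looks like (the job names are made up for illustration):

    ready = {}                  # dict standing in for an insertion-ordered "set"
    ready["job-a"] = None       # add a job
    ready["job-b"] = None
    print("job-a" in ready)     # fast membership test -> True
    del ready["job-a"]          # drop a job marked "not ready"
    print(list(ready))          # iterate in insertion order -> ['job-b']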
[Larry Hastings <larry@hastings.org>]
As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either.
If they ever appear to, it's an accident you shouldn't rely on.
Should they?
From Raymond, 22 Dec 2017: https://twitter.com/raymondh/status/944454031870607360

"""
Sets use a different algorithm that isn't as amenable to retaining insertion order. Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.
"""

Which is more an answer to "will they?" than "should they?" ;-)
That's not a reason why we shouldn't do it. Python is the language where speed, correctness, and readability trump performance every time. We should decide what semantics we want, then do that, period. And anyway, it seems like some genius always figures out how to make it fast sooner or later ;-)

I can believe that, given the current implementation, sets might give up some of their optimizations if they maintained insertion order. But I don't understand what "flexibility" they would lose. Apart from being slower, what would set objects give up if they maintained insertion order? Are there operations on set objects that would no longer be possible?

//arry/
On Dec 15, 2019, at 6:48 PM, Larry Hastings <larry@hastings.org> wrote:
As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?
I don't think they should. Several thoughts:

* The corresponding mathematical concept is unordered and it would be weird to impose such an order.
* You can already get membership testing while retaining insertion ordering by running dict.fromkeys(seq) (sketched after this message).
* Set operations have optimizations that preclude giving a guaranteed order (for example, set intersection loops over the smaller of the two input sets).
* To implement ordering, set objects would have to give up their current membership-testing optimization that exploits cache locality in lookups (it looks at several consecutive hashes at a time before jumping to the next random position in the table).
* The ordering we have for dicts uses a hash table that indexes into a sequence. That works reasonably well for typical dict operations but is unsuitable for set operations, where some common use cases make interspersed additions and deletions (that is why the LRU cache still uses a cheaply updated doubly-linked list rather than deleting and reinserting dict entries).
* This idea has been discussed a couple of times before and we've decided not to go down this path. I should document this prominently, because it is inevitable that it will be suggested periodically; it is such an obvious thing to consider.

Raymond
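A quick sketch of the dict.fromkeys() approach from the second bullet (the input sequence is illustrative):

    >>> seq = ['b', 'a', 'c', 'a', 'b']
    >>> d = dict.fromkeys(seq)     # keeps first-insertion order; all values are None
    >>> 'a' in d                   # constant-time membership test
    True
    >>> list(d)
    ['b', 'a', 'c']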
Actually, for dicts the implementation came first.
-- --Guido (mobile)
On 12/15/19 8:25 PM, Raymond Hettinger wrote:
On Dec 15, 2019, at 6:48 PM, Larry Hastings <larry@hastings.org> wrote:
As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?

I don't think they should.
Several thoughts:
* The corresponding mathematical concept is unordered and it would be weird to impose such an order.
I disagree--I assert it's no more or less weird than imposing order on dicts, which we eventually decided to do.
* You can already get membership testing while retaining insertion ordering by running dict.fromkeys(seq).
I'm not sure where you're going with this. The argument "you can do this with other objects" didn't preclude us from, for example, adding ordering to the dict object. We already had collections.OrderedDict, and we decided to add the functionality to the dict object.
* Set operations have optimizations that preclude giving a guaranteed order (for example, set intersection loops over the smaller of the two input sets).
* To implement ordering, set objects would have to give up their current membership-testing optimization that exploits cache locality in lookups (it looks at several consecutive hashes at a time before jumping to the next random position in the table).
* The ordering we have for dicts uses a hash table that indexes into a sequence. That works reasonably well for typical dict operations but is unsuitable for set operations, where some common use cases make interspersed additions and deletions (that is why the LRU cache still uses a cheaply updated doubly-linked list rather than deleting and reinserting dict entries).
These are all performance concerns. As I mentioned previously in this thread, in my opinion we should figure out what semantics we want for the object, then implement those, and only after that should we worry about performance. I think we should decide the question "should set objects maintain insertion order?" literally without any consideration about performance implications.
* This idea has been discussed a couple of times before and we've decided not to go down this path. I should document this prominently, because it is inevitable that it will be suggested periodically; it is such an obvious thing to consider.
Please do! In the email quoted by Tim Peters earlier in the thread, you stated that set objects "lose their flexibility" if they maintain order. Can you be more specific about what you meant by that?

//arry/

p.s. To be clear, I'm not volunteering to add this feature to the set object implementation--that's above my pay grade. But I'm guessing that, if we decided we wanted these semantics for the set object, someone would volunteer to implement it.
On Mon, Dec 16, 2019 at 1:33 PM Guido van Rossum <guido@python.org> wrote:
Actually, for dicts the implementation came first.
I once tried to implement an ordered set. Here is the implementation: https://github.com/methane/cpython/pull/23 Regards,
On 16.12.19 04:48, Larry Hastings wrote:
As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?
I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready". The existing set object would work fine here, except that it doesn't maintain insertion order. That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult. I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.
ISTM that all the reasons why dicts should maintain insertion order also apply to sets, and so I think we should probably do this. Your thoughts?
The current dict implementation is called a "compact dict" implementation because it saves memory in common cases; that was the initial reason for writing it. At the same time there was a need for ordered dicts for kwargs and class namespaces. We discovered that a slightly modified compact dict implementation preserves order, and that its possible drawback (a performance penalty) is negligible, if it exists at all.

But an ordered set implementation would not be more compact than the current set implementation (because a dict item is 3 words, while a set item is only 2 words). Also, the current set implementation has some optimizations that the dict implementation never had, which would be lost in an ordered set implementation. Take into account that sets are used far less than dicts, and that you can use a dict if you need an ordered set-like object.
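The footprint comparison is easy to measure rather than take on faith (the exact numbers vary across builds and versions, so none are quoted here):

    import sys
    n = 1000
    print(sys.getsizeof(set(range(n))))             # plain set: 2-word entries, sparse table
    print(sys.getsizeof(dict.fromkeys(range(n))))   # compact dict: 3-word entries + index array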
On 2019-12-16 06:00, Larry Hastings wrote: [...]
These are all performance concerns. As I mentioned previously in this thread, in my opinion we should figure out what semantics we want for the object, then implement those, and only after that should we worry about performance. I think we should decide the question "should set objects maintain insertion order?" literally without any consideration about performance implications.
Then this thread is missing arguments saying *why* ordered dicts are actually better, semantics-wise. Originally, making dicts ordered was all about performance (or rather memory efficiency, which falls in the same bucket). It wasn't added because it's better semantics-wise. Here's one (very simplified and maybe biased) view of the history of dicts:

* 2.x: Dicts are unordered, please don't rely on the order.
* 3.3: Dict iteration order is now randomized. We told you not to rely on it!
* 3.6: We now use an optimized implementation of dict that saves memory! As a side effect it makes dicts ordered, but that's an implementation detail, please don't rely on it.
* 3.7: Oops, people are now relying on dicts being ordered. Insisting on people not relying on it is battling windmills. Also, it's actually useful sometimes, and alternate implementations can support it pretty easily. Let's make it a language feature! (Later it turns out MicroPython can't support it easily. Tough luck.)

By itself, "we already made dicts do it" is not a great argument in the set *semantics* debate. Of course, it may turn out ordering was a great idea semantics-wise as well, but if I'm reading correctly, so far this thread has one piece of anecdotal evidence for that.
I don't know if this works the same for sets, but for dicts, this is the semantics that everyone has wanted for a long time. It makes doctests (and similar things) easier to write, reduces a source of nondeterministic failure, and removes a wart that everyone had to learn:
{"foo": 1, "bar": 2, "baz": 3} {'baz': 3, 'foo': 1, 'bar': 2}
It's essentially the same reason as why Python specifies that expressions are (in most cases) evaluated from left to right. Users don't realize that f()+g() might call g() before f(), and write code that assumes f() is called first -- the language should not disappoint them, optimization opportunities be damned.
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On 16Dec2019 0417, Petr Viktorin wrote:
Originally, making dicts ordered was all about performance (or rather memory efficiency, which falls in the same bucket.) It wasn't added because it's better semantics-wise. Here's one (very simplified and maybe biased) view of the history of dicts:
* 2.x: Dicts are unordered, please don't rely on the order.
* 3.3: Dict iteration order is now randomized. We told you not to rely on it!
* 3.6: We now use an optimized implementation of dict that saves memory! As a side effect it makes dicts ordered, but that's an implementation detail, please don't rely on it.
* 3.7: Oops, people are now relying on dicts being ordered. Insisting on people not relying on it is battling windmills. Also, it's actually useful sometimes, and alternate implementations can support it pretty easily. Let's make it a language feature! (Later it turns out MicroPython can't support it easily. Tough luck.)
For the record, we missed out on a very memory-efficient "frozendict" implementation because it can't maintain insertion order - Yury is currently proposing it as FrozenMap in PEP 603. https://discuss.python.org/t/pep-603-adding-a-frozenmap-type-to-collections/... Codifying semantics isn't always the kind of future-proofing we necessarily want to have :) Cheers, Steve
On Sun, Dec 15, 2019 at 11:28 PM Raymond Hettinger < raymond.hettinger@gmail.com> wrote:
* The corresponding mathematical concept is unordered and it would be weird to impose such an order.
I'm with Raymond in not wanting sets to maintain insertion (or any) order. Even though I don't doubt that Larry--and no doubt other folks, from time to time--have a use for an "ordered set," I feel like it is bad practice to encourage that way of thinking about sets and using them. Admittedly, I was only lukewarm about making an insertion-order guarantee for dictionaries too. But for sets I feel more strongly opposed. Although it seems unlikely now, if some improved implementation of sets had the accidental side effects of making them ordered, I would still not want that to become a semantic guarantee. That said, having OrderedSet in collections module would be fine by me. It might have different performance characteristics, but so what? It would be a different class that folks could use or not, depending on how they felt about its behavior and performance profile. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
I can’t think of a time when I’ve really needed sets to maintain insertion order, but thinking about it from a user’s perspective, it’s a natural leap from ordered dicts to ordered sets. At least I don’t immediately think about sets as their mathematical equivalent, but as sets were originally proposed: efficient collections of value-less keys. Before we had sets, it was common to use dicts with values of None to represent the same concept. I’m fine with a decision not to change the ordering semantics of sets, but we should update the Language Reference to describe the language guarantees for both sets and dicts. The Library Reference does document the ordering semantics for both dicts and sets, but I wouldn’t say that information is easy to find. Maybe we can make the latter more clear too. Cheers, -Barry
[Guido]
... the language should not disappoint them, optimization opportunities be damned.
I would like to distinguish between two kinds of "optimization opportunities": theoretical ones that may or may not be exploited some day, and those that CPython has _already_ exploited. That is, we don't have a blank slate here. As Raymond said, the set implementation already diverged from the dict implementation in fundamental ways for "go fast" reasons. How much are people willing to see set operations slow down to get this new feature? For me, "none" ;-) Really. I have no particular use for "ordered" sets, but have set-slinging code that benefits _today_ from the "go fast" work Raymond did for them.

Analogy: it was always obvious that list.sort() is "better" stable than not, but I ended up crafting a non-stable samplesort variant that ran faster than any stable sort anyone tried to write. For years. So we stuck with that, to avoid slowing down sorting across releases. The stable "timsort" finally managed to be _as_ fast as the older samplesort in almost all cases, and was very much faster in many real-world cases. "Goes faster" was the thing that really sold it, so much so that its increased memory use didn't count for much against it in comparison.

Kinda similarly, "ordered dicts" were (as has already been pointed out) originally introduced as a memory optimization ("compact" dicts), where the "ordered" part fell out more-or-less for free. The same approach wouldn't save memory for sets (or so I'm told), so the original motivation for compact dicts doesn't carry over.

So I'm +1 on ordered sets if and only if there's an implementation that's at least as fast as what we already have. If not, then, like ordered dicts evolved, offer a slower OrderedSet type in the `collections` module for those who really want it, and wait for magic ;-)

BTW, what should

    {1, 2} | {3, 4, 5, 6, 7}

return as ordered sets? Beats me ;-) The imbalance between the operands' cardinalities can be arbitrarily large, and "first make a copy of the larger one, then loop over the smaller one" is the obvious way to implement union to minimize the number of lookups needed. The speed penalty for, e.g., considering "the left" operand's elements' insertion orders to precede all "the right" operand's elements' insertion orders can be arbitrarily large.

The concept of "insertion order" just doesn't make well-defined sense to me for any operation that produces a set from multiple input sets, unless it means no more than "whatever order happens to be used internally to add elements to the result set". Dicts don't have anything like that, although dict.update comes close, but in that case the result is defined by mutating the dict via a one-at-a-time loop over the argument dict.
That looks quite interesting. It looks like the compact dict optimization applied to sets. I had the same idea :-) If it reduces the memory footprint, keeps insertion order, and has low performance overhead, it would be an interesting idea! Victor
-- Night gathers, and now my watch begins. It shall not end until my death.
On Mon 12/16/19, 9:59 AM, "David Mertz" <mertz@gnosis.cx> wrote:

Admittedly, I was only lukewarm about making an insertion-order guarantee for dictionaries too. But for sets I feel more strongly opposed. Although it seems unlikely now, if some improved implementation of sets had the accidental side effects of making them ordered, I would still not want that to become a semantic guarantee.

Eek… No accidental side effects whenever possible, please. People come to rely upon them (like that chemistry paper example[*]), and changing the assumptions results in a lot of breakage down the line. Changing the length of AWS identifiers (e.g. instances from i-1234abcd to i-0123456789abcdef) was a huge deal; even though the identifier length was never guaranteed, plenty of folks had created database schemata with VARCHAR(10) for instance ids, for example.

Break assumptions from day 1. If some platforms happen to return sorted results, sort the other platforms or toss in a sorted(key=lambda el: random.random()) call on the sorting platform. If you're creating custom identifiers, allocate twice as many bits as you think you'll need to store them. Yes, this is bad user code, but I'm all for breaking bad user code in obvious ways sooner rather than subtle ways later, especially in a language like Python.

[*] https://arstechnica.com/information-technology/2019/10/chemists-discover-cro...
On Mon, Dec 16, 2019, 7:35 PM David Cuthbert <dacut@kanga.org> wrote:
On Mon 12/16/19, 9:59 AM, "David Mertz" <mertz@gnosis.cx> wrote:
If some improved implementation of sets had the accidental side effects of
making them ordered, I would still not want that to become a semantic guarantee.
Eek… No accidental side effects whenever possible, please. People come to rely upon them (like that chemistry paper example[*]), and changing the assumptions results in a lot of breakage down the line.
[*] https://arstechnica.com/information-technology/2019/10/chemists-discover-cro...
This is interesting. But it's a programming error that Python, or any programming language, cannot protect against. glob.glob() does not promise to list matching files in any specific order. If the authors wrote code whose results vary based on the particular order files are processed in, that's a bug. It's their responsibility to order the files appropriately. Obviously, glob COULD alphabetize its results. But that order is generally different from ordering by creation time. Which is, in turn, different from ordering by modification time or file size. I don't want to prohibit glob from doing any of these if the filesystem happens to make one easier (this is really a filesystem question more than an OS question). But I also don't want to make the order part of the semantics of the function... nor have it do extra work to "randomize" the order to avoid some pattern that may happen to exist on some platform.

I'm not sure what point you are making really. Any particular implementation will have some behaviors that are not guaranteed by the language spec. I wouldn't want to PROHIBIT a set implementation that preserved insertion order. Nor, for example, would I want to prohibit one that stored string elements in alphabetical order, if that somehow had an independent performance advantage. But that's very different from wanting to REQUIRE sets to iterate in alphabetical order. If they did that, it wouldn't be *wrong*, but it also shouldn't be something we rely on.
On 12/16/19 10:59 AM, Tim Peters wrote:
BTW, what should {1, 2} | {3, 4, 5, 6, 7}
return as ordered sets? Beats me ;-)
The obvious answer is {1, 2, 3, 4, 5, 6, 7}. Which is the result I got in Python 3.8 just now ;-) But that's just a side-effect of how hashing numbers works, the implementation, etc. It's rarely stable like this, and nearly any other input would have resulted in the scrambling we all (currently) expect to see.

>>> {"apples", "peaches", "pumpkin pie"} | {"who's", "not", "ready", "holler", "I"}
{'pumpkin pie', 'peaches', 'I', "who's", 'holler', 'ready', 'apples', 'not'}

//arry/
[Tim]
BTW, what should
{1, 2} | {3, 4, 5, 6, 7}
return as ordered sets? Beats me ;-)
[Larry]
The obvious answer is {1, 2, 3, 4, 5, 6, 7}.
Why? An obvious implementation that doesn't ignore performance entirely is:

    def union(smaller, larger):
        if len(larger) < len(smaller):
            smaller, larger = larger, smaller
        result = larger.copy()
        for x in smaller:
            result.add(x)
        return result

In the example, that would first copy {3, 4, 5, 6, 7}, and then add 1 and 2 (in that order) to it, giving {3, 4, 5, 6, 7, 1, 2} as the result.

If it's desired that "insertion order" be consistent across runs, platforms, and releases, then what "insertion order" _means_ needs to be rigorously defined & specified for all set operations. This was comparatively trivial for dicts, because there are, e.g., no commutative binary operators defined on dicts.

If you want to insist that `a | b` first list all the elements of a, and then all the elements of b that weren't already in a, regardless of cost, then you create another kind of unintuitive surprise: in general the result of "a | b" will display differently than the result of "b | a" (although the results will compare equal), and despite that the _user_ didn't "insert" anything.
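To make the order consequence concrete, the same shape of algorithm can be emulated today with dicts standing in for ordered sets (a sketch under that assumption, not a proposal for set semantics):

    >>> def union(a, b):
    ...     smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    ...     result = dict(larger)                   # copy the larger operand first
    ...     result.update(dict.fromkeys(smaller))   # then append the smaller one's keys
    ...     return result
    ...
    >>> list(union(dict.fromkeys([1, 2]), dict.fromkeys([3, 4, 5, 6, 7])))
    [3, 4, 5, 6, 7, 1, 2]

The smaller operand's elements land at the end regardless of which side of the | they appeared on, which is exactly the surprise being described.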
[Raymond]
... * The ordering we have for dicts uses a hash table that indexes into a sequence. That works reasonably well for typical dict operations but is unsuitable for set operations where some common use cases make interspersed additions and deletions (that is why the LRU cache still uses a cheaply updated doubly-linked list rather than deleting and reinserting dict entries).
I'm going to take a stab at fleshing that out a bit: ideally, an ordered dict stores hash+key+value records in a contiguous array with no "holes". That's ("no holes") where the memory savings comes from. The "holes" are in the hash table, which only holds indices _into_ the contiguous array (and the smallest width of C integer indices sufficient to get to every slot in the array).

"The problem" is that deletions leave _two_ kinds of holes: one in the hash table, and another in the contiguous vector. The latter holes cannot be filled with subsequent new hash+key+value records because that would break insertion order. So in an app that mixes additions and deletions, the ordered dict needs to be massively rearranged at times to squash out the holes left behind by deletions, effectively moving all the holes to "the end", where they can again be used to reflect insertion order.

Unordered dicts never had to be rearranged unless the total size changed "a lot", and that remains true of the set implementation. But in apps mixing adds and deletes, ordered dicts can need massive internal rearrangement at times even if the total size never changes by more than 1. Algorithms doing a lot of mixing of adds and deletes seem a lot more common for sets than for dicts, and so the ordered dict's basic implementation _approach_ is a lot less suitable for sets. Or, at least, that's my best attempt to flesh out Raymond's telegraphic thinking there.

Note: the holes left by deletions _wouldn't_ be "a problem" _except_ for maintaining insertion order. If we were only after the memory savings, then on deletion "the last" record in the contiguous array could be moved into the hole at once, leaving the array hole-free again. But that also changes the order. IIRC, that's what the original "compact dict" _did_ do.
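Here's a toy rendition of that two-array layout, just to make the holes visible (an illustration only, not CPython's actual code; it omits resizing and compaction, so it only behaves for a handful of keys):

    DUMMY = object()   # tombstone for index slots whose entry was deleted

    class ToyCompactDict:
        def __init__(self):
            self._indices = [None] * 8   # sparse: hash table of indices into _entries
            self._entries = []           # dense: (hash, key, value) in insertion order

        def _probe(self, h):
            # Linear probing for simplicity; CPython uses a perturbed sequence.
            mask = len(self._indices) - 1
            i = h & mask
            while True:
                yield i
                i = (i + 1) & mask

        def __setitem__(self, key, value):
            h = hash(key)
            for i in self._probe(h):
                idx = self._indices[i]
                if idx is None:
                    self._indices[i] = len(self._entries)
                    self._entries.append((h, key, value))  # always appended: order preserved
                    return
                if idx is not DUMMY and self._entries[idx][1] == key:
                    self._entries[idx] = (h, key, value)   # overwrite existing key in place
                    return

        def __delitem__(self, key):
            h = hash(key)
            for i in self._probe(h):
                idx = self._indices[i]
                if idx is None:
                    raise KeyError(key)
                if idx is not DUMMY and self._entries[idx][1] == key:
                    self._indices[i] = DUMMY
                    self._entries[idx] = None   # a hole: can't be reused without
                    return                      # breaking insertion order

        def __iter__(self):
            return (e[1] for e in self._entries if e is not None)

Mix adds and deletes and the holes in _entries pile up; squashing them out is the O(N) rearrangement described above.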
[Petr Viktorin <encukou@gmail.com>]
... Originally, making dicts ordered was all about performance (or rather memory efficiency, which falls in the same bucket.) It wasn't added because it's better semantics-wise.
As I tried to flesh out a bit in a recent message, the original "compact dict" idea got all the memory savings, but did _not_ maintain insertion order. Maintaining insertion order, too, complicated deletions (see that recent message), but it was deliberately done because people really did want "ordered dicts".
Here's one (very simplified and maybe biased) view of the history of dicts:
* 2.x: Dicts are unordered, please don't rely on the order.
* 3.3: Dict iteration order is now randomized. We told you not to rely on it!
* 3.6: We now use an optimized implementation of dict that saves memory! As a side effect it makes dicts ordered, but that's an implementation detail, please don't rely on it.
* 3.7: Oops, people are now relying on dicts being ordered. Insisting on people not relying on it is battling windmills. Also, it's actually useful sometimes, and alternate implementations can support it pretty easily. Let's make it a language feature! (Later it turns out MicroPython can't support it easily. Tough luck.)
A very nice summary! My only quibble is as above: the "compact dict" implementation doesn't maintain insertion order naturally, _unless_ there are no deletions (which is, e.g., true of dicts constructed to pass keyword arguments). The code got hairier to maintain insertion order in the presence of mixing insertions and deletions.
On 12/16/19 6:30 PM, Tim Peters wrote:
If it's desired that "insertion order" be consistent across runs, platforms, and releases, then what "insertion order" _means_ needs to be rigorously defined & specified for all set operations. This was comparatively trivial for dicts, because there are, e.g., no commutative binary operators defined on dicts.
My intuition is that figuring out sensible semantics here is maybe not trivial, but hardly impossible. If this proposal goes anywhere I'd be willing to contribute to figuring it out.
If you want to insist that `a | b` first list all the elements of a, and then all the elements of b that weren't already in a, regardless of cost, then you create another kind of unintuitive surprise: in general the result of "a | b" will display differently than the result of "b | a" (although the results will compare equal), and despite that the _user_ didn't "insert" anything.
Call me weird--and I won't disagree--but I find nothing unintuitive about that. After all, that's already the world we live in: there are any number of sets that compare as equal but display differently. In current Python:

>>> a = {'a', 'b', 'c'}
>>> d = {'d', 'e', 'f'}
>>> a | d
{'f', 'e', 'd', 'a', 'b', 'c'}
>>> d | a
{'f', 'b', 'd', 'a', 'e', 'c'}
>>> a | d == d | a
True

This is also true for dicts, in current Python, which of course do maintain insertion order. Dicts don't have the | operator, so I approximate the operation by duplicating the dict (which AFAIK preserves insertion order) and using update.

>>> aa = {'a': 1, 'b': 1, 'c': 1}
>>> dd = {'d': 1, 'e': 1, 'f': 1}
>>> x = dict(aa)
>>> x.update(dd)
>>> y = dict(dd)
>>> y.update(aa)
>>> x
{'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1}
>>> y
{'d': 1, 'e': 1, 'f': 1, 'a': 1, 'b': 1, 'c': 1}
>>> x == y
True

Since dicts already behave in exactly that way, I don't think it would be too surprising if sets behaved that way too. In fact, I think it's a little surprising that they behave differently, which I suppose was my whole thesis from the beginning.

Cheers,

//arry/
On 12/16/19 7:43 PM, Tim Peters wrote:
[Petr Viktorin <encukou@gmail.com>]
Here's one (very simplified and maybe biased) view of the history of dicts:

* 2.x: Dicts are unordered, please don't rely on the order.
* 3.3: Dict iteration order is now randomized. We told you not to rely on it!
* 3.6: We now use an optimized implementation of dict that saves memory! As a side effect it makes dicts ordered, but that's an implementation detail, please don't rely on it.
* 3.7: Oops, people are now relying on dicts being ordered. Insisting on people not relying on it is battling windmills. Also, it's actually useful sometimes, and alternate implementations can support it pretty easily. Let's make it a language feature! (Later it turns out MicroPython can't support it easily. Tough luck.)

A very nice summary! My only quibble is as above: the "compact dict" implementation doesn't maintain insertion order naturally, _unless_ there are no deletions (which is, e.g., true of dicts constructed to pass keyword arguments). The code got hairier to maintain insertion order in the presence of mixing insertions and deletions.
Didn't some paths also get sliiiiightly slower as a result of maintaining insertion order when mixing insertions and deletions? My recollection is that that was part of the debate--not only "are we going to regret inflicting these semantics on posterity, and on other implementations?", but also "are these semantics worth the admittedly-small performance hit, in Python's most important and most-used data structure?".

Also, I don't recall anything about us resigning ourselves to explicitly maintain ordering on dicts because people were relying on it, "battling windmills", etc. Dict ordering had never been guaranteed, a lack of guarantee Python had always taken particularly seriously. Rather, we decided it was a desirable feature, and one worth pursuing even at the cost of a small loss of performance.

One prominent Python core developer** wanted this feature for years, and I recall them saying something like:

Guido says, "When a programmer iterates over a dictionary and they see the keys shift around when the dictionary changes, they learn something!" To that I say--"Yes! They learn that Python is unreliable and random!"

//arry/

** I won't give their name here because I fear I'm misquoting everybody involved. Apologies in advance if that's the case!
[Tim]
If it's desired that "insertion order" be consistent across runs, platforms, and releases, then what "insertion order" _means_ needs to be rigorously defined & specified for all set operations. This was comparatively trivial for dicts, because there are, e.g., no commutative binary operators defined on dicts.
[Larry]
My intuition is that figuring out sensible semantics here is maybe not trivial, but hardly impossible. If this proposal goes anywhere I'd be willing to contribute to figuring it out.
No, it _is_ easy. It's just tedious, adding piles of words to the docs, and every constraint also constrains possible implementations. You snipped my example of implementing union, which you should really think about instead ;-)
If you want to insist that `a | b` first list all the elements of a, and then all the elements of b that weren't already in a, regardless of cost, then you create another kind of unintuitive surprise: in general the result of "a | b" will display differently than the result of "b | a" (although the results will compare equal), and despite that the _user_ didn't "insert" anything.
Call me weird--and I won't disagree--but I find nothing unintuitive about that. After all, that's already the world we live in: there are any number of sets that compare as equal but display differently. In current Python:
>>> a = {'a', 'b', 'c'}
>>> d = {'d', 'e', 'f'}
>>> a | d
{'f', 'e', 'd', 'a', 'b', 'c'}
>>> d | a
{'f', 'b', 'd', 'a', 'e', 'c'}
Yup, it happens. But under the sample union implementation I gave, it would never happen when the sets had different cardinalities (the different sizes are used to force a "standard" order then). For mathematical sets, | is commutative - it makes no difference to the result if the arguments are swapped - but it can make a _big_ difference to implementation performance unless the implementation is free to pick the better order.
... This is also true for dicts, in current Python, which of course do maintain insertion order. Dicts don't have the | operator, so I approximate the operation by duplicating the dict (which AFAIK preserves insertion order)
Ya, it does, but I don't believe that's documented (it should be).
and using update.
Too different to be interesting here - update() isn't commutative. For sets, union, intersection, and symmetric difference are commutative.
... Since dicts already behave in exactly that way, I don't think it would be too surprising if sets behaved that way too. In fact, I think it's a little surprising that they behave differently, which I suppose was my whole thesis from the beginning.
I appreciate that dicts and sets behave differently in visible ways now. It just doesn't bother me enough that I'm cool with slowing set operations to "fix that".
[Larry]
Didn't some paths also get sliiiiightly slower as a result of maintaining insertion order when mixing insertions and deletions?
I paid no attention at the time. But in going from "compact dict" to "ordered dict", deletion all by itself got marginally cheaper. The downside was the need to rearrange the whole dict when too many "holes" built up. "Compact (but unordered) dict" doesn't need that.
My recollection is that that was part of the debate--not only "are we going to regret inflicting these semantics on posterity, and on other implementations?", but also "are these semantics worth the admittedly-small performance hit, in Python's most important and most-used data structure?".
There's a layer of indirection in compact dicts - lookups are distributed across two arrays. In non-compact unordered dicts, everything is in a single array. Cache effects may or may not make a measurable difference then, depending on all sorts of things.
Also, I don't recall anything about us resigning ourselves to explicitly maintain ordering on dicts because people were relying on it, "battling windmills", etc. Dict ordering had never been guaranteed, a lack of guarantee Python had always taken particularly seriously. Rather, we decided it was a desirable feature, and one worth pursuing even at the cost of a small loss of performance.
I'm not convinced there was a loss of performance. The delay between the implementation supplying ordered dicts, and the language guaranteeing it, was, I suspect, more a matter of getting extensive real-world experience about whether the possible new need to massively rearrange dict internals to remove "holes" would bite someone too savagely to live with. But, again, I wasn't paying attention at the time.
One prominent Python core developer** wanted this feature for years, and I recall them saying something like:
Guido says, "When a programmer iterates over a dictionary and they see the keys shift around when the dictionary changes, they learn something!" To that I say--"Yes! They learn that Python is unreliable and random!"
I never wanted ordered dicts, but never objected to them either. All in all, given that _I_ haven't seen a performance degradation, I like that they're more predictable, and love the memory savings. But as Raymond said (& I elaborated on), there's reason to believe that the implementation of ordered dicts is less suited to sets, where high rates of mixing adds and deletes is more common (thus triggering high rates of massive internal dict rearranging). Which is why I said much earlier that I'd be +1 on ordered sets only when there's an implementation that's as fast as what we have now. Backing up:
Python is the language where speed, correctness, and readability trump performance every time.
Speed trumping performance didn't make any sense ;-) So assuming you didn't intend to type "speed", I think you must have, say, Scheme or Haskell in mind there. "Practicality beats purity" is never seen on forums for those languages. Implementation speed & pain have played huge roles in many Python decisions. As someone who has wrestled with removing the GIL, you should know that better than anyone ;-)
On 12/16/19 10:32 PM, Tim Peters wrote:
[Larry]
Python is the language where speed, correctness, and readability trump performance every time. Speed trumping performance didn't make any sense ;-)
Sorry, that /was/ super unclear! I honestly meant speed of /development/. D'oh, //arry/
On Sun, Dec 15, 2019 at 09:00:31PM -0800, Larry Hastings wrote:
I think we should decide the question "should set objects maintain insertion order?" literally without any consideration about performance implications.
In your opening post:
I also want FAST lookup because I sometimes remove jobs when they're subsequently marked "not ready". [emphasis added]
So do you care about performance or not? :-) If you do care (a bit) about performance, what slowdown would you be willing to take to get ordered sets? That would give us a good idea of the potential trade-offs that might be acceptable. Without being facetious[1] if you don't care about performance, you don't need a set, you could use a list. There's a serious point here: there's nothing sets can do that couldn't be implemented using lists. The reason we have sets, rather than a bunch of extra methods like intersection and symmetric_difference on lists, is because we considered performance. [1] Well... maybe a tiny bit... *wink* -- Steven
On 12/17/19 2:02 AM, Steven D'Aprano wrote:
Without being facetious[1] if you don't care about performance, you don't need a set, you could use a list.
Lists don't enforce uniqueness. Apart from that a list would probably work fine for my needs; in my admittedly-modest workloads I would probably never notice a performance difference. My anecdote was merely a jumping-off point for the discussion.

"I don't care about performance" is not because I'm aching for Python to run my code slowly. It's because I'm 100% confident that the Python community will lovingly optimize the implementation. So when I have my language designer hat on, I really don't concern myself with performance. As I thought I said earlier in the thread, I think we should figure out the semantics we want /first,/ and /then/ we figure out how to make it fast.

I'll also cop to "a foolish consistency is the hobgoblin of little minds". I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values". I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.

//arry/
On Tue, 17 Dec 2019 at 11:13, Larry Hastings <larry@hastings.org> wrote:
I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values". I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.
I'll admit to a mathematical background, which probably influences my views. But I view sets as "collections" of values - values are "in" the set or not. I don't see them operationally, in the sense that you add and remove things from the set. Adding and removing are mechanisms for changing the set, but they aren't fundamentally what sets are about (which for me is membership). So insertion order on sets is largely meaningless for me (it doesn't matter *how* it got there, what matters is whether it's there now). Having said that, I also see dictionaries as mappings from key to value, so insertion order is mostly not something I care about for dictionaries either. I can see the use cases for ordered dictionaries, and now we have insertion order, it's useful occasionally, but it's not something that was immediately obvious to me based on my intuition. Similarly, I don't see sets as *needing* insertion order, although I concede that there probably are use cases, similar to dictionaries. The question for me, as with dictionaries, is whether the cost (in terms of clear semantics, constraints on future implementation choices, and actual performance now) is justified by the additional capabilities (remembering that personally, I'd consider insertion order preservation as very much a niche requirement). Paul
On Tue, 17 Dec 2019 at 11:48, Paul Moore <p.f.moore@gmail.com> wrote:
As a random data point, I often see the pattern where one needs to remove duplicates from the list while preserving the order of first appearance. This is for example needed to get stability in various type-checking situations (like union items, type variables in base classes, type queries etc.) One can write a four line helper to achieve this, but I can see that having order preserving set could be useful. So again, it is "nice to have but not really critical". -- Ivan
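For reference, the four-line helper Ivan mentions usually looks something like this (a common recipe, not a stdlib function):

    def dedup(items):
        seen = set()
        add = seen.add
        return [x for x in items if not (x in seen or add(x))]

so that dedup(['b', 'a', 'c', 'a', 'b']) returns ['b', 'a', 'c'].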
On 17.12.19 14:05, Ivan Levkivskyi wrote:
As a random data point, I often see the pattern where one needs to remove duplicates from the list while preserving the order of first appearance. This is for example needed to get stability in various type-checking situations (like union items, type variables in base classes, type queries etc.)
One can write a four line helper to achieve this, but I can see that having order preserving set could be useful. So again, it is "nice to have but not really critical".
This is a one-liner: list(dict.fromkeys(iterable))
... [Larry]
One prominent Python core developer** wanted this feature for years, and I recall them saying something like:
Guido says, "When a programmer iterates over a dictionary and they see the keys shift around when the dictionary changes, they learn something!" To that I say--"Yes! They learn that Python is unreliable and random!"
[Tim]
I never wanted ordered dicts, but never objected to them either. All in all, given that _I_ haven't seen a performance degradation, I like that they're more predictable, and love the memory savings.
That reads like I was correcting Larry for misquoting _me_. Sorry! He wasn't quoting me, and I didn't think he was. What he did quote was delightfully snarky enough that I didn't want to snip it, but not substantial enough to be worth the bother of addressing. So I just moved on to summarize what I said at the time (i.e., nothing).
[Larry]
"I don't care about performance" is not because I'm aching for Python to run my code slowly. It's because I'm 100% confident that the Python community will lovingly optimize the implementation.
I'm not ;-)
So when I have my language designer hat on, I really don't concern myself with performance. As I thought I said earlier in the thread, I think we should figure out the semantics we want first, and then we figure out how to make it fast.
Pretty much the opposite of how Python started. The original lists and dicts were deliberately designed to have dirt simple implementations, in reaction against ABC's "theoretically optimal" data structures that were a nightmare to maintain, and had so much overhead to support "optimality" in all cases that they were, in fact, much slower in common cases.

There isn't magic waiting to be uncovered here: if you want O(1) deletion at arbitrary positions in an ordered sequence, a doubly linked list is _the_ way to do it. That's why, e.g., Raymond said he still uses a doubly linked list, instead of an ordered dict, in the LRU cache implementation. If that isn't clear, a cache can be maintained in least-to-most recently accessed order with an ordered dict like so:

    if key in cache:
        cached_value = cache.pop(key)  # remove key
    else:
        cached_value = ...             # compute cached_value
    assert key not in cache
    cache[key] = cached_value          # most recently used at the end now
    return cached_value

and deleting the least recently used is just "del cache[next(iter(cache))]" (off topic, just noting this is a fine use for the "first()" function that's the subject of a different thread).

We _could_ structure an ordered dict's hash+key+value records as a doubly linked list instead (and avoid the need for O(N) reorganizations). But then we piss away much of the memory savings (to store the new links) that was the _point_ of compact dicts to begin with. So there was a compromise: no links, deletion on its own is O(1), but it can periodically require O(N) under-the-covers table reorganizations to squash out the holes. "Suitable enough" for ordered dicts, but _thought_ to be unsuitable for ordered sets (which appear to have higher rates of mixing deletions with insertions - the LRU cache example being an exception).

But there are also other optimizations in the current set implementation, so "fine, add the doubly linked list to sets but not to dicts" is only part of it. Which may or may not be possible to match, let alone beat, in an ordered set implementation.

A practical barrier now is that Python is too mature to bank on loving optimizations _after_ a change to a core feature is released. It's going to need a highly polished implementation first. I certainly don't object to anyone trying, but it won't be me ;-)
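Assembling Tim's fragment into a self-contained sketch (compute and maxsize are illustrative names; this demonstrates the dict-ordering trick and is not how functools.lru_cache actually works):

    class LRUCache:
        def __init__(self, compute, maxsize=128):
            self.compute = compute
            self.maxsize = maxsize
            self.cache = {}   # insertion order doubles as recency order

        def __call__(self, key):
            if key in self.cache:
                value = self.cache.pop(key)          # remove the key...
            else:
                value = self.compute(key)
                if len(self.cache) >= self.maxsize:
                    del self.cache[next(iter(self.cache))]  # evict least recently used
            self.cache[key] = value                  # ...reinsert as most recently used
            return value

Every hit is a pop followed by a reinsertion - i.e., a deletion plus an insertion - which is exactly the mixed workload that forces the periodic O(N) rearrangements described above.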
On Wed., 18 Dec. 2019, 5:51 am Tim Peters, <tim.peters@gmail.com> wrote:
But there are also other optimizations in the current set implementation, so "fine, add the doubly linked list to sets but not to dicts" is only part of it.
Which may or may not be possible to match, let alone beat, in an ordered set implementation. A practical barrier now is that Python is too mature to bank on loving optimizations _after_ a change to a core feature is released. It's going to need a highly polished implementation first.
Starting with "collections.OrderedSet" seems like a reasonable idea, though - that way "like a built-in set, but insertion order preserving" will have an obvious and readily available answer, and it should also make performance comparisons easier. Cheers, Nick. ______________________________
[Nick Coghlan <ncoghlan@gmail.com>]
Starting with "collections.OrderedSet" seems like a reasonable idea, though - that way "like a built-in set, but insertion order preserving" will have an obvious and readily available answer, and it should also make performance comparisons easier.
Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it. The problem is that the "use case" here isn't really a matter of functionality, but of consistency: "it would be nice if", like dicts enjoy now, the iteration order of sets was defined in an implementation-independent way. That itch isn't scratched unless the builtin set type defines it.
OrderedSet implementations:

- https://github.com/methane/cpython/pull/23/files
- https://pypi.org/search/?q=orderedset
- https://pypi.org/project/orderedset/
- https://code.activestate.com/recipes/576694/
- https://pypi.org/project/ordered-set/
- https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/base.py... (pandas' Index types)
- https://pypi.org/project/sortedcollections/

[Ordered] Sets and some applications:

- https://en.wikipedia.org/wiki/Set_(mathematics)
- https://en.wikipedia.org/wiki/Set_notation
- https://en.wikipedia.org/wiki/Set-builder_notation
- https://en.wikipedia.org/wiki/Glossary_of_set_theory
- https://en.wikipedia.org/wiki/Set_(abstract_data_type)
- Comparators
- https://en.wikipedia.org/wiki/Total_order
- https://en.wikipedia.org/wiki/Partially_ordered_set
- https://en.wikipedia.org/wiki/Causal_sets

On Thu, Dec 19, 2019 at 12:05 PM Tim Peters <tim.peters@gmail.com> wrote:
[Nick Coghlan <ncoghlan@gmail.com>]
Starting with "collections.OrderedSet" seems like a reasonable idea, though - that way "like a built-in set, but insertion order preserving" will have an obvious and readily available answer, and it should also make performance comparisons easier.
Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it.
The problem is that the "use case" here isn't really a matter of functionality, but of consistency: "it would be nice if", like dicts enjoy now, the iteration order of sets was defined in an implementation-independent way. That itch isn't scratched unless the builtin set type defines it.
On Fri., 20 Dec. 2019, 2:58 am Tim Peters, <tim.peters@gmail.com> wrote:
[Nick Coghlan <ncoghlan@gmail.com>]
Starting with "collections.OrderedSet" seems like a reasonable idea, though - that way "like a built-in set, but insertion order preserving" will have an obvious and readily available answer, and it should also make performance comparisons easier.
Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it.
The problem is that the "use case" here isn't really a matter of functionality, but of consistency: "it would be nice if", like dicts enjoy now, the iteration order of sets was defined in an implementation-independent way. That itch isn't scratched unless the builtin set type defines it.
I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built-in sets aren't good), but combined with low-cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough to me to be worthy of a stdlib solution that doesn't force you to either separate a "job id" from the "job" object in an ordered dict, or else use an ordered dict with "None" values. So while it means answering "No" to the "Should builtin sets preserve order?" question (at least for now), adding collections.OrderedSet *would* address that "duplicate free pending task queue" use case. Cheers, Nick.
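[Editor's note: for concreteness, the dict-with-None workaround Nick describes might look like this minimal sketch; the names are illustrative, not from the thread:]

    # Duplicate-free FIFO work queue: the dict keys act as an ordered set.
    pending = {}   # job -> None

    def add_job(job):
        pending.setdefault(job, None)    # re-adding a queued job is a no-op

    def cancel_job(job):
        pending.pop(job, None)           # O(1) removal of an arbitrary job

    def next_job():
        job = next(iter(pending))        # oldest job still pending
        del pending[job]
        return job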
On Thu, Dec 19, 2019 at 5:39 PM Nick Coghlan <ncoghlan@gmail.com> wrote:
I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built-in sets aren't good), but combined with low-cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough to me to be worthy of a stdlib solution that doesn't force you to either separate a "job id" from the "job" object in an ordered dict, or else use an ordered dict with "None" values.
It's not obvious to me that insertion order is even the most obvious or most commonly relevant sort order. I'm sure it is for Larry's program, but often a work queue might want some other order. Very often queues might instead, for example, have a priority number assigned to them.

The widely used third-party `sortedcollections` module maintains either "natural" order or some key order. That could be engineered to be insertion order fairly easily. Admittedly, 'OrderedSet' sounds a bit different from 'SortedSet'.

    In [8]: sortedcontainers.SortedSet(['Wake Up', 'Drink Coffee', 'Make Breakfast'])
    Out[8]: SortedSet(['Drink Coffee', 'Make Breakfast', 'Wake Up'])

    In [9]: sortedcontainers.SortedSet(['Wake Up', 'Drink Coffee', 'Make Breakfast'], key=lambda s: s[-1])
    Out[9]: SortedSet(['Drink Coffee', 'Wake Up', 'Make Breakfast'], key=<function <lambda> at 0x7f68f5985400>)

    In [10]: 'Make Breakfast' in tasks
    Out[10]: True
[Nick]
I took Larry's request a slightly different way:
Sorry, I was unclear: by "use case" I had in mind what appeared to me to be the overwhelming thrust of the _entirety_ of this thread so far, not Larry's original request.
he has a use case where he wants order preservation (so built-in sets aren't good), but combined with low-cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good).
Sure.
Organising a work queue that way seems common enough to me to be worthy of a stdlib solution that doesn't force you to either separate a "job id" from the "job" object in an ordered dict, or else use an ordered dict with "None" values.
So while it means answering "No" to the "Should builtin sets preserve order?" question (at least for now), adding collections.OrderedSet *would* address that "duplicate free pending task queue" use case.
Only Larry can answer whether that would meet his perceived need. My _guess_ is that he wouldn't know OrderedSet existed, and, even if he did, he'd use a dict with None values anyway (because it's less hassle and does everything he wanted). But I have to add that I don't know enough about his original use case to be sure it wasn't "an XY problem": https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem """ The XY problem is asking about your attempted solution rather than your actual problem. That is, you are trying to solve problem X, and you think solution Y would work, but instead of asking about X when you run into trouble, you ask about Y. """ Because I've never had a "job queue" problem where an OrderedSet would have helped. Can't guess what's different about Larry's problem.
[David Mertz <mertz@gnosis.cx>]
It's not obvious to me that insertion order is even the most obvious or most commonly relevant sort order. I'm sure it is for Larry's program, but often a work queue might want some other order. Very often queues might instead, for example, have a priority number assigned to them.
Indeed, and it makes me more cautious that claims for the _usefulness_ (rather than just consistency) of an ordered set are missing an XY problem. The only "natural" value of insertion-order ordering for a dynamic ordered set is that it supports FIFO queues. Well, that's one thing that a deque already excels at, but much more obviously, flexibly, and space- and time- efficiently. For dicts the motivating use cases were much more "static", like preserving textual order of keyword argument specifications, and avoiding gratuitous changing of order when round-tripping key-value encodings like much of JSON. So what problem(s) would a dynamic ordered set really be aiming at? I don't know. Consistency & predictability I understand and appreciate, but little beyond that. Even FIFO ordering is somewhat a PITA, since `next(iter(set))` is an overly elaborate way to need to spell "get the first" if a FIFO queue _is_ a prime dynamic use case. And there's no efficient way at all to access the other end (although I suppose set.pop() would - like dict.popitem() - change to give a _destructive_ way to get at "the other end").
On Fri., 20 Dec. 2019, 1:56 pm Tim Peters, <tim.peters@gmail.com> wrote:
So what problem(s) would a dynamic ordered set really be aiming at? I don't know. Consistency & predictability I understand and appreciate, but little beyond that. Even FIFO ordering is somewhat a PITA, since `next(iter(set))` is an overly elaborate way to need to spell "get the first" if a FIFO queue _is_ a prime dynamic use case. And there's no efficient way at all to access the other end (although I suppose set.pop() would - like dict.popitem() - change to give a _destructive_ way to get at "the other end").
I must admit that I was assuming without stating that a full OrderedSet implementation would support the MutableSequence interface. Dictionaries can't do that because they already support the MutableMapping interface. Cheers, Nick.
[Nick]
I must admit that I was assuming without stating that a full OrderedSet implementation would support the MutableSequence interface.
Efficient access via index position too would be an enormous new requirement. My bet: basic operations would need to change from O(1) to O(log(N)). BTW, in previous msgs there are links to various implementations calling themselves "ordered sets". One of them supplies O(1) indexing, but at the expense of making deletion O(N) (!): https://pypi.org/project/ordered-set/ If efficient indexing is really wanted, then the original "use case" Larry gave was definitely obscuring an XY problem ;-)
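[Editor's note: that tradeoff falls out directly if the ordered set keeps its items in a list alongside the hash table. A sketch of that design - not the actual package's code:]

    class ListBackedOrderedSet:
        # O(1) add and O(1) indexing, but remove() must shift the tail: O(N).
        def __init__(self):
            self._items = []     # insertion order, supports indexing
            self._index = {}     # item -> position in _items

        def add(self, item):
            if item not in self._index:
                self._index[item] = len(self._items)
                self._items.append(item)

        def __getitem__(self, i):
            return self._items[i]

        def remove(self, item):
            pos = self._index.pop(item)
            del self._items[pos]              # O(N) shift of the tail
            for it in self._items[pos:]:      # fix up shifted positions
                self._index[it] -= 1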
How slow and space-inefficient would it be to just implement the set methods on top of dict? Do dicts lose insertion order when a key is deleted? AFAIU, OrderedDict does not lose insertion order on delete. Would this limit the utility of an ordered set as a queue? What set methods does a queue need to have? On Thu, Dec 19, 2019, 11:41 PM Tim Peters <tim.peters@gmail.com> wrote:
[Nick]
I must admit that I was assuming without stating that a full OrderedSet implementation would support the MutableSequence interface.
Efficient access via index position too would be an enormous new requirement. My bet: basic operations would need to change from O(1) to O(log(N)).
BTW, in previous msgs there are links to various implementations calling themselves "ordered sets". One of them supplies O(1) indexing, but at the expense of making deletion O(N) (!):
https://pypi.org/project/ordered-set/
If efficient indexing is really wanted, then the original "use case" Larry gave was definitely obscuring an XY problem ;-)
On Fri, Dec 20, 2019 at 4:15 PM Wes Turner <wes.turner@gmail.com> wrote:
How slow and space-inefficient would it be to just implement the set methods on top of dict?
Speed: Dict doesn't cache the position of the first item. Calling next(iter(D)) repeatedly is O(N) in worst case. Space: It wastes 8 bytes per member.
Do dicts lose insertion order when a key is deleted? AFAIU, OrderedDict does not lose insertion order on delete.
Dict keeps insertion order after deletion too.
Would this limit the utility of an ordered set as a queue? What set methods does a queue need to have?
I want O(1) D.popleft(). (The name is borrowed from deque; popfirst() might be a better name.) -- Inada Naoki <songofacandy@gmail.com>
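[Editor's note: spelled out on top of today's dict, the operation Inada-san wants would be something like the sketch below - and, as he notes, it is *not* O(1) in current CPython:]

    def popleft(d):
        # next(iter(d)) must skip any "holes" left by earlier deletions,
        # so repeated calls on a draining dict degrade toward O(N) each.
        key = next(iter(d))
        del d[key]
        return key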
On Thu, Dec 19, 2019 at 9:57 PM Tim Peters <tim.peters@gmail.com> wrote:
It's not obvious to me that insertion order is even the most obvious or most commonly relevant sort order. I'm sure it is for Larry's program, but often a work queue might want some other order. Very often queues might instead, for example, have a priority number assigned to them.
Indeed, and it makes me more cautious that claims for the _usefulness_ (rather than just consistency) of an ordered set are missing an XY problem. The only "natural" value of insertion-order ordering for a dynamic ordered set is that it supports FIFO queues. Well, that's one thing that a deque already excels at, but much more obviously, flexibly, and space- and time- efficiently.
I agree with Tim and David. Furthermore, I'd like to point out that the core built-in types are frequently archetypes or foundational paradigms for many derived concepts in 3rd-party libraries. Despite the fact that they have implementations, they also form a basis set of interfaces which are part of the lexicon of "Pythonic-ness".

Something like a "set" is a very, very common and useful base interface for programming. If we were to cloud or confound the (relatively clean) semantics around sets & set algebra by introducing ordinality, that same cloudiness would spread throughout the ecosystem. Unlike dict()s, which are a more computer science-y data structure with lots of details, sets are simple things with mathematical algebraic properties. These allow the user to compactly specify *what* they want done, without delving into the details of *how*. In my experience, this is always preferable, for ease of use, ease of learning, and correctness. Adding orderedness as a concern now imposes an additional cognitive burden onto the coder, because they must be concerned with *how*.

Ultimately, even if we establish that there is major utility in insertion-ordered sets, unordered sets also clearly have utility as well (as a more relaxed interface, with fewer promises). So in that case, which do we want as the default? I think about it like this: in a future time when sets maintain insertion order by default, every time I see library docs that require casting sets via collections.unordered_set(), it will feel like sand in the semantic vaseline. I would much rather have things the other way around.

I don't really have a horse in this race, and I'm sure the dev community will arrive at a good resolution to this issue. However, I would just encourage folks to think less about "utility of the implementation", and more about "economy of semantics". As Larry pointed out earlier, ordered dicts posed a problem for MicroPython. This actually isn't a minor issue -- in the long run, forked semantics are a Bad Thing for the language (even for a language that is incorrectly eponymized with snakes!).

-Peter
Got it. Thanks.
[Wes Turner <wes.turner@gmail.com>]
How slow and space-inefficient would it be to just implement the set methods on top of dict?
[Inada Naoki <songofacandy@gmail.com>]
Speed: Dict doesn't cache the position of the first item. Calling next(iter(D)) repeatedly is O(N) in worst case. ...
See also Raymond's (only) message in this thread. We would also lose low-level speed optimizations specific to the current set implementation. And we would need to define what "insertion order" _means_ for operators (union, intersection, symmetric difference) that build a new set out of other sets without a user-level "insert" in sight. However that's defined, it may constrain possible implementations in ways that are inherently slower (for example, it may require the implementation to iterate over the larger operand where it currently iterates over the smaller).

And the current set implementation (like the older dict implementation) never needs to "rebuild the table from scratch" unless the cardinality of the set keeps growing. As Raymond telegraphically hinted, the current dict implementation can, at times, in the presence of deletions, require rebuilding the table from scratch even if the dict's maximum size remains bounded. That last can't be seen directly from Python code (rebuilding the table is invisible at the Python level). Here's a short example:

    d = dict.fromkeys(range(3))
    while True:
        del d[0]
        d[0] = None

Run that with a breakpoint set on dictobject.c's `dictresize()` function. You'll see that it rebuilds the table from scratch every 3rd time through the loop. In effect, for the purpose of deciding whether it needs to rebuild, the current dict implementation sees no difference between adding a new element and deleting an existing element. Deleting leaves behind "holes" that periodically have to be squashed out of existence so that insertion order can be maintained in a dead simple way upon future additions.
On Sat, Dec 21, 2019 at 3:17 AM Tim Peters <tim.peters@gmail.com> wrote:
[Wes Turner <wes.turner@gmail.com>]
How slow and space-inefficient would it be to just implement the set methods on top of dict?
[Inada Naoki <songofacandy@gmail.com>]
Speed: Dict doesn't cache the position of the first item. Calling next(iter(D)) repeatedly is O(N) in worst case. ...
See also Raymond's (only) message in this thread. We would also lose low-level speed optimizations specific to the current set implementation.
I read this thread, and I understand all of those points. I just meant that the performance of next(iter(D)) is the most critical part when you implement an ordered set on top of the current dict and use it as a queue. This code should be O(N), but it's O(N^2) if q is implemented on top of the dict:

    while q:
        item = q.popleft()

Sorry for the confusion.
And the current set implementation (like the older dict implementation) never needs to "rebuild the table from scratch" unless the cardinality of the set keeps growing.
It is a bit misleading. If "the cardinality of the set" means len(S), a set requires rebuilds, at low frequency, if its items are random. Anyway, it is a smaller problem than next(iter(D)) because the amortized cost is O(1). Current dict is not optimized for queue usage. Regards, -- Inada Naoki <songofacandy@gmail.com>
[Inada Naoki <songofacandy@gmail.com>]
I just meant that the performance of next(iter(D)) is the most critical part when you implement an ordered set on top of the current dict and use it as a queue.
Which is a good point. I added a lot more, though, because Wes didn't even mention queues in his question: [Wes Turner <wes.turner@gmail.com>]
How slow and space-inefficient would it be to just implement the set methods on top of dict?
I can't guess what he was _really_ asking about, so now he's got lots of answers ;-)
This code should be O(N), but it's O(N^2) if q is implemented on top of the dict.
    while q:
        item = q.popleft()
Sorry for the confusion.
To flesh this out, the FIFO queue entries are all together in a contiguous vector, with no "holes" in the middle. Pushing a new item appends "to the right" (higher index in the vector), while popping an item leaves "a hole" at the left. But there are no holes "in the middle" in this case. So the first real entry is at a higher index with every pop, but iteration starts at index 0 every time. The more pops there are, the more holes iteration has to skip over, one at a time, to find the first real entry remaining. Until pushing a new item would fall off the right end of the vector. Then the table is rebuilt from scratch, moving all the real entries to form a contiguous block starting at index 0 again.
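[Editor's note: a quick way to watch this happen, if you're curious. Timings are illustrative and CPython-specific; if draining is quadratic, doubling N should roughly quadruple the time:]

    import time

    def drain(n):
        d = dict.fromkeys(range(n))
        start = time.perf_counter()
        while d:
            del d[next(iter(d))]   # re-skips the growing run of holes
        return time.perf_counter() - start

    for n in (10_000, 20_000, 40_000):
        print(n, round(drain(n), 3))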
And the current set implementation (like the older dict implementation) never needs to "rebuild the table from scratch" unless the cardinality of the set keeps growing.
It is a bit misleading. If "the cardinality of the set" means len(S),
Yes.
a set requires rebuilds, at low frequency, if its items are random.
Anyway, it is a smaller problem than next(iter(D)) because the amortized cost is O(1).
For the specific case of FIFO queues, sure. In general, though, "O(1), period" is more desirable than "amortized O(1)". This is especially true when sets get very large.
Current dict is not optimized for queue usage.
Nor should it be :-) That's what collections.deque is for: pushes and pops at both ends, always O(1), period. No holes, no internal rearrangements, and O(1) "wasted" space.
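[Editor's note: the deque behavior Tim is pointing at, in straight stdlib usage - though note a deque gives you no fast membership test or duplicate elimination, which is what set-like semantics were wanted for in the first place:]

    from collections import deque

    q = deque()
    q.append("job1")       # push on the right, O(1)
    q.appendleft("job0")   # push on the left, O(1)
    q.popleft()            # 'job0', O(1)
    q.pop()                # 'job1', O(1)
    # but: "job1" in q is O(N), and duplicates are not detected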
(mixing together messages in the thread, sorry threaded-view readers) On 12/19/19 3:15 PM, Tim Peters wrote:
[Nick]
I took Larry's request a slightly different way: [...] Only Larry can answer whether that would meet his perceived need. My _guess_ is that he wouldn't know OrderedSet existed, and, even if he did, he'd use a dict with None values anyway (because it's less hassle and does everything he wanted).
At last, something I can speak knowledgeably about: Larry's use case! Regarding Larry, I'd say

1. his use case was small enough that almost anything maintaining order would have worked, even a list,
2. he often /does have/ a pretty good idea what goodies are salted away in the Python standard library, and
3. a hypothetical collections.OrderedSet would probably work just fine.

And he'd probably use it too, as that makes the code clearer / easier to read. If you read code using an OrderedSet, you know what the intent was and what the semantics are. If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve. Let's hope those comments are accurate!

Also, speaking personally, at least once (maybe twice?) in this thread folks have asked "would YOU, Mr Larry, really want ordered sets if it meant sets were slightly slower?"

The first thing I'd say is, I'm not sure why people should care about what's best for me. That's sweet of you! But you really shouldn't.

The second thing is, my needs are modest, so the speed hit we're talking about would likely be statistically insignificant, for me.

And the third thing is, I don't really put the set() API through much of a workout anyway. I build sets, I add and remove items, I iterate over them, I do the occasional union, and only very rarely do I use anything else. Even then, when I write code where I reach for a fancy operation like intersection or symmetric_difference--what tends to happen is, I have a rethink and a refactor, and after I simplify my code I don't need the fancy operations anymore. I can't remember the last time I used one of these fancy operations where the code really stuck around for a long time.

So, personally? Sure, I'd take that tradeoff. As already established, I like that dict objects maintain their order, and I think it'd be swell if sets maintained their order too. I suspect the performance hit wouldn't affect /me/ in any meaningful way, and I could benefit from the order-maintaining semantics. I bet it'd have other benefits too, like making regression tests more stable. And my (admittedly uninformed!) assumption is, the loss of performance would mostly be in these sophisticated operations that I never seem to use anyway.

But I don't mistake my needs for the needs of the Python community at large. I'd be mighty interested to read other folks coming forward and saying, for example, "please don't make set objects any slower!" and talking us through neat real-world use cases. Bonus points if you include mind-boggling numbers and benchmarks!
As Larry pointed out earlier, ordered dicts posed a problem for MicroPython.
Just a minor correction: that wasn't me, that was Petr Viktorin. Ten days left until we retire 2.7, //arry/
Larry Hastings wrote:
a hypothetical collections.OrderedSet would probably work just fine.
I'm also in agreement with adding a collections.OrderedSet. There certainly seems to be a practical use case for a set that preserves insertion order, but there has been significant pushback against modifying the existing set implementation (primarily due to performance concerns). If later down the road an improvement to OrderedSet is made that gives it equal or better average performance across the board compared to set, we can consider changing the default implementation *if* there's significant demand for it. Larry Hastings wrote:
And he'd probably use it too, as that makes the code clearer / easier to read. If you read code using an OrderedSet, you know what the intent was and what the semantics are. If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve. Let's hope those comments are accurate!
This is an excellent point. My first thought if someone were to be using a dict with all values set to None would be: why not use a set here? As Larry said, the need for preservation of insertion order would have to be explicitly described in the code comments. Realistically speaking, code comments are not something that can be consistently relied upon, especially if they're part of some boilerplate format that gets replicated or if the container is used in multiple locations. I'd much rather have a dedicated data structure that clearly describes what it does based on the name alone; IMO that's a million times better for readability purposes. Also, this is mostly speculation since I haven't run any benchmarks for an OrderedSet implementation, but I strongly suspect that OrderedSet will end up having better average performance for add, pop, and update than trying to use a dictionary with None values (assuming it's implemented in C or at least with a C accelerator). Not to mention allowing the ability to update (for both addition and removal) based on intersections and unions across ordered sets, which currently isn't possible to do with dictionaries (at least not directly with |=, &=, -=, and ^=). I'm not sure how much of a practical use case this would have for ordered sets specifically, but it's a nice bonus.
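[Editor's note: to make that concrete, dict keys views do support the set operators, but they return plain (unordered) sets, so order-preserving versions have to be spelled by hand. A sketch, as of Python 3.8:]

    a = dict.fromkeys(["w", "x", "y"])
    b = dict.fromkeys(["y", "z"])

    a.keys() & b.keys()    # {'y'} - a real set, insertion order lost

    # Order-preserving equivalents, keeping a's insertion order:
    inter = {k: None for k in a if k in b}        # "&"
    diff  = {k: None for k in a if k not in b}    # "-"
    union = {**a, **b}                            # "|", a's keys first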
To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()).
-- --Guido (mobile)
Even though I was the first person in this thread to suggest collections.OrderedSet, I'm "meh" about it now. As I read more and played with the sortedcollections package, it seemed to me that while I might want a set that iterated in a determinate and meaningful order moderately often, insertion order would make up a small share of those use cases.
To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()).
Good idea. It would be useful to have a baseline comparison. I bolded the comparisons with a notable difference.

Tested on:
Version: Python 3.8.0
OS: Linux kernel 5.4.5
CPU: Intel i5-4460

Initialization (about the same):
timeit.timeit("set(i for i in range(1000))", number=100_000) 4.878508095000143 timeit.timeit("{i: None for i in range(1000)}", number=100_000) 4.9192055170024105
Membership (about the same):
timeit.timeit("random.randint(0, 999) in s", setup="import random; s = set(i for i in range(1000))", number=10_000_000) 7.992674231001729 timeit.timeit("random.randint(0, 999) in d", setup="import random; d = {i: None for i in range(1000)}", number=10_000_000) 8.032404395999038
*Add* (much faster for dicts):
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.330938750001224 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 5.788865385999088
*Update* (much faster for updating sets):
timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i for i in range(1000))", number=1_000_000) 6.2428832090008655 timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None for i in range(1000)}", number=1_000_000) 13.031371458000649
*Create list from keys* (faster for sets):
timeit.timeit("list(s)", setup="s = set(i for i in range(1000))", number=10_00_000) 5.24364303600305 timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}", number=10_00_000) 6.303017043999716
Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and d.popitem() all behave quite differently. Also, I'm sure these benchmarks could be improved significantly, particularly with the "Add" ones. This should provide a decent general idea though.
On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley <aeros167@gmail.com> wrote:
Add (much faster for dicts):
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.330938750001224 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 5.788865385999088
I think this is an artifact of Python not having an empty set literal.
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.275540543720126 timeit.timeit("d = dict(); d[0] = None", number=100_000_000) 13.044076398015022 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 6.088695731014013 timeit.timeit("s = {1}; s.add(0)", number=100_000_000) 9.260965215042233 timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000) 8.75433829985559
When both use the constructor call or both use a literal, the difference is far smaller. I'd call this one a wash. ChrisA
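[Editor's note: the asymmetry Chris identifies is visible in the bytecode - a set literal compiles to a single BUILD_SET opcode, while set() is a name lookup plus a call. Output from CPython 3.8; opcode details vary by version:]

    >>> import dis
    >>> dis.dis("{1}")
      1           0 LOAD_CONST               0 (1)
                  2 BUILD_SET                1
                  4 RETURN_VALUE
    >>> dis.dis("set()")
      1           0 LOAD_NAME                0 (set)
                  2 CALL_FUNCTION            0
                  4 RETURN_VALUE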
David Mertz writes:
Even though I was the first person in this thread to suggest collections.OrderedSet, I'm "meh" about it now. As I read more and played with the sortedcollections package, it seemed to me that while I might want a set that iterated in a determinate and meaningful order moderately often, insertion order would make up a small share of those use cases.
On the other hand, insertion order is one of the most prominent of the determinate, meaningful orders, and it's one you would have to do ugly things with "sorted" to recover. Any application where you have an unreliable message bus feeding a queue (so that you might get duplicate objects, but it's bad to process the same object twice) would be a potential application of insertion-ordered sets. Steve
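[Editor's note: a sketch of that consumer, using the same dict-as-ordered-set trick discussed above; the names are illustrative:]

    def unique_in_order(messages):
        # Later duplicates are dropped; each survivor keeps the position
        # of its first arrival.
        return list(dict.fromkeys(messages))

    unique_in_order(["a", "b", "a", "c", "b"])   # ['a', 'b', 'c']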
On Mon, Dec 23, 2019 at 8:56 PM Kyle Stanley <aeros167@gmail.com> wrote:
To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()).
Good idea. It would be useful to have a baseline comparison. I emboldened the comparisons with a notable difference.
Tested on: Version: Python 3.8.0 OS: Linux kernel 5.4.5 CPU: Intel i5-4460
Initialization (about the same):
timeit.timeit("set(i for i in range(1000))", number=100_000) 4.878508095000143 timeit.timeit("{i: None for i in range(1000)}", number=100_000) 4.9192055170024105
Membership (about the same):
timeit.timeit("random.randint(0, 999) in s", setup="import random; s = set(i for i in range(1000))", number=10_000_000) 7.992674231001729 timeit.timeit("random.randint(0, 999) in d", setup="import random; d = {i: None for i in range(1000)}", number=10_000_000) 8.032404395999038
Add (much faster for dicts):
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.330938750001224 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 5.788865385999088
Update (much faster for updating sets):
timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i for i in range(1000))", number=1_000_000) 6.2428832090008655 timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None for i in range(1000)}", number=1_000_000) 13.031371458000649
Create list from keys (faster for sets):
timeit.timeit("list(s)", setup="s = set(i for i in range(1000))", number=10_00_000) 5.24364303600305 timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}", number=10_00_000) 6.303017043999716
Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and d.popitem() all behave quite differently. Also, I'm sure these benchmarks could be improved significantly, particularly with the "Add" ones. This should provide a decent general idea though.
On Mon, Dec 23, 2019 at 8:12 PM Guido van Rossum <guido@python.org> wrote:
To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()).
On Mon, Dec 23, 2019 at 18:08 Kyle Stanley <aeros167@gmail.com> wrote:
Larry Hastings wrote:
a hypothetical collections.OrderedSet would probably work just fine.
I'm also in agreement with adding a collections.OrderedSet. There certainly seems to be a practical use case for a Set that preserves insertion order, but with significant push back to modifying the existing Set implementation (primarily due to performance concerns).
If later down the road an improvement to OrderedSet is made that gives it equal or better average performance across the board compared to Set, we can consider changing the default implementation if there's significant demand for it.
Larry Hastings wrote:
And he'd probably use it too, as that makes the code clearer / easier to read. If you read code using an OrderedSet, you know what the intent was and what the semantics are. If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve. Let's hope those comments are accurate!
This is an excellent point. My first thought if someone were to be using a Dict with all values set to None would be: why not use a Set here? As Larry said, the need for preservation of insertion order would have to be explicitly described in the code comments. Realistically speaking, code comments are not something that can be consistently relied upon, especially if it's part of some boilerplate format that gets replicated or if the container is used in multiple locations. I'd much rather have a dedicated data structure that clearly describes what it does based on the name alone, IMO that's a million times better for readability purposes.
Also, this is mostly speculation since I haven't run any benchmarks for an OrderedSet implementation, but I strongly suspect that OrderedSet will end up having better average performance for add, pop, and update than trying to use a dictionary with None values (assuming it's implemented in C or at least with a C accelerator).
Not to mention allowing the ability to update (for both addition and removal) based on intersections and unions across ordered sets, which currently isn't possible to do with dictionaries (at least not directly with |=, &=, -=, and ^=). I'm not sure how much of a practical use case this would have for ordered sets specifically, but it's a nice bonus.
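(For concreteness, here's a minimal sketch of what such a class could look like, built on a plain dict plus the collections.abc mixins - the class name and details are assumptions here, not a proposal for the actual implementation. The binary operators come free from collections.abc.Set, whose defaults determine the result order; e.g. its __and__ iterates the right operand:)

from collections.abc import MutableSet

class OrderedSet(MutableSet):
    """A minimal insertion-ordered set backed by a dict (sketch only)."""

    def __init__(self, iterable=()):
        self._d = dict.fromkeys(iterable)

    def __contains__(self, x):
        return x in self._d

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def add(self, x):
        self._d[x] = None

    def discard(self, x):
        self._d.pop(x, None)

    def __repr__(self):
        return f"OrderedSet({list(self._d)!r})"

a = OrderedSet("abcd")
b = OrderedSet("cdef")
print(a | b)    # OrderedSet(['a', 'b', 'c', 'd', 'e', 'f'])
print(a & b)    # OrderedSet(['c', 'd'])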
On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings <larry@hastings.org> wrote:
(mixing together messages in the thread, sorry threaded-view readers)
On 12/19/19 3:15 PM, Tim Peters wrote:
[Nick]
I took Larry's request a slightly different way:
[...]
Only Larry can answer whether that would meet his perceived need. My _guess_ is that he wouldn't know OrderedSet existed, and, even if he did, he'd use a dict with None values anyway (because it's less hassle and does everything he wanted).
At last, something I can speak knowledgeably about: Larry's use case! Regarding Larry, I'd say
- his use case was small enough that almost anything maintaining order would have worked, even a list,
- he often does have a pretty good idea what goodies are salted away in the Python standard library, and
- a hypothetical collections.OrderedSet would probably work just fine.

And he'd probably use it too, as that makes the code clearer / easier to read. If you read code using an OrderedSet, you know what the intent was and what the semantics are. If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve. Let's hope those comments are accurate!
Also, speaking personally, at least once (maybe twice?) in this thread folks have asked "would YOU, Mr Larry, really want ordered sets if it meant sets were slightly slower?"
The first thing I'd say is, I'm not sure why people should care about what's best for me. That's sweet of you! But you really shouldn't.
The second thing is, my needs are modest, so the speed hit we're talking about would likely be statistically insignificant, for me.
And the third thing is, I don't really put the set() API through much of a workout anyway. I build sets, I add and remove items, I iterate over them, I do the occasional union, and only very rarely do I use anything else. Even then, when I reach for a fancy operation like intersection or symmetric_difference, what tends to happen is that I have a rethink and a refactor, and after I simplify my code I don't need the fancy operations anymore. I can't remember the last time I used one of these fancy operations where the code really stuck around for a long time.
So, personally? Sure, I'd take that tradeoff. As already established, I like that dict objects maintain their order, and I think it'd be swell if sets maintained their order too. I suspect the performance hit wouldn't affect me in any meaningful way, and I could benefit from the order-maintaining semantics. I bet it'd have other benefits too, like making regression tests more stable. And my (admittedly uninformed!) assumption is, the loss of performance would mostly be in these sophisticated operations that I never seem to use anyway.
But I don't mistake my needs for the needs of the Python community at large. I'd be mighty interested to read other folks coming forward and saying, for example, "please don't make set objects any slower!" and talking us through neat real-world use cases. Bonus points if you include mind-boggling numbers and benchmarks!
On 12/20/19 5:09 AM, Peter Wang wrote:
As Larry pointed out earlier, ordered dicts posed a problem for MicroPython.
Just a minor correction: that wasn't me, that was Petr Viktorin.
Ten days left until we retire 2.7,
/arry
Sorry! A previous attempt to reply got sent before I typed anything :-( Very briefly:
timeit.timeit("set(i for i in range(1000))", number=100_000)
[and other examples using a range of integers]

The collision resolution strategy for sets evolved to be fancier than for dicts, to reduce cache misses. This is important for sets because the _only_ interesting thing about an element wrt a set is whether or not the set contains it. Lookup speed is everything for sets.

If you use a contiguous range of "reasonable" integers for keys, the integer hash function is perfect: there's never a collision. So any such test misses all the work Raymond did to speed set lookups. String keys have sufficiently "random" hashes to reliably create collisions, though. Cache misses can, of course, have massive effects on timing.
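(To make the int part concrete - the int outputs below are deterministic in CPython, while string hashes are randomized per process via PYTHONHASHSEED:)

>>> [hash(i) for i in range(8)]   # hash(i) == i for small ints
[0, 1, 2, 3, 4, 5, 6, 7]
>>> # so a contiguous range of small ints can never collide in the table,
>>> # while string hashes spread over the low bits "at random"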
Add (much faster for dicts):
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.330938750001224 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 5.788865385999088
In the former case you're primarily measuring the time to look up the "add" method of sets (that's more expensive than adding 0 to the set). A better comparison would, e.g., move `add = s.add` to a setup line, and use plain "add(0)" in the loop. That's it!
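(Concretely, the version Tim describes might look like this - a sketch, with no timings shown since they'd vary by machine; the container creation is hoisted into setup too, so the statement times just the insertion itself:)

import timeit

timeit.timeit("add(0)", setup="s = set(); add = s.add", number=100_000_000)
timeit.timeit("d[0] = None", setup="d = {}", number=100_000_000)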
Chris Angelico wrote:
I think this is an artifact of Python not having an empty set literal. [snip] When both use the constructor call or both use a literal, the difference is far smaller. I'd call this one a wash.
Ah, good catch. I hadn't considered that it would make a substantial difference, but that makes sense. Here's an additional comparison between "{}" and "dict()" to confirm it:
timeit.timeit("{}", number=100_000_000) 2.1038335599987477 timeit.timeit("dict()", number=100_000_000) 10.225583500003268
Some more in-depth comparisons might be required for addition of single items to the containers. I might experiment further with this at some point in the next week or so, likely implementing proper tests that omit the time to initialize the container.

On Mon, Dec 23, 2019 at 10:09 PM Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley <aeros167@gmail.com> wrote:
Add (much faster for dicts):
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.330938750001224 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 5.788865385999088
I think this is an artifact of Python not having an empty set literal.
timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.275540543720126 timeit.timeit("d = dict(); d[0] = None", number=100_000_000) 13.044076398015022 timeit.timeit("d = {}; d[0] = None", number=100_000_000) 6.088695731014013 timeit.timeit("s = {1}; s.add(0)", number=100_000_000) 9.260965215042233 timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000) 8.75433829985559
When both use the constructor call or both use a literal, the difference is far smaller. I'd call this one a wash.
ChrisA
Tim Peters wrote:
Sorry! A previous attempt to reply got sent before I typed anything :-(
No worries, I only saw the link in the footer to the PSF CoC, and I was mildly concerned for a moment. My first thought was "Oh no, what did I do?". Thanks for clearing that up (:
The collision resolution strategy for sets evolved to be fancier than for dicts, to reduce cache misses. This is important for sets because the _only_ interesting thing about an element wrt a set is whether or not the set contains it. Lookup speed is everything for sets.
Huh, that's quite interesting. For some reason, I had assumed in the back of my head (without giving it much thought) that the average collision rate would be the same for set items and dict keys. Thanks for the useful information.
If you use a contiguous range of "reasonable" integers for keys, the integer hash function is perfect: there's never a collision. So any such test misses all the work Raymond did to speed set lookups. String keys have sufficiently "random" hashes to reliably create collisions, though. Cache misses can, of course, have massive effects on timing.
Ah, I forgot to consider how the hash function actually works for contiguous integers. A better comparison to demonstrate the collision differences would likely use random strings. Also, I believe the max "reasonable" integer range with no collisions is (-2305843009213693951, 2305843009213693951), as defined in Include/pyhash.h ( https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170a... ).
In the former case you're primarily measuring the time to look up the "add" method of sets (that's more expensive than adding 0 to the set). A better comparison would, e.g., move `add = s.add` to a setup line, and use plain "add(0)" in the loop.
Good point; there's also what Chris Angelico pointed out: dicts have a significant advantage in this case from having a literal for initialization. From my understanding, this ends up having a minimal performance impact in most realistic code (similar to method lookup time), but it makes a very substantial difference here since initialization of the containers takes up a significant portion of each iteration. I suspect my initialization comparison might also be flawed for similar reasons, but I'm glad that at least 3/5 of my comparisons were mostly reasonable.

So far, the performance difference between set.update() and dict.update() was the most notable. I'll likely spend some more time experimenting in the next couple of weeks or so, and report back if I find any interesting results. In the meantime, anyone can certainly feel free to improve upon my existing comparisons.
[Kyle]
... For some reason, I had assumed in the back of my head (without giving it much thought) that the average collision rate would be the same for set items and dict keys. Thanks for the useful information.
I know the theoretical number of probes for dicts, but not for sets anymore. The latter use a mix of probe strategies now, "randomish" jumps (same as for dicts) but also purely linear ("up by 1") probing to try to exploit L1 cache. It's not _apparent_ to me that the mix actually helps ;-) I just trust that Raymond timed stuff until he was sure it does.
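(A rough model of that mix, from a reading of CPython 3.8's Objects/setobject.c - the constants and the exact update rule here are from memory, so treat this as a sketch of the idea rather than the real code:)

from itertools import islice

def set_probe_indices(h, mask, linear_probes=9, perturb_shift=5):
    # Model of the set probe order: scan a short linear run (cheap, since
    # adjacent slots share cache lines), then take a perturbed "randomish"
    # jump like dicts do, mixing in more hash bits each time.
    i = h & mask
    perturb = h
    while True:
        yield i
        if i + linear_probes <= mask:   # linear run fits before table end
            for j in range(1, linear_probes + 1):
                yield i + j
        perturb >>= perturb_shift
        i = (i * 5 + 1 + perturb) & mask

# first 15 probe slots for hash 12345 in a 32-slot table:
print(list(islice(set_probe_indices(12345, 31), 15)))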
... Ah, I forgot to consider how the hash function actually works for continuous integers. A better comparison to demonstrate the collision differences would likely use random strings.
And, at an extreme, a class with a __hash__ that always returns the same integer.
Also, I believe that max "reasonable" integer range of no collision is (-2305843009213693951, 2305843009213693951), ...
Any range that does _not_ contain both -2 and -1 (-1 is an annoying special case, with hash(-1) == hash(-2) == -2), and spans no more than sys.hash_info.modulus integers. Apart from that, the sign and magnitude of the start of the range don't matter; e.g.,
>>> len(set(hash(i) for i in range(10**5000, 10**5000 + 1000000)))
1000000
>>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 1000000)))
1000000
Sorry again! That only shows that the hash codes are unique. Dicts and sets use only the last N bits to determine the start of the probe sequence, for some value of N >= 3 (depending on the table size). So for a table of size a million, the last 20 bits (1000000 ~= 2**20) are interesting. But same thing:
>>> MASK = (1 << 20) - 1
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 1000000)))
1000000
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 1000000)))
1000000
On 12/23/19 8:09 PM, Kyle Stanley wrote:
Chris Angelico wrote:
I think this is an artifact of Python not having an empty set literal. [snip] When both use the constructor call or both use a literal, the difference is far smaller. I'd call this one a wash.
Ah, good catch. I hadn't considered that it would make a substantial difference, but that makes sense. Here's an additional comparison between "{}" and "dict()" to confirm it:
timeit.timeit("{}", number=100_000_000) 2.1038335599987477 timeit.timeit("dict()", number=100_000_000) 10.225583500003268
We could also go the other way with it, set / dict-map-to-dontcare with one element, because that way we /can/ have a set literal:

>>> timeit.timeit("set()", number=100_000_000)
8.579900023061782
>>> timeit.timeit("dict()", number=100_000_000)
10.473437276901677
>>> timeit.timeit("{0}", number=100_000_000)
5.969995185965672
>>> timeit.timeit("{0:0}", number=100_000_000)
6.24465325800702

(I ran all four of those just so you can see a relative sense on my laptop, which I guess is only sliiiightly slower than Kyle's.)

Happy holidays,

//arry/
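(The constructor-vs-literal gap is easy to see in the bytecode - this is CPython 3.8; opcode names differ in newer versions:)

import dis

dis.dis("{}")       # a single BUILD_MAP instruction: no name lookup, no call
dis.dis("dict()")   # LOAD_NAME 'dict' + CALL_FUNCTION 0: a global name
                    # lookup plus a function call on every iteration
dis.dis("{0}")      # BUILD_SET, the set literal
dis.dis("set()")    # LOAD_NAME 'set' + CALL_FUNCTION 0 again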
A table with these columns might be helpful for performance comparisons:

[
  structure,
  operation,
  big_o,
  keyset,         # signed ints, randstrs
  timeit_output,
  system_load,
  architecture,   # x86_64, aarch64
  os,
]

Does anyone have a recommendation for a tool that stores such benchmark data in JSON and merges the e.g. per-arch output?

Are there URIs for Big-O notation that could be added as e.g. structured attributes in docstrings or annotations?

I just added a ._repr_html_() method to the tabulate package so that, in addition to the many excellent ASCII formats tabulate prints, tabulate.tabulate(data, tablefmt='html') displays an html-escaped table in Jupyter notebooks. It's not yet released, so you'd need to `pip install -e git+https://github.com/astanin/python-tabulate@master#egg=tabulate`.
[Tim]
I know the theoretical number of probes for dicts, but not for sets anymore. The latter use a mix of probe strategies now, "randomish" jumps (same as for dicts) but also purely linear ("up by 1") probing to try to exploit L1 cache.
It's not _apparent_ to me that the mix actually helps ;-) I just trust that Raymond timed stuff until he was sure it does.
To illustrate, I contrived a case where the set optimizations clearly pay off. Output (3.8.1, Win 10 Pro, quad core box with plenty of RAM):

11184810 nice keys
dict build            0.54
dict iter             0.07
dict lookup           0.81
set build             0.54
set iter              0.08
set lookup            0.82

11184810 nasty keys
dict build           23.32
dict iter             0.09
dict lookup          13.26
set build             9.79
set iter              0.28
set lookup            8.46

I didn't make heroic efforts to keep the machine quiet, and there's substantial variation across runs. For example, there's no real difference between "0.07" and "0.08". Some things to note:

- The "nice" keys act much the same regardless of whether dict or set. Dicts have an advantage for iteration "in theory" in the absence of deletions because there are no "holes" in the area storing the hash+key+value records. But for these specific keys, the same is true of sets: the initial hash probes are at indices 0, 1, 2, ..., len(the_set)-1, and there are no holes in its hash+key records either (all the empty slots are together "at the right end").

- Sets get a lot of value out of their fancier probe sequence for the nasty keys. There are lots of collisions, and sets can much more often resolve them from L1 cache. Better than a factor of 2 for build time is nothing to sneeze at.

- Iteration for sets is substantially slower in this case, for two reasons: (1) sets _do_ have holes for this collection of keys; and, (2) while I don't think it's documented, the maximum load factor for sets is lower than for dicts, and 11184810 was picked to give the maximum load factor for a dict with 2**24 slots. The set in this case has twice as many slots as the dict, and all the extras are empty slots that iteration has to skip over.

- I don't have a theory for why dict build time is _so_ much higher than dict lookup time for the nasty keys.

- If you don't know a lot about how these things are implemented, just about everything is baffling ;-) Integer keys are important in practice, but things about how modern dicts are implemented were _driven_ by opportunities to exploit properties of the mostly-trivial int hash function. For "general" timing, it's better to use string keys, whose hash codes are "pretty randomish" (but can vary across runs). When using int keys, it's really easy to _end up_ measuring happy behaviors _specific_ to int keys.

Code:

from time import perf_counter as now
import sys
from collections import deque

def disp(text, f):
    print(f"{text:20s} {f:5.2f}")

M = sys.hash_info.modulus
targetlen = 2**24 * 2 // 3  # highest load factor for dict
hc = 1
badkeys = []
while len(badkeys) < targetlen:
    # add no more than 100 ints with hash code `hc`
    toadd = min(100, targetlen - len(badkeys))
    badkeys.extend(hc + i * M for i in range(toadd))
    hc += 713  # more than less arbitrary
goodkeys = list(range(targetlen))

for tag, keys in ("nice", goodkeys), ("nasty", badkeys):
    print()
    print(len(keys), tag, "keys")
    for thetype, builder in ("dict", dict.fromkeys), ("set", set):
        s = now()
        d = builder(keys)
        f = now()
        disp(thetype + " build", f - s)

        s = now()
        deque(d, maxlen=0)
        f = now()
        disp(thetype + " iter", f - s)

        s = now()
        for k in keys:
            assert k in d
        f = now()
        disp(thetype + " lookup", f - s)
        del d
On 26.12.19 00:56, Tim Peters wrote:
- I don't have a theory for why dict build time is _so_ much higher than dict lookup time for the nasty keys.
1/2 of the items were copied to a new hashtable when the dict grew; 1/4 of the items were copied twice; 1/8 of the items were copied three times; etc. On average every item is copied once (N/2 + N/4 + N/8 + ... = N total copies for N items), so the build time is about twice the lookup time when the cost of lookup dominates. But the lookup benchmark includes the overhead of iterating in a Python loop, which is more significant for "good" keys; thus the lookup time is larger than the build time.
[Tim]
- I don't have a theory for why dict build time is _so_ much higher than dict lookup time for the nasty keys.
To be clearer, in context this was meant to be _compared to_ the situation for sets. These were the numbers:

11184810 nasty keys
dict build           23.32
dict lookup          13.26
set build             9.79
set lookup            8.46

Build time is higher than lookup time for both, which isn't surprising. But it's _much_ higher (whether in absolute or relative terms) for dicts than for sets.

[Serhiy Storchaka <storchaka@gmail.com>]
1/2 of items were copied to a new hashtable when the dict grew up. 1/4 of items were copied twice. 1/8 of items were copied three times, etc. In average every item is copied 1 time, so the build time is twice the lookup time when the cost of lookup is dominated.
I agree the average total number of table insertions per element approaches 2 either way (counting insertions due to internal table growth as well as insertions visible from Python code). That's why it's expected that build time exceeds lookup time (the latter does exactly one lookup per element).
But the lookup benchmarking includes the overhead of iterating Python loop, which is more significant for "good" keys, thus the lookup time is larger than the build time.
Iteration time is too minor to care much about. Even in the worst case (sets with the nasty keys), the Python:

for k in d:
    pass

takes under a second here. Take a full second off the "set lookup" time, and dict build time still exceeds dict lookup time (whether in absolute or relative terms) much more so than for sets.

But I'm not going to pursue it. The high-order bit was just to demonstrate that the set object's more complicated (than dicts') probe sequence does buy highly significant speedups - in at least one highly contrived case. Since the probe sequence changes were aimed at reducing cache misses, a full account requires comparing cache misses between the dict and set implementations. That's a mess, and slogging through it appears unlikely to result in insight.
On Tue, 24 Dec 2019 12:08:33 +0900 "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
David Mertz writes:
Even though I was the first person in this thread to suggest collections.OrderedSet, I'm "meh" about it now. As I read more and played with the sortedcollections package, it seemed to me that while I might want a set that iterated in a determinate and meaningful order moderately often, insertion order would make up a small share of those use cases.
On the other hand, insertion order is one of the most prominent of the determinate meaningful orders where you would have to do ugly things to use "sorted" to get that order. Any application where you have an unreliable message bus feeding a queue (so that you might get duplicate objects but it's bad to process the same object twice) would be a potential application of insertion-ordered sets.
In that case you probably want a separate persistent "seen" set, because your queue may already have been drained by the time a duplicate object arrives. (Which means you probably want something more efficient, such as a sequence number.) Regards Antoine.
[Nick Coghlan <ncoghlan@gmail.com>]
I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built in sets aren't good), but combined with low cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough ...
Is it? I confess I haven't thought of a plausible use case. Larry didn't really explain his problem, just suggested that an ordered set would be "a solution" to it. The problem: whether there are duplicates "in the queue" is a question about an implementation detail, hard for me to translate to a question about the _task_ to be solved.

For example, suppose "the task" is to make a deep copy of a web site. A "job" consists of following a URL, sucking down the page text, and adding new jobs for contained URLs on the page. We probably don't want to suck down the page text multiple times for a given URL, but checking whether a URL is currently already in the job queue is asking a question about an implementation detail that misses the point: we want to know whether that URL has already been chased, period. Whether the URL is currently in the queue is irrelevant to that. The only logical relationship is that a job that has finished _was_ in the queue at some time before the job finished.

So, ya, I've seen and implemented lots of work queues along these lines - but an OrderedSet would be an "attractive nuisance" (offhandedly appearing to solve a problem it doesn't actually address):

jobs = some_kind_of_queue()
finished_jobs = set()
...
while jobs:
    job = jobs.get()
    if job in finished_jobs:
        continue
    try:
        work on the job possibly adding (sub)jobs to the `jobs` queue
    except TransientExceptions:
        jobs.put(job)  # try again later
    else:
        finished_jobs.add(job)

Clear? There is a queue here, and a set, but they can't be combined in a useful way: the set contents have nothing to do with what's currently in the queue - the set consists of jobs that have been successfully completed; the queue doesn't care whether it contains duplicates, and merely skips over jobs that have been completed by the time the queue spits them out (which strikes me as more robust design than duplicating logic everywhere a job is added to a queue to try to prevent adding already-done jobs to begin with).

Similarly, in other places I've used a "finished" flag on the object instead of a set, or a dict mapping jobs to info about job status and results. But in all these cases, the vital info associated with a job really has little to do with whether the job is currently sitting on a queue.

If "is it currently on the queue?" _is_ a common question to ask, perhaps someone could give a semi-realistic example?
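(Filled out into something runnable - the job-spawning function, the exception name, and the deque are all inventions here, just to make the shape of Tim's sketch concrete:)

from collections import deque

class TransientError(Exception):
    pass

def run_jobs(initial_jobs, work):
    # `work(job)` performs the job and returns an iterable of new jobs;
    # it may raise TransientError to ask for a retry later.
    jobs = deque(initial_jobs)
    finished_jobs = set()
    while jobs:
        job = jobs.popleft()
        if job in finished_jobs:
            continue
        try:
            jobs.extend(work(job))
        except TransientError:
            jobs.append(job)          # try again later
        else:
            finished_jobs.add(job)
    return finished_jobs

# toy run: each int job n spawns n // 2 until it hits 0
print(sorted(run_jobs([8, 3], lambda n: [n // 2] if n else [])))
# -> [0, 1, 2, 3, 4, 8]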
On Dec 27, 2019, at 19:48, Tim Peters <tim.peters@gmail.com> wrote:
So, ya, I've seen and implemented lots of work queues along these lines - but an OrderedSet would be an "attractive nuisance" (offhandedly appearing to solve a problem it doesn't actually address):
jobs = some_kind_of_queue()
finished_jobs = set()
...
while jobs:
    job = jobs.get()
    if job in finished_jobs:
        continue
    try:
        work on the job possibly adding (sub)jobs to the `jobs` queue
    except TransientExceptions:
        jobs.put(job)  # try again later
    else:
        finished_jobs.add(job)
Well, if an OrderedSet were designed to gracefully handle resizes during iteration, something like this may make sense:

jobs = OrderedSet(initial_jobs)
for job in jobs:
    new_jobs = process(job)
    jobs |= new_jobs
...
# jobs is now a set of every job processed

A dictionary with None values comes close if you replace the union line with a jobs.update(new_jobs) call (and ignore resizing issues), but it breaks because repeated jobs are shuffled to the end of the sequence and would be processed again.

Brandt
On 12/27/19 7:44 PM, Tim Peters wrote:
[Nick Coghlan <ncoghlan@gmail.com>]
I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built in sets aren't good), but combined with low cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough ...

Is it? I confess I haven't thought of a plausible use case. Larry didn't really explain his problem, just suggested that an ordered set would be "a solution" to it.
Here is the original description of my problem, from the original email in this thread. I considered this an adequate explanation of my problem at the time. On 12/15/19 6:48 PM, Larry Hastings wrote:
I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready". The existing set object would work fine here, except that it doesn't maintain insertion order. That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult. I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.
In subsequent emails I've clarified that my workloads are small enough--and computers are fast enough--that almost any data structure would work fine for me here, e.g. a list. I don't think my needs should drive the decision making process regardless. I only described my problem to get the conversation rolling. I opine that anybody who iterates over set objects and has bugs in their code would appreciate set objects maintaining insertion order. I suspect it would make their debugging easier, because given identical inputs their set iteration would behave identically, thus making their bugs that much more stable. That's as much use case as I have for the feature. //arry/
[Larry]
Here is the original description of my problem, from the original email in this thread. I considered this an adequate explanation of my problem at the time.
I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready".
Which is a description not of "the problem", but of the operations you want to perform. It's the first time in my life I've heard of a "ready list of jobs" where the programmer sometimes decides later "oh, no - this one and that one aren't ready after all" ;-) Not that it matters much - you want the operations you want. The lack of "oh - of course - that's a problem we all face at times" motivation, though, works against it being _compelling_ on its own.
... In subsequent emails I've clarified that my workloads are small enough--and computers are fast enough--that almost any data structure would work fine for me here, e.g. a list.
I don't think my needs should drive the decision making process regardless. I only described my problem to get the conversation rolling.
Which it did :-) Others went on to, explicitly or implicitly, suggest that an ordered set must also, e.g., support the entire MutableSequence interface, and even define what happens if you mutate the ordered set _while_ iterating over it (lists define the results in such cases, but ordered dicts raise an exception).
I opine that anybody who iterates over set objects and has bugs in their code would appreciate set objects maintaining insertion order. I suspect it would make their debugging easier, because given identical inputs their set iteration would behave identically, thus making their bugs that much more stable. That's as much use case as I have for the feature.
That's much stronger to me than some weird FIFO-only queue supporting fast deletion of interior elements (which - sorry - I've never had a use for). And fine by me if such a thing is added. All else being equal, I'd prefer that the builtin set type be ordered, simply because it's less surprising, and dicts are ordered now (although, as Inada Naoki said, their current implementation isn't suitable for a FIFO queue, due primarily to the O(N) time needed to find "the first" key+value record).

I believe we agree that adding _some_ flavor of OrderedSet to `collections` would be a fine start. I'm just not cool with replacing the builtin set before there's an implementation fast and memory-efficient enough that _current_ heavy set users won't feel betrayed by major slowdowns or memory bloat when the switch is made. Toward that end, my opinion - which I won't defend - is that OrderedSet should not promise better than O(N)-time indexing (if it supports indexing at all), and should - like current sets and dicts - try to raise an exception if it can detect that the container has been mutated during iteration.
Hi Larry,

Sets (and mappings, i.e. dicts) have common functionality across many languages and libraries, and that common functionality does not include ordering. Already, in CPython, there is a need to somehow indicate that insertion ordering is being used in dicts, or to use OrderedDict. I am quite happy with keeping sets as they are and coding any extra functionality required on top of this. This aids me as I work in many languages that have their own sets and mappings, usually without any insertion ordering.

Cheers, Paddy.
On 12/28/19 6:24 PM, Tim Peters wrote:
[Larry]
Here is the original description of my problem, from the original email in this thread. I considered this an adequate explanation of my problem at the time.
I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready".

Which is a description not of "the problem", but of the operations you want to perform. It's the first time in my life I've heard of a "ready list of jobs" where the programmer sometimes decides later "oh, no - this one and that one aren't ready after all" ;-)
It's a lightweight abstract dependency graph. Its nodes are opaque, only required to be hashable. And it doesn't require that you give it all the nodes in strict dependency order.

When you add a node, you can also optionally specify dependencies, and those dependencies aren't required to be nodes that the graph has previously seen. So you can add a node A that depends on B and C, without showing it B or C first. When that happens it creates placeholder nodes for B and C, and naturally these nodes have no dependencies. You can then later inform the graph that node B has dependencies on X, Y, and Z.

(My specific use case is a package manager. Packages are identified with unique strings. So the nodes I give the graph are simply those package names. It's pretty common for me to get a package with dependencies on packages I haven't seen yet. Designing the graph to create placeholders let me skip making two passes over the package list, pre-creating the package objects, etc etc. This made the code simpler and presumably faster.)

The graph API maintains an externally-visible "ready" iterable of nodes. And since B can go from ready to not-ready, it can get added to "ready" and subsequently removed. Also, when a node is satisfied, you simply remove it from the graph. If A depends on B and C, and they all represent jobs, and B and C have no dependencies, they'll be in the "ready" iterable. You iterate over "ready" and execute those jobs, and assuming they are successful you then remove them from the graph. When A's dependencies are all satisfied, it'll get added to the "ready" iterable. And naturally when B and C were removed from the graph, they were removed from the "ready" iterable too.

Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and c) support fast removal. I add and remove nodes from that set a lot, which I realized would perturb the iteration order from run to run, etc etc etc, and here we are.

I grant you maybe it's a weird approach. But after two false starts, where I was starting to boil the oceans with ABCs and "node is satisfied" APIs and separate iterator objects, I had a re-think and hit on this super lightweight design. I'm actually quite pleased with it--it's short, it has a minimal API, it's readable, it was easy to get right, and it solves my problem.

Happy new year,

//arry/
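(Larry's description translates into code along these lines - every name below is invented for illustration, a guess at the shape rather than his actual implementation. The "ready" collection is the dict-with-None-values he mentioned at the start of the thread:)

class DepGraph:
    def __init__(self):
        self._deps = {}    # node -> set of nodes it still waits on
        self._rdeps = {}   # node -> set of nodes waiting on it
        self.ready = {}    # insertion-ordered "set": node -> None

    def add(self, node, dependencies=()):
        if node not in self._deps:
            self._deps[node] = set()
            self.ready[node] = None
        for dep in dependencies:
            if dep not in self._deps:      # placeholder for an unseen node
                self._deps[dep] = set()
                self.ready[dep] = None     # no known deps, so it's ready
            self._deps[node].add(dep)
            self._rdeps.setdefault(dep, set()).add(node)
        if self._deps[node]:
            self.ready.pop(node, None)     # ready -> not-ready

    def remove(self, node):
        # Mark `node` satisfied: drop it and promote whatever it blocked.
        self.ready.pop(node, None)
        del self._deps[node]
        for waiter in self._rdeps.pop(node, ()):
            if waiter in self._deps:
                self._deps[waiter].discard(node)
                if not self._deps[waiter]:
                    self.ready[waiter] = None

g = DepGraph()
g.add('A', ['B', 'C'])
print(list(g.ready))   # ['B', 'C']
g.add('B', ['X'])
print(list(g.ready))   # ['C', 'X'] - B went from ready back to not-ready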
[Larry]
It's a lightweight abstract dependency graph. Its nodes are opaque, only required to be hashable. And it doesn't require that you give it all the nodes in strict dependency order.
When you add a node, you can also optionally specify dependencies, and those dependencies aren't required to be nodes that the graph has previously seen. So you can add a node A that depends on B and C, without showing it B or C first. When that happens it creates placeholder nodes for B and C, and naturally these nodes have no dependencies. You can then later inform the graph that node B has dependencies on X Y and Z.
(My specific use case is a package manager. Packages are identified with unique strings. So the nodes I give the graph are simply those package names. It's pretty common for me to get a package with dependencies on packages I haven't seen yet. Designing the graph to create placeholders let me skip making two passes over the package list, pre-creating the package objects, etc etc. This made the code simpler and presumably faster.)
The graph API maintains an externally-visible "ready" iterable of nodes. And since B can go from ready to not-ready, it can get added to "ready" and subsequently removed.
Also, when a node is satisfied, you simply remove it from the graph. If A depends on B and C, and they all represent jobs, and B and C have no dependencies, they'll be in the "ready" iterable. You iterate over "ready", and execute those jobs, and assuming they are successful you then remove them from the graph. When A's dependencies are all satisfied, it'll get added to the "ready" iterable. And naturally when B and C were removed from the graph, they were removed from the "ready" iterable too.
OK! You're doing a topological sort. There are natural & simple ways to do that right now that scale efficiently to very large graphs (time & space linear in the sum of the number of nodes and the number of dependencies). Curiously, the difficulties are mostly driven by the quality of _error_ messages you want (in case of, e.g., cycles in the dependency graph).

Lots of those implementations became deterministic "by magic" when ordered dicts were introduced. This is why: a bare bones implementation needs to associate two pieces of info with each node: a list of its immediate successors, and an integer count of the number of its immediate predecessors. A dict is the natural way to implement that association.

When all the dependency info has been entered, then:

- First all nodes are emitted whose predecessor count is 0. Iterating over the association dict once is the natural way to find them, and that order is defined now.

- Then, as each emitted node N is marked done:

    for child in N.successors:
        assert child.predcount > 0
        child.predcount -= 1
        if child.predcount == 0:
            emit(child)

That's about all there is to it. But there's no end to the cruft that _can_ be added to, e.g., verify that a node is marked done no more than once, or to compute & display strongly connected components if not all nodes are emitted, etc.

Ordered sets could improve this "natural" implementation if the successor lists were successor sets instead, simply because then - like lists - their iteration order would be defined, which can in turn affect the order in which nodes are emitted in the loop just above. But lists are adequate to the task, are cheaper in both time and space, and can't _not_ be ordered ;-)
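(The whole scheme fits in a few lines - a sketch of the bare-bones version Tim describes, with a successor dict and predecessor counts; the function name and input format are inventions here, and the "mark done" refinement is dropped, so nodes are emitted as soon as they're unblocked:)

from collections import deque

def topsort(prereqs_of):
    # prereqs_of: dict mapping node -> iterable of its prerequisites.
    # Yields every node after all of its prerequisites; deterministic
    # because dicts preserve insertion order.
    successors = {}
    predcount = {}
    for node, prereqs in prereqs_of.items():
        predcount.setdefault(node, 0)
        for p in prereqs:
            predcount.setdefault(p, 0)
            successors.setdefault(p, []).append(node)
            predcount[node] += 1
    emit = deque(n for n, c in predcount.items() if c == 0)
    emitted = 0
    while emit:
        node = emit.popleft()
        yield node
        emitted += 1
        for child in successors.get(node, ()):
            predcount[child] -= 1
            if predcount[child] == 0:
                emit.append(child)
    if emitted != len(predcount):
        raise ValueError("cycle in the dependency graph")

print(list(topsort({'A': ['B', 'C'], 'B': ['X'], 'C': [], 'X': []})))
# -> ['C', 'X', 'B', 'A']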
Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and c) support fast removal. I add and remove nodes from that set a lot, which I realized would perturb the iteration order from run to run, etc etc etc, and here we are.
The sketch above (which is a bog-common way to implement topsorts) doesn't build a data structure recording the final order, and, in fact, never deletes anything. You might protest that the initial iteration step (scanning the association dict to find nodes with no predecessors) is an expense you skip because nodes with predecessors are deleted from your "ready" structure all along. But nodes with predecessors are _touched_ either way, and merely skipping over them is bound to be cheaper than deleting them. After that initial step, no search of any kind is needed again.
I grant you maybe it's a weird approach. But after two false starts, where I was starting to boil the oceans with ABCs and "node is satisfied" APIs and separate iterator objects, I had a re-think and hit on this super lightweight design. I'm actually quite pleased with it--it's short, it has a minimal API, it's readable, it was easy to get right, and it solves my problem.
Whereas I would have started with a topsort and finished it while I was sleeping ;-) Seriously, I've written such a thing many times, but never reuse the code because it's so easy to redo from scratch. Every application has _some_ unique twist, which makes a general-purpose API so bloated it's harder to figure out how to use than to write custom code.

In your application, I'm guessing (but don't know) that when a package name is emitted, it's possible you're not able to find the package, and so the process has to stop. That's why I inserted an "as each emitted node N is marked done" step in my description. That is, I'm picturing an API that adds a (say)

    topsorter.record_succeeded(node)

method to say that - yes - the node was processed successfully. Else, by default, its successors (if any) must not be emitted. But if that isn't needed, the whole topsort order can be materialized into (say) a list in one gulp, or delivered by a generator.
Happy new year,
And to you too! :-)
Due to version and platform constraints, a SAT solver is necessary for conda and (now) pip. Libsolv is one of the fastest around. https://github.com/QuantStack/mamba is conda implemented with libsolv (and parallel downloading of *declarative* dependency metadata).

For general purpose graphs with implicit node instantiation from edge declaration, NetworkX has implementations of many graph algorithms. https://networkx.github.io/documentation/stable/reference/algorithms/generat...

CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from Python) with a "NetworkX-like API" but doesn't yet have a topological sort (though it does have BFS). https://github.com/rapidsai/cugraph

"pip needs a dependency resolver" + End (Fn+Right) links to the latest work on the new pip code (that must require declarative dependency metadata): https://github.com/pypa/pip/issues/988

That said, this implementation of topo sort could have a deterministic output given an OrderedSet: https://rosettacode.org/wiki/Topological_sort#Python

A table including Big-O and memory usage for the necessary data structure and methods would be useful.
A set was requested with preserved insertion order. Mathematically, such a structure relates more to an Oset (having order) than a Set.

Each of the mentioned structure types has a dedicated role. Python's sets work well and correlate with existing math principles. I would be more easily convinced to introduce a new collection to Python than to alter existing sets.

-1: preserving insertion order in sets
 0: adding `OrderedSet` (where `dict.fromkeys()` does not suffice, i.e. needing set operations)
+1: implementing an analogous abstract collection, e.g. "`HashList`" -> Oset, just as `Counter` -> Mset
Due to version and platform constraints, a SAT solver is necessary for conda and (now) pip. Libsolv is one of the fastest around.
https://github.com/QuantStack/mamba is conda implemented with libsolv (and parallel downloading of *declarative* dependency metadata).
For general purpose graphs with implicit node instantation from edge declaration, NetworkX has implementations of many graph algorithms.
https://networkx.github.io/documentation/stable/reference/algorithms/generat...
CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from Python) with a "NetworkX-like API" but doesn't yet have a topological sort (though it does have BFS) https://github.com/rapidsai/cugraph
"pip needs a dependency resolver" + End (Fn+Right) links to the latest work on the new pip code (that must require declarative dependency metadata) https://github.com/pypa/pip/issues/988
That said, this implementation of topo sort could have a deterministic output given an OrderedSet: https://rosettacode.org/wiki/Topological_sort#Python
A table including Big-O and memory usage for the necessary data structure and methods would be useful.
On Sun, Dec 29, 2019, 5:12 PM Tim Peters <tim.peters@gmail.com> wrote:
[Larry]
It's a lightweight abstract dependency graph. Its nodes are opaque, only required to be hashable. And it doesn't require that you give it all the nodes in strict dependency order.
When you add a node, you can also optionally specify dependencies, and those dependencies aren't required to be nodes that the graph has previously seen. So you can add a node A that depends on B and C, without showing it B or C first. When that happens it creates placeholder nodes for B and C, and naturally these nodes have no dependencies. You can then later inform the graph that node B has dependencies on X Y and Z.
(My specific use case is a package manager. Packages are identified with unique strings. So the nodes I give the give the graph are simply those package names. It's pretty common for me to get a package with dependencies on packages I haven't seen yet. Designing the graph to create placeholders let me skip making two passes over the package list, pre-creating the package objects, etc etc. This made the code simpler and presumably faster.)
The graph API maintains an externally-visible "ready" iterable of nodes. And since B can go from ready to not-ready, it can get added to "ready" and subsequently removed.
Also, when a node is satisfied, you simply remove it from the graph. If A depends on B and C, and they all represent jobs, and B and C have no dependencies, they'll be in the "ready" iterable. You iterate over "ready", and execute those jobs, and assuming they are successful you then remove them from the graph. When A's dependencies are all satisfied, it'll get added to the "ready" iterable. And naturally when B and C were removed from the graph, they were removed from the "ready" iterable too.
OK! You're doing a topological sort. There are natural & simple ways to do that right now that scale efficiently to very large graphs (time & space linear in the sum of the number of nodes and the number of dependencies). Curiously, the difficulties are mostly driven by the quality of _error_ messages you want (in case of, e.g., cycles in the dependency graph).
Lots of those implementations became deterministic "by magic" when ordered dicts were introduced. This is why: a bare bones implementation needs to associate two pieces of info with each node: a list of its immediate successors, and an integer count of the number of its immediate predecessors. A dict is the natural way to implement that association.
When all the dependency info has been entered, then:
- First all nodes are emitted whose predecessor count is 0. Iterating over the association dict once is the natural way to find them, and that order is defined now.
- Then, as each emitted node N is marked done:

      for child in N.successors:
          assert child.predcount > 0
          child.predcount -= 1
          if child.predcount == 0:
              emit(child)
That's about all there is to it. But there's no end to the cruft that _can_ be added to, e.g., verify that a node is marked done no more than once, or to compute & display strongly connected components if not all nodes are emitted, etc.
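Spelled out as runnable code (the dict-of-successor-lists shape is an assumption for illustration; every node must appear as a key):

    def topsort(graph):
        # `graph` maps each node to a list of its immediate successors.
        # Associate each node with its immediate-predecessor count.
        predcount = {node: 0 for node in graph}
        for successors in graph.values():
            for child in successors:
                predcount[child] += 1

        # First, emit all nodes with no predecessors, in dict order --
        # this is where insertion-ordered dicts make runs deterministic.
        order = [node for node, count in predcount.items() if count == 0]

        # Then, as each emitted node is marked done, decrement its
        # successors' counts, emitting any that drop to zero.
        for node in order:                 # `order` grows as we append
            for child in graph[node]:
                assert predcount[child] > 0
                predcount[child] -= 1
                if predcount[child] == 0:
                    order.append(child)

        if len(order) < len(graph):
            raise ValueError("cycle among the unemitted nodes")
        return order

For example:

    topsort({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []})
    # ['A', 'B', 'C', 'D']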
Ordered sets could improve this "natural" implementation if the successor lists were successor sets instead, simply because then - like lists - their iteration order would be defined, which can in turn affect the order in which nodes are emitted in the loop just above. But lists are adequate to the task, are cheaper in both time and space, and can't _not_ be ordered ;-)
Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and c) support fast removal. I add and remove nodes from that set a lot, which I realized would perturb the iteration order from run to run, etc etc etc, and here we are.
The sketch above (which is a bog-common way to implement topsorts) doesn't build a data structure recording the final order, and, in fact, never deletes anything. You might protest that the initial iteration step (scanning the association dict to find nodes with no predecessors) is an expense you skip because nodes with predecessors are deleted from your "ready" structure all along. But nodes with predecessors are _touched_ either way, and merely skipping over them is bound to be cheaper than deleting them.
After that initial step, no search of any kind is needed again.
I grant you maybe it's a weird approach. But after two false starts, where I was starting to boil the oceans with ABCs and "node is satisfied" APIs and separate iterator objects, I had a re-think and hit on this super lightweight design. I'm actually quite pleased with it--it's short, it has a minimal API, it's readable, it was easy to get right, and it solves my problem.
Whereas I would have started with a topsort and finished it while I was sleeping ;-) Seriously, I've written such a thing many times, but never reuse the code because it's so easy to redo from scratch. Every application has _some_ unique twist, which makes a general-purpose API so bloated it's harder to figure out how to use than to write custom code.
In your application, I'm guessing (but don't know) that when a package name is emitted, it's possible you're not able to find the package, and so the process has to stop. That's why I inserted an "as each emitted node N is marked done" step into my description. That is, I'm picturing an API that adds a (say)
topsorter.record_succeeded(node)
method to say that - yes - the node was processed successfully. Else, by default, its successors (if any) must not be emitted.
But if that isn't needed, the whole topsort order can be materialized into (say) a list in one gulp, or delivered by a generator.
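A sketch of one way such an API could look -- `TopSorter`, `get_ready`, and `record_succeeded` are hypothetical names, not an existing interface:

    class TopSorter:
        def __init__(self, graph):
            # `graph` maps node -> list of immediate successors;
            # every node is assumed to appear as a key.
            self._graph = graph
            self._predcount = {n: 0 for n in graph}
            for succs in graph.values():
                for child in succs:
                    self._predcount[child] += 1
            # Nodes with no predecessors are ready immediately.
            self._ready = [n for n, c in self._predcount.items() if c == 0]

        def get_ready(self):
            # Hand out (and clear) the batch of currently-ready nodes.
            batch, self._ready = self._ready, []
            return batch

        def record_succeeded(self, node):
            # Only now may `node`'s successors become emittable; a node
            # that is never marked done silently withholds its successors.
            for child in self._graph[node]:
                self._predcount[child] -= 1
                if self._predcount[child] == 0:
                    self._ready.append(child)

For example:

    ts = TopSorter({"A": ["B"], "B": [], "C": []})
    ts.get_ready()              # ['A', 'C']
    ts.record_succeeded("A")    # suppose "C" failed, so never recorded
    ts.get_ready()              # ['B'] -- "C" had no successors to withhold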
Happy new year,
And to you too! :-)
The only `OrderedSet` use I have seen in the wild is https://github.com/python-hyper/uritemplate/search?q=orderedset&unscoped_q=orderedset .
A more generalized Python code search across GitHub for "orderedset" returns ~500k results: https://github.com/search?l=Python&q=orderedset&type=Code . Of course, a general code search across GitHub is by no means a clear or reliable measure of actual usage, considering the number of duplicate results. But it's still useful for a rough estimate. On Thu, Jan 2, 2020 at 7:08 PM Brett Cannon <brett@python.org> wrote:
The only `OrderedSet` use I have seen in the wild is https://github.com/python-hyper/uritemplate/search?q=orderedset&unscoped_q=orderedset .
On Wed, Jan 8, 2020 at 7:02 AM, Kyle Stanley <aeros167@gmail.com> wrote:
A more generalized Python code search across GitHub of "orderedset" returns ~500k results: https://github.com/search?l=Python&q=orderedset&type=Code .
Sadly, this search mostly seems to count how many projects put their virtual environments on GitHub. Example of a result:

venv/Lib/site-packages/setuptools/_vendor/ordered_set.py

It's a vendored copy of the https://pypi.org/project/ordered-set/ project used by setuptools.

Libraries.io counts 100 repositories and 20 packages which depend on ordered-set: https://libraries.io/pypi/ordered-set/dependents

Victor
participants (26)

- Antoine Pitrou
- Barry Warsaw
- Brandt Bucher
- Brett Cannon
- Chris Angelico
- David Cuthbert
- David Mertz
- Guido van Rossum
- Inada Naoki
- Ivan Levkivskyi
- Kyle Stanley
- Larry Hastings
- Nick Coghlan
- Paddy McCarthy
- Paul Moore
- Peter Wang
- Petr Viktorin
- pylang
- Raymond Hettinger
- Serhiy Storchaka
- Stephen J. Turnbull
- Steve Dower
- Steven D'Aprano
- Tim Peters
- Victor Stinner
- Wes Turner