giving set.add a return value
Hey all,

Often I've found this kind of code:

seen = set()
for i in iterable:
    if i in seen:
        ...  # do something in case of duplicates
    else:
        seen.add(i)
        ...  # do something in case of first visit

This kind of code appears whenever one needs to check for duplicates in a user-submitted iterable, or when we loop over a recursive iteration that may involve cycles (graph search or the like). This code could be improved if one could ensure an item is in the set, and get whether it was there before, in one operation. This may seem overly specific, but dicts do do this:

seen = {}
for i in iterable:
    if seen.set_default(i, some_value) is not None:
        ...  # do something in case of duplicates
    else:
        ...  # do something in case of first visit

I think the set type would benefit greatly from its add method having a return value. set.add would return True if the item was already in the set prior to insertion, and False otherwise.

Looking at the CPython code, set_add_entry already detects existing entries, so adding a return value would require no additional complexity.

Any thoughts?
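For concreteness, here is a minimal sketch of the proposed behaviour, emulated with a subclass (the class name is invented here; this is an illustration, not the proposed builtin change):

class ReportingSet(set):
    """Hypothetical subclass emulating the proposal: add() reports prior membership."""
    def add(self, item):
        was_there = item in self      # average O(1) membership test
        super().add(item)             # no-op if an equal element is already present
        return was_there

seen = ReportingSet()
for i in [3, 2, 4, 2, 3]:
    if seen.add(i):
        print(f'{i} is a duplicate')
    else:
        print(f'{i} seen for the first time')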
Personally, I'd want to see mutator methods return `self` so you can do more than one mutation in a statement, but the convention is that all the mutator methods return `None`.
On 26/06/20 2:45 am, Steele Farnsworth wrote:
Personally, I'd want to see mutator methods return `self` so you can do more than one mutation in a statement, but the convention is that all the mutator methods return `None`.
I would say the convention is that mutator methods don't return 'self'. That doesn't preclude them from returning some other useful value. -- Greg
Previous discussions on this:

https://mail.python.org/archives/list/python-ideas@python.org/thread/ASHOHN3...

https://mail.python.org/archives/list/python-ideas@python.org/thread/K5SS62A... (seems like part of the above discussion that got separated)

https://mail.python.org/archives/list/python-ideas@python.org/thread/CKF2GFI...
On Thu, Jun 25, 2020 at 6:44 PM Alex Hall <alex.mojaki@gmail.com> wrote:
Previous discussions on this:
https://mail.python.org/archives/list/python-ideas@python.org/thread/ASHOHN3...
https://mail.python.org/archives/list/python-ideas@python.org/thread/K5SS62A... (seems like part of the above discussion that got separated)
https://mail.python.org/archives/list/python-ideas@python.org/thread/CKF2GFI...
The latest of these discussions is over 8 years old, and I think it deserves bringing up again.

The convention that "methods that modify self return None" is useful, but I would say that dict's equivalent method, dict.setdefault, is a modifying method that also returns a pertinent (and often useful) value, even though it too could be split into two methods (__getitem__, and __setitem__ if the key is missing), as could dict.pop. (BTW, I think that set.discard should also return a boolean, but that discussion can wait.)

Indeed, adding a return value to set.add would not improve runtime performance, but I believe it would both improve readability and make code more fluent to write. As for teaching, I don't think this additional bit of complexity would be too difficult to learn. "set.add ensures that an item is in the set and returns whether it was already included" seems pretty straightforward to me. Not to mention that anyone coming from C#, Java, C++, and many other languages would already be familiar with this concept.

Overall, I think my gripe is that, currently, set() does not have the capabilities of a dict that maps to NoneType, and that the simplest implementation of set in Python is more powerful than the standard library's version.
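To illustrate that last point, here is a minimal sketch (not from the original mail) of a dict-backed set; because it is built on a dict, its add can report prior membership essentially for free:

class DictSet:
    """Minimal sketch of a set built on a dict that maps elements to None."""
    def __init__(self):
        self._d = {}

    def __contains__(self, item):
        return item in self._d

    def add(self, item):
        was_there = item in self._d   # the dict already knows this
        self._d[item] = None          # an existing equal key is kept, like a real set
        return was_there

seen = DictSet()
print(seen.add(2))     # False: first time
print(seen.add(2.0))   # True: 2.0 == 2, so it was already "in" the set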
My point was only that, as far as I know, all the methods for built in container types that serve only to change what is contained return None, and that this was an intentional design choice, so changing it in one case would have to evoke a larger discussion about what those sorts of methods should return.

I wouldn't be opposed to that discussion happening and for any changes that are made to happen within 3.x because I doubt that very much code that currently exists depends on these methods returning None or even use what they return at all.
On Thu, Jun 25, 2020 at 9:02 AM Steele Farnsworth <swfarnsworth@gmail.com> wrote:
My point was only that, as far as I know, all the methods for built in container types that serve only to change what is contained return None, and that this was an intentional design choice, so changing it in one case would have to evoke a larger discussion about what those sorts of methods should return.
indeed -- and that is pretty darn baked in to Python, so I don't think it's going to change.

Note: I'm not sure your example with setdefault is correct:

seen = {}
for i in iterable:
    if seen.set_default(i, some_value) is not None:
        ...  # do something in case of duplicates
    else:
        ...  # do something in case of first visit

A) you spelled setdefault() wrong -- no underscore

B) seen.setdefault(i, some_value) will return some_value if it's not there, and whatever the value is if it is (in this case, starting with an empty dict, it will be some_value always).

Running this code:

some_value = "sentinel"
iterable = [3, 2, 4, 2, 3]
seen = {}
for i in iterable:
    if seen.setdefault(i, some_value) is not None:
        # do something in case of duplicates
        print(f'{i} was already in there')
    else:
        # do something in case of first visit
        print(f'{i} was not already there')

results in:

In [9]: run in_set.py
3 was already in there
2 was already in there
4 was already in there
2 was already in there
3 was already in there

so not working. But you can make it work if you reset the value:

some_value = "sentinel"
iterable = [3, 2, 4, 2, 3]
seen = {}
for i in iterable:
    if seen.setdefault(i, None) is not None:
        # do something in case of duplicates
        print(f'{i} was already in there')
    else:
        # do something in case of first visit
        print(f'{i} was not already there')
        seen[i] = some_value

In [11]: run in_set.py
3 was not already there
2 was not already there
4 was not already there
2 was already in there
3 was already in there

But this is a bit klunky as well, not really any better than the set version.

However, for the case at hand, adding a method similar to dict.setdefault() would be a reasonable thing to do. I'm not sure what to call it, or what the API should be, but maybe:

class my_set(set):
    def add_if_not_there(self, item):
        if item in self:
            return True
        else:
            self.add(item)
            return False

seen = my_set()
for i in iterable:
    if seen.add_if_not_there(i):
        print(f'{i} was already in there')
    else:
        print(f'{i} was not already there')

However, while dict.setdefault does clean up and clarify otherwise somewhat ugly code, I'm not sure this is that much better than:

for i in iterable:
    if i in seen:
        print(f'{i} was already in there')
    else:
        seen.add(i)
        print(f'{i} was not already there')

But feel free to make the case :-)

Note that setdefault is in the MutableMapping ABC, so there could be some debate about whether to add this new method to the MutableSet ABC.

-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Thu, Jun 25, 2020 at 7:55 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Thu, Jun 25, 2020 at 9:02 AM Steele Farnsworth <swfarnsworth@gmail.com> wrote: indeed -- and that is pretty darn baked in to Python, so I don't think it's going to change.
Except this convention doesn't hold for dict.setvalue (which I did misspell, sorry), or for dict.pop. Both these methods fundamentally mutate the collection, but they also return a value (which could be retrieved by a non-mutating method), that somewhat pertains to the operation performed.
Note: I'm not sure your example with setdefault is correct:
Fair enough, I misspelled setdefault and messed up the example. What I meant was:

seen = {}
for k, v in iterable_of_tuples:
    if seen.setdefault(k, v) is not v:
        ...  # duplicate key
    else:
        ...  # new key
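A runnable version of that pattern (with an invented input) might look like the sketch below; note that it relies on object identity, so it misbehaves if a duplicate key arrives paired with the very same value object:

# hypothetical input: (key, value) pairs
iterable_of_tuples = [("a", [1]), ("b", [2]), ("a", [3])]
seen = {}
for k, v in iterable_of_tuples:
    if seen.setdefault(k, v) is not v:
        print(f'duplicate key {k!r}; first value kept: {seen[k]!r}')
    else:
        print(f'new key {k!r}')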
On Thu, Jun 25, 2020 at 10:10 AM Ben Avrahami <avrahami.ben@gmail.com> wrote:
On Thu, Jun 25, 2020 at 7:55 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Thu, Jun 25, 2020 at 9:02 AM Steele Farnsworth <swfarnsworth@gmail.com> wrote: indeed -- and that is pretty darn baked in to Python, so I don't think it's going to change.
30 years of history beats out 8 years since the last discussion of this. 😉
Except this convention doesn't hold for dict.setvalue (which I did misspell, sorry), or for dict.pop.
I don't know what you mean by setvalue(), but I'm assuming you meant setdefault(). If that's true, then the guideline isn't "all mutating methods must return None", it's "*unless* the method is specifically designed to have a return value based on the purpose of the method, the method should return None".
Both these methods fundamentally mutate the collection, but they also return a value (which could be retrieved by a non-mutating method), that somewhat pertains to the operation performed.
Both pop() and setdefault() are meant to return something, they just also happen to mutate things. You're wanting the reverse and that isn't how Python has chosen to go. Same with people who bring up wanting a fluent API where 'self' is always returned. It's a very specific decision that for methods that do nothing but mutation they return None to signify that you're not getting back a new object, but something that mutated in-place. This is why list.sort(), for instance, returns None; you are not getting a new copy of the list sorted, you actually changed the original object, which you should have named with a variable. That way you have to access the object by the name you already gave it, as a reminder that it's the same object. Plus Python isn't big on the crazy one-liners where you might want to do an inline call of add() to simply save yourself one line of code. Explicit is better than implicit, and that can mean needing to simply do something in two lines instead of one.
Hi Ben

You've definitely got a point. When you do seen.add(key) it's sometimes important to know if this changes 'seen'.

Here's a one-liner that does what you want:

key in seen or seen.add(key)

It's a bit obscure, and perhaps a bit too much https://en.wikipedia.org/wiki/Code_golf. So here's a worked example.

>>> seen = set()
>>> 0 in seen or seen.add(0)
>>> 0 in seen or seen.add(0)
True

The first time '0 in seen' is False, so the OR part 'seen.add(0)' is executed. As you and others have said, seen.add(key) always returns None. (The REPL does not echo the expression value if it is None.)

The second time '0 in seen' is True, so the OR part is not done, and the value of the expression is True.

If you want, in your own code you can subclass set to provide the behaviour you think is missing. Or just write a utility function. I think that would in practice be better.

>>> def in_or_add(s, key):
...     return key in s or s.add(key) or False
>>> seen = set()
>>> in_or_add(seen, 0)
False
>>> in_or_add(seen, 0)
True

An aside. Your proposal is:
value. set.add would return True if the item was already in the set prior to insertion, and False otherwise.
This allows you to simplify your code example. But I find the semantics a bit odd. If set.add(key) were to return a boolean, I'd expect it to be True if the item genuinely was added. But your semantics are the other way around.

My proposal of in_or_add and your revision of set.add both have the same semantics. I think that is clear. But they have different names. If enough people use a utility function such as in_or_add(s, key), then that makes a good case for adding it as a set method.

Two final points. While processing the 'not key in seen' branch, an exception might occur. Sometimes it's better to add the key to 'seen' once all the associated processing is complete. (An exception that leaves the system in an inconsistent state can cause problems later.) The second point is that I'm not confident about 'in_or_add' as a name for the method.

Once again, thank you for your interesting idea.

-- Jonathan
On 26/06/20 4:01 am, Steele Farnsworth wrote:
My point was only that, as far as I know, all the methods for built in container types that serve only to change what is contained return None, and that this was an intentional design choice, so changing it in one case would have to evoke a larger discussion about what those sorts of methods should return.
The reason for that convention is so that there is no confusion about which methods return new objects and which modify the object in place. However, achieving that goal only requires that mutating methods don't return 'self'. It doesn't mean they can't return some other value that might be useful.

A wider argument against that might be that methods should be classifiable into "procedures" (that have side effects but not return values) and "functions" (that return values but don't have side effects). I'm not sure whether that's considered part of the Python design philosophy -- I don't remember seeing much discussion about it.

In this particular case, it might be better to add a new method rather than redefining the existing 'add' method, because if code assuming the new behaviour were run on an older version of Python, it would fail in an obscure way.

-- Greg
I believe I'm -0 for modifying the existing add() and +0 for a new function.

By the way, a good phrasing of the general mutating/non-mutating convention opens the built-in types section of the docs: "Some collection classes are mutable. The methods that add, subtract, or rearrange their members in place, and don’t return a specific item, never return the collection instance itself but None." Note the use of the word item, rather than any return value.

Out of curiosity though, were we to change set.add(), how could we keep backwards compatibility in the C API? On the Python level, we might find people do "val = set.add(i) or i", or even "any(set.add(i) for i in range(10))", which would break. I myself wrote the former quite often with "list.append" etc. On the C API side, the problem might be worse - I wouldn't be surprised if people use PyCheckExact or pointer comparison for the results of both PySet_Add and set_add of the set object. Is there a way to maintain compatibility on the C API that I'm not aware of?
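To make that Python-level breakage concrete, here is a small sketch (mine, not part of the original mail) of the idioms that rely on add() returning None today:

s = set()
values = [s.add(i) or i for i in range(3)]   # relies on add() returning None (falsy)
print(values)                                # [0, 1, 2]

s = {1}
# Today every add() yields None, so any() is False.  If add() returned True for
# an element that was already present, this expression would start returning True.
print(any(s.add(i) for i in range(3)))       # False today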
On Thu, Jun 25, 2020 at 05:27:16PM +0300, Ben Avrahami wrote:
Hey all, Often I've found this kind of code:
seen = set()
for i in iterable:
    if i in seen:
        ...  # do something in case of duplicates
    else:
        seen.add(i)
        ...  # do something in case of first visit
This kind of code appears whenever one needs to check for duplicates in case of a user-submitted iterable, or when we loop over a recursive iteration that may involve cycles (graph search or the like). This code could be improved if one could ensure an item is in the set, and get whether it was there before in one operation.
I disagree that it would be improved. The existing idiom is simple and flexible and perfectly understandable even if you don't know the gory details about how sets work. Most importantly, it matches the way people think about the task:

# Task: look for duplicates
if element in seen:
    # it's a duplicate
    ...
else:
    # never seen before, so remember it
    ...

This idiom works with sets, it works with lists, it works with trees, it works with anything that supports membership testing and inserting into a collection. It is the natural way to think about the task. Whereas this:

# Check for duplicates.
add element to the collection
was it already there before you just added it?

is a weird way to think about the problem.

already_there = seen.add(element)
if already_there:
    # handle the duplicate case

Who thinks like that? *wink*

Now, I agree that sometimes for the sake of efficiency, we need to do things in a weird way. But membership testing in sets is cheap, it's not a linear search of a huge list. So the efficiency argument here is weak. If you can demonstrate a non-contrived case where the efficiency argument was strong, then I might be persuaded that a new method (not `add`) could be justified.

But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:

seen = {2}
already_there = seen.add(2.0)

At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.

-- Steven
On Friday, June 26, 2020, at 04:54 -0500, Steven D'Aprano wrote:
On Thu, Jun 25, 2020 at 05:27:16PM +0300, Ben Avrahami wrote:
Hey all, Often I've found this kind of code:
seen = set()
for i in iterable:
    if i in seen:
        ...  # do something in case of duplicates
    else:
        seen.add(i)
        ...  # do something in case of first visit
This kind of code appears whenever one needs to check for duplicates in case of a user-submitted iterable, or when we loop over a recursive iteration that may involve cycles (graph search or the like). This code could be improved if one could ensure an item is in the set, and get whether it was there before in one operation.
Whereas this:
# Check for duplicates.
add element to the collection
was it already there before you just added it?
is a weird way to think about the problem.
already_there = seen.add(element)
if already_there:
    # handle the duplicate case
Who thinks like that? *wink*
Anyone who practices EAFP rather than LBYL? Or is that why you're winking?
But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:
seen = {2}
already_there = seen.add(2.0)
At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.
The actual best answer is left as an exercise for the interested reader, but whatever it is, it's justified by backwards compatibility, the existing definition of "present," and the principle of least surprise:

Python 3.8.3 (default, May 17 2020, 18:15:42)
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> seen = {2}
>>> 2.0 in seen
True
>>> seen.add(2.0)
>>> seen
{2}

Unless the new method is called set.add_with_different_semantics. ;-)
On Fri, Jun 26, 2020 at 06:16:05AM -0500, Dan Sommers wrote:
already_there = seen.add(element)
if already_there:
    # handle the duplicate case
Who thinks like that? *wink*
Anyone who practices EAFP rather than LBYL? Or is that why you're winking?
That doesn't come naturally, and in single-threaded code it's also unnecessary. By the way, since there's no try...except involved and no need to catch an exception, it's not EAFP.
But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:
seen = {2}
already_there = seen.add(2.0)
At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.
The actual best answer is left as an exercise for the interested reader, but whatever it is, it's justified by backwards compatibility, the existing definition of "present," and the principle of least surprise:
The documentation doesn't guarantee one behaviour or another: https://docs.python.org/3/library/stdtypes.html#frozenset.add
Python 3.8.3 (default, May 17 2020, 18:15:42)
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> seen = {2}
>>> 2.0 in seen
True
>>> seen.add(2.0)
>>> seen
{2}
Well I'm completely surprised, because I expected `add` to, you know, actually add the element, replacing the one that was already there!

Seriously, I genuinely thought that the existing behaviour was the opposite and that `add` unconditionally added the element. "Last seen wins". If I was designing sets, that's probably how I would design it. After all, it's called *add*, not *add if not already there*. I was so sure that this was the current behaviour that I didn't bother to check it before posting, which is rare for me. So I think this counts as the principle of maximal surprise :-)

Should the flag be "element was already present and nothing was added" or "element was not there, so something was added"?

-- Steven
On Sat, Jun 27, 2020 at 2:29 PM Steven D'Aprano <steve@pearwood.info> wrote:
Seriously, I genuinely thought that the existing behaviour was the opposite and that `add` unconditionally added the element. "Last seen wins". If I was designing sets, that's probably how I would design it. After all, it's called *add*, not *add if not already there*. I was so sure that this was the current behaviour that I didn't bother to check it before posting, which is rare for me.
So I think this counts as the principle of maximal surprise :-)
Well, that's not how Python's dicts OR sets work, so that would be inconsistent. But feel free to do it your own way in your own language. ChrisA
On 27/06/20 4:25 pm, Steven D'Aprano wrote:
Seriously, I genuinely thought that the existing behaviour was the opposite and that `add` unconditionally added the element. "Last seen wins".
The fact that you haven't noticed until now suggests that you've never written any code that depends on the difference.
If I was designing sets, that's probably how I would design it. After all, it's called *add*, not *add if not already there*.
But equally, if it's already there, there's no need to do anything, right? I don't think it's possible to intuit the behaviour from the name "add". I also think that it doesn't matter, and if it does, then a set probably isn't the right data structure for your application. -- Greg
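As an illustration of that last point (this sketch is mine, not Greg's): a dict keyed by the elements gives explicit control over which of several equal-but-distinct objects is kept, e.g. last seen wins:

items = [2, 2.0, True, 1]
last_wins = {}
for x in items:
    last_wins[x] = x                 # the value is overwritten, so the last equal object is kept
print(list(last_wins.values()))      # [2.0, 1]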
On Sat, Jun 27, 2020 at 06:09:02PM +1200, Greg Ewing wrote:
On 27/06/20 4:25 pm, Steven D'Aprano wrote:
Seriously, I genuinely thought that the existing behaviour was the opposite and that `add` unconditionally added the element. "Last seen wins".
The fact that you haven't noticed until now suggests that you've never written any code that depends on the difference.
No, because I've tried to avoid code that would behave differently in those cases. I did find one thing... I was removing an unorded sequence by dropping the elements into a set then converting back to a list, and documented it as "last seen wins" in the case of duplicates. -- Steven
On Sat, Jun 27, 2020 at 04:57:32PM +1000, Steven D'Aprano wrote:
I did find one thing... I was removing an unorded sequence by dropping the elements into a set then converting back to a list, and documented it as "last seen wins" in the case of duplicates.
Sheesh, I was removing *duplicates* from an *unordered* sequence. I need a proof reader. -- Steven
On Friday, June 26, 2020, at 23:25 -0500, Steven D'Aprano wrote:
On Fri, Jun 26, 2020 at 06:16:05AM -0500, Dan Sommers wrote:
already_there = seen.add(element)
if already_there:
    # handle the duplicate case
Who thinks like that? *wink*
Anyone who practices EAFP rather than LBYL? Or is that why you're winking?
That doesn't come naturally, and in single-threaded code it's also unnecessary.
Not unlike ChrisA, I grew up in a multiprocessor, multithreaded, asynchronous world, and I don't always assume that single-threaded code will stay that way.
By the way, since there's no try...except involved and no need to catch an exception, it's not EAFP.
Perhaps by your standard. The code I wrote performs an operation, and then asks whether or not some condition was met, as opposed to asking whether the condition is met first, and then conditionally performing the operation.
But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:
seen = {2}
already_there = seen.add(2.0)
At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.
The actual best answer is left as an exercise for the interested reader, but whatever it is, it's justified by backwards compatibility, the existing definition of "present," and the principle of least surprise:
The documentation doesn't guarantee one behaviour or another:
https://docs.python.org/3/library/stdtypes.html#frozenset.add
Not explicitly, no, but the definition of a set (from that same section of documentation) is "an unordered collection of distinct hashable objects." Is 2 distinct from 2.0? Not according to Python:

Python 3.8.3 (default, May 17 2020, 18:15:42)
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 2 == 2.0
True
Python 3.8.3 (default, May 17 2020, 18:15:42)
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> seen = {2}
>>> 2.0 in seen
True
>>> seen.add(2.0)
>>> seen
{2}
Well I'm completely surprised, because I expected `add` to, you know, actually add the element replacing the one that was already there!
But 2 == 2.0, and I asked first, and the answer was that 2.0 was already in the set. (Unsurprisingly, I get the same results for 2+0j and 2.0+0.0j.) So how would you know whether or not set.add added the [not-new] element or just left the set as is? Do you write code that cares?
Seriously, I genuinely thought that the existing behaviour was the opposite and that `add` unconditionally added the element. "Last seen wins". If I was designing sets, that's probably how I would design it. After all, it's called *add*, not *add if not already there*. I was so sure that this was the current behaviour that I didn't bother to check it before posting, which is rare for me.
So I think this counts as the principle of maximal surprise :-)
Then there's a bug in the documentation. Perhaps the word "distinct" or the description of set.add is insufficient. Relatedly, would your design include both remove and discard?
Should the flag be "element was already present and nothing was added" or "element was not there, so something was added"?
Given that the name of the method in question is "add," IMO the flag would indicate that the element was added.

Please don't read too much into my answer(s), though. I usually write code more like this:

# process unique elements of iterable
for element in set(iterable):
    process_element(element)

and I don't care about handling duplicates one at a time, so I don't care which of multiple non-distinct elements end up in my set.
Dan Sommers writes:
Perhaps by your standard [it's not EAFP]. The code I wrote[:]
already_there = seen.add(element)
if already_there:
    # handle the duplicate case
performs an operation, and then asks whether or not some condition was met, as opposed to asking whether the condition is met first, and then conditionally performing the operation.
Aside: Checking status returns seems like LBYL to me, too. EAFP doesn't ask at all, it responds to a tug on its sleeve. :-)

I think the point that we could cheaply have a (nearly?) atomic "check_presence_and_add_if_not" operation is valid. I can see how that might be useful in coordinating resource access, for example.

Given dict.setdefault, a parallel method like set.add_unique, which returns the existing element or None, might be the better approach. This would also allow the programmer to check for (2 is not 2.0) if he wants to.
Given dict.setdefault, a parallel method like set.add_unique, which returns the existing element or None, might be the better approach.
Although it breaks the convention of these methods returning the item, I'm not sure returning the element is a good idea. "set.add_unique(None)" will return None. The original example of "if dict.setdefault()" is problematic as well, for the exact same reason: you can't tell if the default value was already there in the first place.

@Chris The set.add is already threadsafe, as the GIL is not checked while running the function, but it shouldn't be relied upon, if only because the ABC doesn't guarantee other implementations to be threadsafe, let alone other Python implementations as well. De facto, I've never seen TOCTOU issues with these things as they don't contain external resource access and are usually surrounded by locks for plenty of other reasons, but theoretically it can be useful to prevent the need of locks.

Either way, modifying the existing method is probably a no-no - it will break backwards compatibility in both Python and the C API.

Regarding a new method, what are the actual arguments in favor? Apart from saving one line of code ofc. Besides, I don't believe trying to 1-line it makes the code any clearer, but that's personal taste.
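For completeness, a minimal sketch (entirely illustrative, with invented names) of how code that genuinely cares about the race makes the check-and-add step atomic with an explicit lock today:

import threading

class LockedSeen:
    """Sketch: duplicate tracking whose check-and-add step is made atomic with a lock."""
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def check_and_add(self, item):
        """Return True if item was already present; otherwise add it and return False."""
        with self._lock:
            if item in self._seen:
                return True
            self._seen.add(item)
            return False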
Bar Harel writes:
The original example of "if dict.setdefault()" is problematic as well for the exact same reason. You can't tell if the default value was already there in the first place.
That's not a problem. In both cases if you need an object that can't possibly be there, you have

# 'default' by analogy to dict.setdefault.  It's a bad name for
# this argument to add_unique, and does not have the semantics of
# inserting default in the set.
olditem = someset.add_unique(thing, default=(sentinel:=object()))
if olditem is not sentinel:
    # handle dupe
    ...

More likely you trust yourself not to insert it, and use a sentinel defined previously.

Of course, we may prefer a boolean return, despite the general rule about returning elements. I'm single-threaded, and so agnostic on that. :-) But if it turns out that somebody *wants* to check "2 is 2.0", this .add_unique can serve both purposes.
On Mon, Jun 29, 2020 at 1:53 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
But if it turns out that somebody *wants* to check "2 is 2.0", this .add_unique can serve both purposes.
can it though? (or should it?) -- the fact is that it was decided that 2 and 2.0 are equivalent as members of a set (which makes great sense, they are the SAME number, and the difference is an implementation detail), so the answer to:

2.0 in {2}

and

2 in {2.0}

is True.

I've lost track, but the answer to

a_set.add_unique(2)

should always be the same as

a_set.add_unique(2.0)

for the same set.

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Christopher Barker writes:
On Mon, Jun 29, 2020 at 1:53 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
But if it turns out that somebody *wants* to check "2 is 2.0", this .add_unique can serve both purposes.
can it though?
Yes.
(or should it?)
YMMV, but I think that's not our call. This check *can* be done without this facility, but AFAICS it would be O(n) (iterate over the set, testing for equality, then testing for identity if found). Providing this function simply optimizes that to O(1). The argument against this isn't "should we distinguish int from float in principle", it's "should we bother with a status return at all" and "if we bother, why not KISS with a boolean return", don't you think?
-- the fact is that it was decided the 2 and 2.0 are equivalent as members of a set (which makes great sense, there are they SAME number, and the difference is an implementation detail),
No, it's not just an implementation detail. If 2.0 can sneak into an integer set, so can Inf and NaN, but any integer calculation using members of that set will not be prepared for that. And it's possible that 2.00000000001 can sneak in as well, changing an exact integer calculation into a floating-point calculation with rather different rules. This is the same kind of argument used to justify zip(strict=True). I.e., this only happens if there's a programmer error, and if it does, the feature helps to debug that error.
I've lost track, but the answer to
a_set.add_unique(2) should always be the same as a_set.add_unique(2.0)
for the same set.
True. What about that? That's what makes it possible to tell whether you added the same object or merely an equal one.

I don't have time to think carefully about it, but it now occurs to me to spell add_unique as "intern", as in the idiom

x = a_set.intern(x)

Of course we already have this for str, but only str, with sys.intern. It would likely be a rare use case compared to strings, but in theory any immutable could benefit from this.
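A pure-Python sketch of that idiom (class and method names invented here) could keep the canonical objects in a dict behind a set-like interface:

class InterningSet:
    """Sketch: intern() returns the canonical stored object for any equal newcomer."""
    def __init__(self):
        self._canon = {}                    # element -> first object stored for it

    def intern(self, x):
        return self._canon.setdefault(x, x)

s = InterningSet()
s.intern(2)              # 2 becomes the canonical object
b = s.intern(2.0)        # returns the stored int 2, not the float
print(b, type(b))        # 2 <class 'int'>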
On Mon, Jun 29, 2020, at 04:51, Stephen J. Turnbull wrote:
Of course, we may prefer a boolean return, despite the general rule about returning elements. I'm single-threaded, and so agnostic on that. :-) But if it turns out that somebody *wants* to check "2 is 2.0", this .add_unique can serve both purposes.
I don't think this is a good idea - for one thing, an element return means you can't handle, or have to handle specially, None being in a set.

If we want a way to check the type or identity of an item in a set, I think it would be best to add another method to access the item. Or an operator - i.e. check if type(set[2]) is int or float, or whatever it is you want. I did once write a way to do that in O(1) time in unmodified Python [abusing the __eq__ method of a custom class to smuggle the value out] as an exercise, incidentally, and it seemed to work fine.
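For the curious, that trick can be sketched roughly like this (my reconstruction, with invented names; it relies on the set lookup falling back to the probe's reflected __eq__): a probe object remembers the stored element it compared equal to, so a plain membership test smuggles it out in O(1):

class Probe:
    """Compares equal to a target value and remembers the stored object it matched."""
    def __init__(self, value):
        self.value = value
        self.match = None

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        if other == self.value:
            self.match = other
            return True
        return False

def stored_item(s, value):
    probe = Probe(value)
    if probe in s:          # the set compares its stored entry against the probe
        return probe.match
    raise KeyError(value)

seen = {2}
item = stored_item(seen, 2.0)
print(item, type(item))     # 2 <class 'int'>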
On Fri, Jun 26, 2020 at 7:58 PM Steven D'Aprano <steve@pearwood.info> wrote:
Most importantly, it matches the way people think about the task:
# Task: look for duplicates
if element in seen:
    # it's a duplicate
    ...
else:
    # never seen before, so remember it
    ...
Do you do this:

if file exists:
    # read the file
else:
    # create the file

Or would you try/except around an open call? The TOCTOU issue may be less obvious with sets, but it's still a reasonable thing to concern yourself with.
This idiom works with sets, it works with lists, it works with trees, it works with anything that supports membership testing and inserting into a collection. It is the natural way to think about the task.
So would "ensure this is here, and let me know if you created it". Or "add this, and throw an exception if it was already there", which comes to the same thing, but uses exception handling as its mechanism.
Now, I agree that sometimes for the sake of efficiency, we need to do things in a weird way. But membership testing in sets is cheap, it's not a linear search of a huge list. So the efficiency argument here is weak.
TOCTOU is stronger.
But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:
seen = {2}
already_there = seen.add(2.0)
At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.
It should do the same thing that happens already: keep the existing one. I don't see why this is even in question. Are you searching that hard for objections? ChrisA
On Fri, Jun 26, 2020 at 10:10:22PM +1000, Chris Angelico wrote:
On Fri, Jun 26, 2020 at 7:58 PM Steven D'Aprano <steve@pearwood.info> wrote:
Most importantly, it matches the way people think about the task:
# Task: look for duplicates
if element in seen:
    # it's a duplicate
    ...
else:
    # never seen before, so remember it
    ...
Do you do this:
if file exists:
    # read the file
else:
    # create the file
No, because I have learned that this is both unreliable and dangerous. But that's how I *think* about the problem, at least initially. I expect that if you were prototyping the code on a whiteboard, you probably would too. Checking existence of an element in a set is not comparable to file access. File access has a lot more that can go wrong than just whether or not the file exists. There are permission errors and transient errors that bulletproof code needs to deal with. And file access is always concurrent, since the OS is concurrent. (Unless you have the luxury of working on a single user, single process OS.) So long as you know you are dealing with hashable objects, `set.add` should always succeed.
Or would you try/except around an open call? The TOCTOU issue may be less obvious with sets, but it's still a reasonable thing to concern yourself with.
Unless you are in the [Quadrant of Hell](https://imgur.com/a/jjA52vI) you don't even need to think about this for sets. There are no Time Of Check to Time Of Use issues worth thinking about in the given example, and frankly Chris I doubt that you surround every set access with locks in single-threaded code because TOCTOU is "stronger" (more important?) than writing clear and simple code. And if you do, you're probably wondering why Python is so slow *wink*

But I'll grant you that I wasn't thinking about threaded code. The example given was single-threaded, but maybe a better example would involve threading. But that supports my point: thinking concurrently doesn't come naturally.

Is this check and insert version of `add` threadsafe? If not, then you still need manual locking and this proposal doesn't get you much; if anything it's an attractive nuisance because it looks atomic but isn't:

seen = set()
assert seen.add(None) is False

This could fail in threaded code if the underlying `add` method is not thread-safe:

- your thread calls `add`, which checks that None is not an element;
- between the check and the actual insertion, another thread adds None;
- and your thread redundantly adds it a second time and returns False.

I don't know about Python sets, but most Java collections are not thread safe unless explicitly documented as such, so I wouldn't be the least bit surprised if this could happen.

So `add` would have to be atomic at the C level. If it is already atomic, great, that's one objection neutralised. But if not, then will making it atomic have a performance impact? I'm not very interested in slowing down every single `set.add` just so I have the opportunity to write less understandable code when I don't need to.

But if `set.add` is already atomic and hence thread safe, and the other issues with this proposal are addressed, then I don't object to this proposal. (I reserve the right to change my mind :-)

Just don't tell me that this:

was_already_there = seen.add(element)
if was_already_there:
    ...

is a more natural way of thinking about the problem of checking for duplicates than:

if element in seen:
    ...
This idiom works with sets, it works with lists, it works with trees, it works with anything that supports membership testing and inserting into a collection. It is the natural way to think about the task.
So would "ensure this is here, and let me know if you created it".
Are you proposing to add this to lists, deques, dicts, Mappings, Sequences and all other collections? I think you'll need a PEP :-)
But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:
seen = {2}
already_there = seen.add(2.0)
At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.
It should do the same thing that happens already: keep the existing one.
I actually thought that the existing behaviour was the opposite, and for once I failed to check it before posting. Oops!

But that's not really important. What is important is the question of which of these we wish to match:

# 1
if element in seen:
    print("duplicate element")
seen.add(element)

# 2
if element in seen:
    print("duplicate element")
else:
    seen.add(element)

They have different semantics, and I can imagine cases where both would be useful. Although I suppose option 1 is moot if sets don't over-write the existing element.

-- Steven
On Sat, Jun 27, 2020 at 4:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Jun 26, 2020 at 10:10:22PM +1000, Chris Angelico wrote:
On Fri, Jun 26, 2020 at 7:58 PM Steven D'Aprano <steve@pearwood.info> wrote:
Most importantly, it matches the way people think about the task:
# Task: look for duplicates
if element in seen:
    # it's a duplicate
    ...
else:
    # never seen before, so remember it
    ...
Do you do this:
if file exists:
    # read the file
else:
    # create the file
No, because I have learned that this is both unreliable and dangerous. But that's how I *think* about the problem, at least initially. I expect that if you were prototyping the code on a whiteboard, you probably would too.
Checking existence of an element in a set is not comparable to file access. File access has a lot more that can go wrong than just whether or not the file exists. There are permission errors and transient errors that bulletproof code needs to deal with. And file access is always concurrent, since the OS is concurrent. (Unless you have the luxury of working on a single user, single process OS.)
I grew up with concurrency being a perfectly normal thing. To me, EVERYTHING is concurrent. (Back then, it was single-core CPUs, so technically most machine code instructions could be assumed to be atomic, but at any higher level, you could be preempted at any moment.)
So long as you know you are dealing with hashable objects, `set.add` should always succeed.
Or would you try/except around an open call? The TOCTOU issue may be less obvious with sets, but it's still a reasonable thing to concern yourself with.
Unless you are in the [Quadrant of Hell](https://imgur.com/a/jjA52vI) you don't even need to think about this for sets. There are no Time Of Check to Time Of Use issues worth thinking about in the given example, and frankly Chris I doubt that you surround every set access with locks in single-threaded code because TOCTOU is "stronger" (more important?) than writing clear and simple code. And if you do, you're probably wondering why Python is so slow *wink*
And that is a fundamental difference between our philosophies. You consider threads to be terrifyingly scary, and shared mutable data to be "hell". I consider this to be the normal state of things, and that whenever you care, you take care. It actually doesn't come up all that often.
But I'll grant you that I wasn't thinking about threaded code. The example given was single-threaded, but maybe a better example would involve threading. But that supports my point: thinking concurrently doesn't come naturally.
Nor does thinking algebraically, nor thinking like a database, nor thinking in terms of diamond inheritance and an MRO. These things have to be learned. Is it a problem to have to learn things?
Is this check and insert version of `add` threadsafe? If not, then you still need manual locking and this proposal doesn't get you much; if anything it's an attractive nuisance because it looks atomic but isn't:
seen = set()
assert seen.add(None) is False
This could fail in threaded code if the underlying `add` method is not thread-safe:
- your thread calls `add`, which checks that None is not an element;
- between the check and the actual insertion, another thread adds None;
- and your thread redundantly adds it a second time and returns False.
I don't know about Python sets, but most Java collections are not thread safe unless explicitly documented as such, so I wouldn't be the least bit surprised if this could happen.
So `add` would have to be atomic at the C level. If it is already atomic, great, that's one objection neutralised. But if not, then will making it atomic have a performance impact?
Question: If it ISN'T atomic at the C level, what happens? I can imagine a few possibilities, including utter memory corruption, leaking of objects, insertion of two identical elements into a set, and many other things that would fundamentally violate Python's own integrity. What would happen that would be considered acceptable, but wouldn't guarantee this? I would expect that set.add() is guaranteed to not add a duplicate, and that right there basically guarantees everything that I need. It doesn't have to be completely atomic but it does have to maintain the set invariants. CPython guarantees this level of safety using the GIL. Other Pythons may do it in other ways. It hardly matters, because any way will suffice.
I'm not very interested in slowing down every single `set.add` just so I have the opportunity to write less understandable code when I don't need to. But if `set.add` is already atomic and hence thread safe, and the other issues with this proposal are addressed, then I don't object to this proposal. (I reserve the right to change my mind :-)
Just don't tell me that this:
was_already_there = seen.add(element)
if was_already_there:
    ...
is a more natural way of thinking about the problem of checking for duplicates than:
if element in seen: ...
For a more fair comparison, remember that you need to add the element. So your options are:

if set.ensure(element):
    # it wasn't already there

and

if element not in set:
    set.add(element)
    # it wasn't already there

which is basically a wash.
This idiom works with sets, it works with lists, it works with trees, it works with anything that supports membership testing and inserting into a collection. It is the natural way to think about the task.
So would "ensure this is here, and let me know if you created it".
Are you proposing to add this to lists, deques, dicts, Mappings, Sequences and all other collections? I think you'll need a PEP :-)
No, I'm not. I'm only talking about this as a concept in things that reject duplicates. So at best, it would be sets and dicts. Why on earth would you assume I would want this operation on a list? Are you THAT desperate for an argument?
But either way, you also have to decide whether the `add` (or the new method) should *unconditionally* insert the element, or only do so if it wasn't present. This makes a big difference:
seen = {2}
already_there = seen.add(2.0)
At this point, is `seen` the set {2} or {2.0}? Justify why one answer is the best answer.
It should do the same thing that happens already: keep the existing one.
I actually thought that the existing behaviour was the opposite, and for once I failed to check it before posting. Oops!
But that's not really important. What is important is the question of which of these we wish to match:
# 1
if element in seen:
    print("duplicate element")
seen.add(element)
# 2
if element in seen:
    print("duplicate element")
else:
    seen.add(element)
They have different semantics, and I can imagine cases where both would be useful. Although I suppose option 1 is moot if sets don't over-write the existing element.
They don't have different semantics, so the distinction is irrelevant. We can match them both.
>>> seen = {1}
>>> seen.add(1.0)
>>> seen.add(True)
>>> seen
{1}
The only change would be for the add call to signal whether or not it changed the set. The state of the set afterwards would not change. ChrisA
On Saturday, June 27, 2020, at 03:46 -0500, Chris Angelico wrote:
I grew up with concurrency being a perfectly normal thing. To me, EVERYTHING is concurrent. (Back then, it was single-core CPUs, so technically most machine code instructions could be assumed to be atomic, but at any higher level, you could be preempted at any moment.)
Concurrent and asynchronous. And these funny red hats. Way back when (1989ish?), I ported an [in house, proprietary] OS from the MC68000 to the MC68020 and MC68030 (https://en.wikipedia.org/wiki/Motorola_68000_series), and one of the "breakthroughs" was that individual instructions on a single core CPU were *not* atomic. IIRC, the worst case was that an individual instruction could be interrupted three or four times due to page faults, plus certain I/O or other interrupts. Not to mention that consecutive instructions could overlap in weird ways or be executed out of order. But I digress. I, too, found out later that concurrency was hard, mostly from people who *hadn't* programmed at that bare metal level.
TL;DR If the `set.add` feature can return a flag stating whether or not the element was previously present, cheaply and atomically, without slowing down the single-threaded case with locks, then that's great (as I already stated...) and this entire subthread about threading is irrelevant and you can stop reading now. If not, then we should consider carefully the consequences of adding a feature that will either slow down the common single-threaded case, or potentially be an attractive nuisance in the multi-threaded case.

On Sat, Jun 27, 2020 at 06:46:47PM +1000, Chris Angelico wrote:
I grew up with concurrency being a perfectly normal thing. To me, EVERYTHING is concurrent. (Back then, it was single-core CPUs, so technically most machine code instructions could be assumed to be atomic, but at any higher level, you could be preempted at any moment.)
Cool, but it ignores the point that when you access the file system, you cannot avoid concurrency because that's what the OS gives you whether you want it or not. That's not what happens with adding elements to a set, unless you are sharing it between threads.

(For the record, I too grew up with concurrency being a perfectly normal thing. Every desktop OS I've ever touched has had some form of multi-processing or another. Classic MacOS had cooperative multiprocessing and desk accessories, DOS had TSRs, and the first computer I used for serious coding was Unix.)
Unless you are in the Quadrant of Hell (https://imgur.com/a/jjA52vI) you don't even need to think about this for sets. There are no Time Of Check to Time Of Use issues worth thinking about in the given example, and frankly Chris I doubt that you surround every set access with locks in single-threaded code because TOCTOU is "stronger" (more important?) than writing clear and simple code. And if you do, you're probably wondering why Python is so slow *wink*
And that is a fundamental difference between our philosophies. You consider threads to be terrifyingly scary,
Oh I do, sometimes I just start screaming at the very thought. I've even been known to wet myself.
and shared mutable data to be "hell". I consider this to be the normal state of things, and that whenever you care, you take care. It actually doesn't come up all that often.
I'm sure you're very good at what you do, but deadlocks, resource starvation, non-deterministic execution order, race conditions, etc are genuine problems for people, and concurrent code is much harder to write correctly than sequential code, and even more difficult to debug. Stories like these are hardly rare:

http://thedailywtf.com/articles/Sprite_Threading
http://worsethanfailure.com/articles/Flossin_0x27__My_Threads

Other people have suggested that concurrent code is often riddled with race conditions:

"Because concurrency is so incredibly difficult, many code bases I have worked on have these sorts of bugs….everywhere"

https://medium.com/dm03514-tech-blog/golang-candidates-and-contexts-a-heuris...

and the consequences of race conditions have sometimes been catastrophic:

https://www.securityfocus.com/news/8412
https://en.wikipedia.org/wiki/Therac-25

which is why people much smarter than me have spent decades designing languages and execution models specifically to deal with concurrency and parallel computing safely:

https://en.wikipedia.org/wiki/List_of_concurrent_and_parallel_programming_la...

or simply recommending that, when possible, we ought to avoid mutable shared state (in the same way we avoid GOTOs and global variables and other anti-patterns):

https://jaxenter.com/akka-anti-patterns-shared-mutable-state-128827.html
https://exploringjs.com/deep-js/ch_shared-mutable-state.html
http://web.mit.edu/6.031/www/fa17/classes/20-thread-safety/

I'm not sure why this is so controversial. I'm making a moderate claim here, threading is much harder to get right than sequential code.
But I'll grant you that I wasn't thinking about threaded code. The example given was single-threaded, but maybe a better example would involve threading. But that supports my point: thinking concurrently doesn't come naturally.
Nor does thinking algebraically, nor thinking like a database, nor thinking in terms of diamond inheritance and an MRO. These things have to be learned. Is it a problem to have to learn things?
Maybe. It depends on the things you have to learn and why you need to. Personally I don't feel that learning category theory and monads just to be able to print "Hello World" would be a good use of my time. https://en.wikipedia.org/wiki/Monad_%28functional_programming%29 Please let's not over-generalise here, I'm not opposed to learning in general, I'm pointing out that writing concurrent-style code for something which is trivially thought of as "if the element has not been seen, add it" is not the most natural way of thinking about the task.
So `add` would have to be atomic at the C level. If it is already atomic, great, that's one objection neutralised. But if not, then will making it atomic have a performance impact?
Question: If it ISN'T atomic at the C level, what happens?
Sometimes it will return False when it should have returned True, or True when it should have returned False, and because of that wrong answer, people will have mysterious bugs, crashes, failures, and code that silently does the wrong thing. [...]
I would expect that set.add() is guaranteed to not add a duplicate, and that right there basically guarantees everything that I need.
Don't you also need the method to guarantee to return True only if the element was already in the set, and False only if the element was not in the set? That is the point of the proposed feature. (Or maybe it was the other way around, I forget which way the flag is supposed to go.) If you don't have that guarantee, then you either can't use this feature, or you have to use manual locks, in which case the benefit is significantly reduced. [...]
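For illustration, the manual-lock fallback mentioned above might look something like this (the helper name and the module-level lock are made up for the sketch):

import threading

seen = set()
seen_lock = threading.Lock()

def locked_add(element):
    # The lock makes the membership test plus the insert one atomic step,
    # at the cost of taking a lock even in single-threaded code.
    with seen_lock:
        was_present = element in seen
        seen.add(element)
        return was_present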
This idiom works with sets, it works with lists, it works with trees, it works with anything that supports membership testing and inserting into a collection. It is the natural way to think about the task.
So would "ensure this is here, and let me know if you created it".
Are you proposing to add this to lists, deques, dicts, Mappings, Sequences and all other collections? I think you'll need a PEP :-)
No, I'm not. I'm only talking about this as a concept in things that reject duplicates. So at best, it would be sets and dicts. Why on earth would you assume I would want this operation on a list?
I commented that the "check, then insert" idiom works with lists etc as well as sets, but the "insert and return a flag" idiom doesn't work with lists etc. You countered that by saying that the "insert and return flag" would work equally well. It can't work at all if it doesn't exist. So for it to work just as well for lists as it works for sets, it has to be added to lists.
Are you THAT desperate for an argument?
Is that what we're having? I thought we were discussing the pros and cons of a proposed feature. That's what the purpose of Python-Ideas is, or at least was. That's the second time you've made that comment. It would be more considerate, open and empathic to assume I'm having a good faith discussion instead of accusing me of trolling. [...]
They have different semantics, and I can imagine cases where both would be useful. Although I suppose option 1 is moot if sets don't over-write the existing element.
They don't have different semantics, so the distinction is irrelevant.
In one the add method is unconditionally called. In the other, the add method is only sometimes called. The difference in semantics is clear if you consider that this will affect not only builtin sets, but also subclasses which may have side-effects or perform other computations.

--
Steven
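A tiny illustration of that point: with a subclass that merely counts its add() calls (a made-up example, not anything from the proposal), the two idioms are observably different even though the final set contents are the same.

class CountingSet(set):
    # Illustrative subclass whose only side effect is counting add() calls.
    def __init__(self, *args):
        super().__init__(*args)
        self.add_calls = 0

    def add(self, element):
        self.add_calls += 1
        super().add(element)

data = [1, 1, 2]

s1 = CountingSet()
for element in data:        # idiom 1: add unconditionally
    if element in s1:
        pass
    s1.add(element)

s2 = CountingSet()
for element in data:        # idiom 2: add only when the element is absent
    if element not in s2:
        s2.add(element)

print(s1 == s2, s1.add_calls, s2.add_calls)   # True 3 2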
On Sun, Jun 28, 2020 at 3:39 AM Steven D'Aprano <steve@pearwood.info> wrote:
TL;DR
If the `set.add` feature can return a flag stating whether or not the element was previously present, cheaply and atomically, without slowing down the single-threaded case with locks, then that's great (as I already stated...) and this entire subthread about threading is irrelevant and you can stop reading now.
If not, then we should consider carefully the consequences of adding a feature that will either slow down the common single-threaded case, or potentially be an attractive nuisance in the multi-threaded case.
CPython has the GIL. Every other Python will have some equivalent way to ensure that its own internal data structures maintain their integrity. At some point, there will be a bit of code somewhere that decides whether to change some bits in memory, or not change some bits in memory. In CPython, there's basically no way that this can be done safely without also guaranteeing the atomicity needed, and therefore there is no problem. Other implementations may have something slightly different, but the chances of a problem that doesn't also cause fundamental internal issues would be extremely low.

The worst case scenario that I can imagine is a non-refcounting Python (let's say it uses mark-and-sweep garbage collection), a set is defined as a hash table, and insertion into it is performed by writing a single pointer into a single cell in the base array. In a multithreaded situation, you could have two threads simultaneously try to add the same item, or simultaneously try to add two different items. In order to guarantee that adding two items won't accidentally overwrite one (which would imply that "s.add(1)" could silently discard a completely unrelated value), there would need to be a safe way to write one pointer into one cell with a guarantee that the cell had previously been empty.

And that is, in fact, precisely what is needed for the proposed feature: "insert this, and be certain that it hadn't previously been here". This is basically what "lock cmpxchg" is for on the Intel CPUs, and equivalents on others. There could be lock-free code before and after this (although in CPython, the GIL basically means there won't be), but at some point, there would need to be that "final action" that is atomic, and it *already* needs to be. This feature does not increase the atomicity requirements.
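As an aside, something close to an atomic "insert and report" can already be approximated in CPython today by pressing a dict into service; the sketch below leans on the GIL behaviour described in the comments, which is an assumption about current CPython rather than a documented guarantee:

seen = {}   # dict standing in for a set; the values are throwaway per-call tokens

def seen_before(element):
    # dict.setdefault is a single C-level method call, so under CPython's GIL
    # the "insert if absent, return the stored value" step behaves atomically
    # for keys whose __hash__/__eq__ don't call back into Python code.
    token = object()                     # unique marker for this particular call
    return seen.setdefault(element, token) is not token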
which is why people much smarter than me have spent decades designing languages and execution models specifically to deal with concurrency and parallel computing safely:
https://en.wikipedia.org/wiki/List_of_concurrent_and_parallel_programming_la...
Well... yes. That's exactly what I'm building on here. If you have a Python implementation that includes threads, then this functionality MUST already exist.
or simply recommending that, when possible, we ought to avoid mutable shared state (in the same way we avoid GOTOs and global variables and other anti-patterns):
https://jaxenter.com/akka-anti-patterns-shared-mutable-state-128827.html
https://exploringjs.com/deep-js/ch_shared-mutable-state.html
Right. Avoid it where possible, because it has many costs. But fundamentally it can't be completely eliminated without some ridiculous dances, and I'm assuming here that the set *is* being mutated from multiple threads simultaneously. If it isn't (either because it's local to one thread, or it's in a "one writer, many readers" model), it's easy, and the only place that needs to have any locking whatsoever is a hash resize or equivalent. This discussion is all about the worst case scenario: two threads concurrently trying to insert into the same set.
I'm not sure why this is so controversial. I'm making a moderate claim here, threading is much harder to get right than sequential code.
And I'm making a counter-claim: threading is an already-solved problem. Sure it's harder, but it's already harder and we're dealing with it.
Please let's not over-generalise here, I'm not opposed to learning in general, I'm pointing out that writing concurrent-style code for something which is trivially thought of as "if the element has not been seen, add it" is not the most natural way of thinking about the task.
Should you have to think about resource leakage with open() calls, or do you just learn to follow best practice and use a 'with' block? Sometimes we don't need to learn about concurrency to do the right thing, we just learn what's the right thing, and if we're curious as to WHY it's the right thing, we can *then* learn about concurrency.
So `add` would have to be atomic at the C level. If it is already atomic, great, that's one objection neutralised. But if not, then will making it atomic have a performance impact?
Question: If it ISN'T atomic at the C level, what happens?
Sometimes it will return False when it should have returned True, or True when it should have returned False, and because of that wrong answer, people will have mysterious bugs, crashes, failures, and code that silently does the wrong thing.
My point is that this cannot happen.
I would expect that set.add() is guaranteed to not add a duplicate, and that right there basically guarantees everything that I need.
Don't you also need the method to guarantee to return True only if the element was already in the set, and False only if the element was not in the set? That is the point of the proposed feature.
Propose a way that it could get the return value wrong, but couldn't also accidentally discard some other item, or corrupt fundamental internals (like CPython's reference counter).
No, I'm not. I'm only talking about this as a concept in things that reject duplicates. So at best, it would be sets and dicts. Why on earth would you assume I would want this operation on a list?
I commented that the "check, then insert" idiom works with lists etc as well as sets, but the "insert and return a flag" idiom doesn't work with lists etc. You countered that by saying that the "insert and return flag" would work equally well. It can't work at all if it doesn't exist. So for it to work just as well for lists as it works for sets, it has to be added to lists.
When do you need this for a list, and do you actually need this guarantee of atomicity?
Are you THAT desperate for an argument?
Is that what we're having? I thought we were discussing the pros and cons of a proposed feature. That's what the purpose of Python-Ideas is, or at least was.
That's the second time you've made that comment. It would be more considerate, open and empathic to assume I'm having a good faith discussion instead of accusing me of trolling.
I'm honestly not sure what your motivation is. You're clutching at tiny straws in order to debate a point. Either that, or you're so terrified of threading that you don't want to learn anything about it, and yet want to debate against it. Please, prove me wrong. I just keep finding you to be argumentative rather than actually trying to find the truth.
They have different semantics, and I can imagine cases where both would be useful. Although I suppose option 1 is moot if sets don't over-write the existing element.
They don't have different semantics, so the distinction is irrelevant.
In one the add method is unconditionally called. In the other, the add method is only sometimes called. The difference in semantics is clear if you consider that this will affect not only builtin sets, but also subclasses which may have side-effects or perform other computations.
You said yourself that option 1 is moot (since sets indeed don't overwrite). So the true semantic difference with core sets is that one of them has a bug in it (a TOCTOU flaw) and the other doesn't. With subclasses, well, anything could happen - the "in" operator could be implemented strictly backwards, or the "add" method could be written to throw an exception if today is Friday, or anything. We aren't defining semantics for every subclass, just for the core.

ChrisA
participants (13)

- Alex Hall
- Bar Harel
- Ben Avrahami
- Brett Cannon
- Chris Angelico
- Christopher Barker
- Dan Sommers
- Greg Ewing
- Jonathan Fine
- Random832
- Steele Farnsworth
- Stephen J. Turnbull
- Steven D'Aprano