Mailman 3 Sets for easy interning(?) - Python-ideas

Sets for easy interning(?)

Soni L.

Dec. 3, 2019

1:25 a.m.

This is an odd request but it'd be nice if, given a set s = {"foo"}, s["foo"] returned the "foo" object that is actually in the set, or KeyError if the object is not present. Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.


Inada Naoki

December 2019

1:33 a.m.

FWIW, you can do it with dict already.

    o = memo.setdefault(o, o)

On Tue, Dec 3, 2019 at 9:29 AM Soni L. <fakedme+py@gmail.com> wrote:

...

This is an odd request but it'd be nice if, given a set s = {"foo"}, s["foo"] returned the "foo" object that is actually in the set, or KeyError if the object is not present.

Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/T3Z32D...
Code of Conduct: http://python.org/psf/codeofconduct/

-- Inada Naoki <songofacandy@gmail.com>

Kyle Stanley

7:02 a.m.

...

FWIW, you can do it with dict already.

...

o = memo.setdefault(o, o)

I don't think this has quite the same behavior as the OP is looking for, since dict.setdefault() will insert the key and return the default when it's not present, whereas the OP wanted to raise "KeyError if the object is not present". In order to raise a KeyError from a missing key and have the values be the same as the keys, one could build a dictionary like this:

```
d = {}
for item in sequence:
    d[item] = item
```

or using a comprehension:

```
d = {item: item for item in sequence}
```

and then

```
try:
    val = d['foo']
except KeyError:
    ...
```

But yeah, this behavior already exists for dictionaries. Personally, I think some_set['foo'] would likely:

1) Not make much sense for usage in sets, from a design perspective.
2) Lack practical value, as opposed to simply using a dictionary for the same purpose.

To me, this feature doesn't seem worthwhile to implement or maintain. There are probably other reasons to consider as well. However, if the OP wants to personally implement this behavior for their own subclass of sets (instead of adding it to the language), that could be done rather easily:

```
>>> class MySet(set):
...     def __getitem__(self, key):
...         if key not in self:
...             raise KeyError(f'{key} not present in set')
...         else:
...             return key
...
>>> s = MySet({'a', 'b', 'c'})
>>> s['a']
'a'
>>> s['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in __getitem__
KeyError: 'd not present in set'
```


This may not have the same performance as d[key] since it has the
conditional check for membership, but it provides the same functionality.
In 90% of use cases the performance difference should be negligible. I
don't think that I'd advocate for using the above instead of a dict, but
it's rather straightforward to implement if desired.


Random832

7:38 a.m.

On Tue, Dec 3, 2019, at 01:02, Kyle Stanley wrote:

...

However, if the OP wants to personally implement this behavior for their own subclass of sets (instead of adding it to the language), that could be done rather easily:

The OP wanted to return the object that was in the set. I don't think there's currently a way to get it in O(1) time [you can get it in O(N) with a naive loop or with s - (s - {key})]. intersection returns precisely the wrong object in my tests.
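The two linear-time approaches mentioned above can be sketched roughly as follows. This is only an illustration: the helper names `find_member_loop` and `find_member_difference` are made up for this example, and the second one relies on set difference keeping elements from its left operand.

```python
def find_member_loop(s, key):
    """Return the element of s that equals key by scanning every element (O(N))."""
    for member in s:
        if member == key:
            return member
    raise KeyError(key)


def find_member_difference(s, key):
    """Same result via s - (s - {key}); still O(N), and it builds two temporary sets."""
    if key not in s:
        raise KeyError(key)
    # s - {key} drops the stored member equal to key; subtracting that from s
    # leaves only that stored member, taken from s rather than from {key}.
    (member,) = s - (s - {key})
    return member


s = {1.0, 2.0, 3.0}
# 1 == 1.0 and hash(1) == hash(1.0), so the int 1 locates the stored float 1.0.
print(type(find_member_loop(s, 1)))
print(type(find_member_difference(s, 1)))
```

Both helpers hand back the object stored in the set (a float here), not the probe that was passed in.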

Andrew Barnert

10:27 a.m.

On Dec 2, 2019, at 22:40, Random832 <random832@fastmail.com> wrote:

...

The OP wanted to return the object that was in the set. I don't think there's currently a way to get it in O(1) time

Yeah, intersection is basically just {key for key in smaller_set if key in larger_set}, so it’s always going to return the wrong thing, unless set is 1 element long, in which case it depends whether you do set&{k} or {k}&set.

I don’t think there’s any way to do this in better than linear time without access to the internals of the table. But it would be trivial to add to setobject.c. (It’s the same as contains, but instead of comparing entry->key to the key, you just return it.)

Given that it’s occasionally useful, easy to implement on the type, and impossible to implement from outside, maybe it’s worth adding.

I don’t think it should be spelled s[key], because that’s confusing. But a method s.lookup(key) that returned the member equal to key or raised KeyError doesn’t seem like it would confuse anyone.

(Meanwhile, if you need this behavior in Python today, and you can’t accept linear time, but can accept a significant constant multiplier, you could always grab the old 2.3 sets module, change the underlying dict from mapping each key to None to instead map it to itself, and then the method is just return self._dict[key].)
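A minimal sketch of that dict-backed idea, written against collections.abc.MutableSet rather than the old sets module. The class name `LookupSet` is hypothetical, and `lookup` is only the method name floated in this message, not an existing set method.

```python
from collections.abc import MutableSet


class LookupSet(MutableSet):
    """Set-like container that can hand back the stored member equal to a key."""

    def __init__(self, iterable=()):
        # Each member maps to itself instead of to None.
        self._dict = {}
        for item in iterable:
            self.add(item)

    def __contains__(self, key):
        return key in self._dict

    def __iter__(self):
        return iter(self._dict)

    def __len__(self):
        return len(self._dict)

    def add(self, item):
        # setdefault keeps the first object stored for a given value, as set.add does.
        self._dict.setdefault(item, item)

    def discard(self, item):
        self._dict.pop(item, None)

    def lookup(self, key):
        """Return the stored member equal to key, or raise KeyError."""
        return self._dict[key]


s = LookupSet([1.0, 2.0])
print(repr(s.lookup(1)))  # the stored float, found through the equal int 1
```

The mixin methods of MutableSet (union, difference and so on) come for free, at the cost of one extra pointer per member compared with a real set.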

Serhiy Storchaka

10:13 a.m.

03.12.19 02:25, Soni L. wrote:

...

This is an odd request but it'd be nice if, given a set s = {"foo"}, s["foo"] returned the "foo" object that is actually in the set, or KeyError if the object is not present.

Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.

It was discussed before. The conclusion was that there are too few use cases for this (although with the addition of := and the discussion of adding + for dicts, this is no longer considered a strong argument), and in these cases a dict works as well as a set.

Andrew Barnert

10:54 a.m.

On Dec 2, 2019, at 16:27, Soni L. <fakedme+py@gmail.com> wrote:

...

Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.

A more concrete use case might help make the argument better. My first thought was a function that needs the zero value for a set of (ints, floats, polynomials, whatever) could just do elements[0], but how often does that come up? (And how often can you trust that the type’s zero value is ==0, but can’t trust that the nullary constructor returns a zero?) I’m pretty sure I have run into a handful of more useful uses for this method over the years, but I can’t remember any. Maybe something to do with Unicode normalization?

Steven D'Aprano

12:34 p.m.

On Tue, Dec 03, 2019 at 01:54:44AM -0800, Andrew Barnert via Python-ideas wrote:

...

On Dec 2, 2019, at 16:27, Soni L. <fakedme+py@gmail.com> wrote:

...
Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.

A more concrete use case might help make the argument better.

Is interning concrete enough?

The Python interpreter interns at least two kinds of objects: ints and strings, or rather, *some* ints and strings. Back in Python 1.5, there was a built-in for interning strings:

    # Yes I still have a 1.5 interpreter :-)
    >>> a = intern("hello world")
    >>> b = intern("hello world")
    >>> a is b
    1

so perhaps people might like to track down the discussion for and against removing intern.

We can get the same effect with a dict, but at the cost of using two pointers per interned object (one as the key, one as the value):

    cache = {}
    def intern(obj):
        return cache.setdefault(obj, obj)

You could cut that to one pointer by using a set, at the expense of making retrieval slower and more memory-hungry:

    # untested
    cache = set()
    def intern(obj):
        if obj in cache:
            # the difference is a one-element set; pop() extracts the member
            return (cache - (cache - {obj})).pop()
        cache.add(obj)
        return obj

The interpreter interns only a subset of ints and strings because to intern more would just waste memory for no use. But that's because the interpreter has to consider arbitrary programs. If I knew that my program was generating billions of copies of the same subset of values, I might be able to save memory (and time?) by interning them.

This is terribly speculative of course, but with no easy way to experiment, speculating is all I can do.

-- Steven
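For anyone who wants to experiment, here is a runnable variant of the two sketches above. The function names `intern_with_dict` and `intern_with_set` are made up for this example, and the set version unpacks the one-element result of the double difference so that it returns the member itself.

```python
_dict_cache = {}


def intern_with_dict(obj):
    """Return the canonical object equal to obj, storing obj if it is new."""
    return _dict_cache.setdefault(obj, obj)


_set_cache = set()


def intern_with_set(obj):
    """Set-only variant: O(N) retrieval through a double set difference."""
    if obj in _set_cache:
        # The difference is a one-element set; unpack it to get the member.
        (canonical,) = _set_cache - (_set_cache - {obj})
        return canonical
    _set_cache.add(obj)
    return obj


# Build equal tuples at run time so they start out as separate objects.
a = tuple([1, 2, 3])
b = tuple([1, 2, 3])
print(a is b)
print(intern_with_dict(a) is intern_with_dict(b))
print(intern_with_set(a) is intern_with_set(b))
```

Neither cache ever shrinks, so this only makes sense when the set of distinct values is small relative to the number of copies.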

Soni L.

1:22 p.m.

On 2019-12-03 8:34 a.m., Steven D'Aprano wrote:

...

On Tue, Dec 03, 2019 at 01:54:44AM -0800, Andrew Barnert via Python-ideas wrote:

...
On Dec 2, 2019, at 16:27, Soni L. <fakedme+py@gmail.com> wrote:

...
Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.

A more concrete use case might help make the argument better.

Is interning concrete enough?

The main reason I spelled "interning" as "interning(?)" is that, uh, as far as I can tell we kinda lack weak sets, and they're pretty important for interning. I could be wrong tho.

Other than that, I'd definitely prefer sets over dicts for interning. I also believe sets are better represented as key-key mappings, not key-None nor key-True, as such I've taken to treating sets as equivalent to key-key mappings for the purposes of my library, but this is a bit of a pain point due to the lack of indexing.

Both lists and dicts have indexing, so there's no issue treating them as mappings, but sets *don't* have indexing, so a current wart in my DSL is that you can index lists and dicts but not sets, and yet you can iterate and filter all 3. I could (and maybe I should) add a special case for sets, but idk.


Andrew Barnert

7:26 p.m.

On Dec 3, 2019, at 03:41, Steven D'Aprano <steve@pearwood.info> wrote:

On Tue, Dec 03, 2019 at 01:54:44AM -0800, Andrew Barnert via Python-ideas wrote:

On Dec 2, 2019, at 16:27, Soni L. <fakedme+py@gmail.com> wrote:

Even use-cases where you have different objects whose differences are ignored for __eq__ and __hash__ and you want to grab the one from the set ignoring their differences would benefit from this.

A more concrete use case might help make the argument better.

Is interning concrete enough?

No. A concrete use for interning would be, but interning itself isn’t.

If you’re using interning for functionality, to distinguish two equal strings that came from different inputs or processes, your code is probably broken. Python is allowed to merge distinct equal values of builtin immutable types whenever it wants to. And different interpreters, and even different CPython versions, may do that in different cases. That means any code that relies on the result of `is` on two equal immutable values is wrong.

If you don’t care about portability or future compatibility, you could always work out the rules for one interpreter, version, and build. But they’re pretty complicated. IIRC, the current rules for a default build of CPython are something like this:

* Two equal string literals in the same scope are identical.
* Two string expressions in the same scope with equal values that the optimizer is able to turn into constants are identical.
* There’s some rule for interactive literals that I don’t remember, so even though two top-level interactive statements are compiled and evaluated as separate scopes they can still share constant string values.
* Two empty strings are identical if they’re created by any builtin, but it’s possible to create distinct ones with the C API.
* Some single-character strings are treated the same as the empty string; the exact set is a compile-time option but defaults to all printable ASCII characters or all ASCII characters or something like that.
* Copying a string with [:] or even copy.deepcopy gives you the same string.

And there are similar but not identical rules for bytes and int, while bools and None are stricter (even C extensions can’t give you a distinct but equal None value), and float and tuple are looser (inf is a singleton like “”, but every float('inf') returns a new value anyway). And I can’t remember how tuple scope merging changed when tuples deeper than 1 were allowed to become constants.

So, what can you actually safely do with interning?

You could try to optimize your code by interning a bunch of your strings and then using `a is b or a == b` instead of just `a == b`, but this will almost always make it slower, not faster.

What about optimizing for memory instead of speed? Interning a string would waste, say, 24 bytes, but if you have 1000 copies of that same string, N+24 is a lot better than N*1000. But what kind of application are you building that stores vast numbers of duplicates of strings and isn’t storing them in a set or dict or database or custom b-tree or trie or whatever? And once you do that, it doesn’t matter whether the boxed Python values are interned, only whether the values inside that data structure are collapsed (and in all those cases, they either are or trivially could be).

Maybe you can come up with some application that does need to store a billion copies of only a thousand strings, and needs to store them in a list (or a billion separate locals, I guess…). If so, then you’ve got a concrete use case.

...

The Python interpreter interns at least two kinds of objects: ints and strings, or rather, *some* ints and strings.

This is of course the CPython interpreter; different interpreters will be different.

...

Back in Python 1.5, there was a built-in for interning strings:

    # Yes I still have a 1.5 interpreter :-)
    >>> a = intern("hello world")
    >>> b = intern("hello world")
    >>> a is b
    1

And (at least in Pythonista, which currently embeds CPython 3.6.1, but I’m not sure its REPL behavior is always identical to the stock one):

    >>> a = 'hello'
    >>> b = 'hello'
    >>> a is b
    True

By the way, intern was still there until 2.7, but in that list of “we can’t deprecate these but please never use them” functions at the end of builtins, so you didn’t actually need 1.5 to test it. But I understand; you can never be too sure that the 2.0 license won’t turn out to be as unusable as the 1.6 license, so you need something to fall back on. :)

Chris Angelico

7:30 p.m.

On Wed, Dec 4, 2019 at 5:27 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

...

By the way, intern was still there until 2.7, but in that list of “we can’t deprecate these but please never use them” functions at the end of builtins, so you didn’t actually need 1.5 to test it. But I understand; you can never be too sure that the 2.0 license won’t turn out to be as unusable as the 1.6 license, so you need something to fall back on. :)

It's still around - it's just called sys.intern() instead of being a builtin.

ChrisA
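A short illustration of sys.intern() in Python 3. The strings are built at run time so that they begin as separate objects; the variable names are arbitrary.

```python
import sys

# Build two equal strings at run time so they are separate objects to begin with.
a = "".join(["hello", " ", "world"])
b = "".join(["hello", " ", "world"])

a_interned = sys.intern(a)
b_interned = sys.intern(b)

print(a == b)                    # equal values
print(a_interned is b_interned)  # one shared object after interning
```

Note that sys.intern() only accepts str, so it does not cover ints, tuples or user-defined types.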

Greg Ewing

12:44 a.m.

On 4/12/19 7:26 am, Andrew Barnert via Python-ideas wrote:

...

If you’re using interning for functionality, to distinguish two equal strings that came from different inputs or processes, your code is probably broken.

That's not what interning is normally used for. Usually it's to allow tests for equality to be replaced by tests for identity.

-- Greg

Soni L.

12:53 a.m.

On 2019-12-03 8:44 p.m., Greg Ewing wrote:

...

On 4/12/19 7:26 am, Andrew Barnert via Python-ideas wrote:

...
If you’re using interning for functionality, to distinguish two equal strings that came from different inputs or processes, your code is probably broken.

That's not what interning is normally used for. Usually it's to allow test for equality to be replaced by tests for identity.

That's not what interning is normally used for. It's for lowering RAM usage. Okay, sometimes it's also used for that. But the main use-case is for lowering RAM usage for immutable objects. Granted, Python doesn't have immutable objects, so you're basically just hoping nobody messes with the returned objects...

Greg Ewing

7:11 a.m.

On 4/12/19 12:53 pm, Soni L. wrote:

...

Okay, sometimes it's also used for that. But the main use-case is for lowering RAM usage for immutable objects.

Citation needed. If that's true, why does Python intern names used in code, but not strings in general? I'd say because looking names up in dicts benefits enormously from being able to quickly compare for equality. -- Greg

Chris Angelico

7:33 a.m.

On Wed, Dec 4, 2019 at 5:13 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...

On 4/12/19 12:53 pm, Soni L. wrote:

...
Okay, sometimes it's also used for that. But the main use-case is for lowering RAM usage for immutable objects.

Citation needed. If that's true, why does Python intern names used in code, but not strings in general? I'd say because looking names up in dicts benefits enormously from being able to quickly compare for equality.

It's a trade-off between the work needed to intern every string, and the memory savings from reusing them. Some languages do indeed guarantee that EVERY string (not just literals) is interned. CPython has a weaker policy (and Python-the-language doesn't have any guarantee), but if you want to create a Python interpreter that values memory usage above all else, one logical thing to do would indeed be to intern everything. ChrisA

Steven D'Aprano

9:22 a.m.

On Wed, Dec 04, 2019 at 07:11:39PM +1300, Greg Ewing wrote:

...

On 4/12/19 12:53 pm, Soni L. wrote:

...
Okay, sometimes it's also used for that. But the main use-case is for lowering RAM usage for immutable objects.

Citation needed. If that's true, why does Python intern names used in code, but not strings in general?

    py> s = "ab1234z"
    py> t = "ab1234z"
    py> s is t
    True

CPython doesn't *just* intern names. Nor does it intern every string. But it interns a lot of strings which aren't used as names, including some which cannot be used as names:

    py> a = "+"
    py> b = "+"
    py> a is b
    True

It also interns many ints, and they can't be used as names at all.

Here's a good explanation of interning in Python 2.7, including a great example of how interning strings can reduce memory usage by 68%.

http://guilload.com/python-string-interning/

-- Steven

Kyle Stanley

1:30 p.m.

...

It also interns many ints, and they can't be used as names at all.

To clarify on "many ints", integers in the range of -5 to 256 (inclusive) are interned in CPython. This can be demonstrated with the following:

```py
>>> a = 256
>>> b = 256
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False
>>> a = -5
>>> b = -5
>>> a is b
True
>>> a = -6
>>> b = -6
>>> a is b
False
```



Jonathan Fine

11:40 a.m.

Consider these two examples:

    >>> {0} == {0.0} == {False}
    True
    >>> hash(0) == hash(0.0) == hash(False)
    True
    >>> 0.0 in {False}
    True

    >>> class mystr(str): pass
    ...
    >>> 'hi' in {mystr('hi')}
    True

The original poster wants a way to obtain the actual object that is in the set, rather than just a truth value. This can be done in O(n) time, by iterating through the set. However, better is possible. Here are examples, followed by the implementation.

    >>> from hashhack import HashHack
    >>> HashHack(2) in {2}
    (<class 'int'>, 2)
    False
    >>> HashHack(2) in {2.0}
    (<class 'float'>, 2.0)
    False

Here's the implementation.

    <BEGIN>
    class HashHack:

        def __init__(self, obj):
            self.hash_obj = hash(obj)

        def __hash__(self):
            return self.hash_obj

        def __eq__(self, other):
            print((type(other), other))
            return False
    <END>

Looking at this URL helped me
https://stackoverflow.com/questions/3588776/how-is-eq-handled-in-python-and-...

-- Jonathan

Soni L.

2:36 p.m.

On 2019-12-11 7:40 a.m., Jonathan Fine wrote:

...


So you could do

    class Finder:
        def __init__(self, obj):
            self.obj = obj
        def __hash__(self):
            return self.obj.__hash__()
        def __eq__(self, other):
            res = self.obj == other
            if res:
                self.found_obj = other
            return res

    finder = Finder(x)
    if finder in foo:
        return finder.found_obj

?
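A self-contained version of that idea, wrapped in a hypothetical helper named `find_member`. It is a sketch that depends on the set's lookup ending up in the probe's `__eq__` with the stored member as the argument, which is what the HashHack output earlier in the thread shows for CPython with ints and floats. It may not work for stored objects whose own `__eq__` returns False for unknown types instead of NotImplemented.

```python
class Finder:
    """Probe that hashes like obj and records the set member it matched."""

    def __init__(self, obj):
        self.obj = obj
        self.found_obj = None

    def __hash__(self):
        return hash(self.obj)

    def __eq__(self, other):
        res = self.obj == other
        if res:
            # Remember the object that actually lives in the set.
            self.found_obj = other
        return res


def find_member(s, key):
    """Return the member of s equal to key, or raise KeyError."""
    finder = Finder(key)
    if finder in s:
        return finder.found_obj
    raise KeyError(key)


s = {2.0, "hi"}
print(repr(find_member(s, 2)))  # the stored float, located through the int 2
```

The lookup goes through the hash table, so it costs about the same as an ordinary membership test plus the Python-level `__eq__` call.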


Andrew Barnert

4:17 a.m.

On Dec 3, 2019, at 15:45, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...

On 4/12/19 7:26 am, Andrew Barnert via Python-ideas wrote:

...
If you’re using interning for functionality, to distinguish two equal strings that came from different inputs or processes, your code is probably broken.

That's not what interning is normally used for. Usually it's to allow test for equality to be replaced by tests for identity.

But why do you want to replace those tests? If it’s not for performance or for functionality, why do you care? (In C it could be about convenience/readability—strcmp is ugly and easy to get wrong—but that doesn’t apply to Python.)

Steven D'Aprano

5:44 a.m.

On Tue, Dec 03, 2019 at 10:26:35AM -0800, Andrew Barnert wrote:

...

If you’re using interning for functionality, to distinguish two equal strings that came from different inputs or processes, your code is probably broken.

That's not how interning works. The purpose of interning is to *remove* the distinction between values that come from different inputs, to guarantee that they are the same object. Not to distinguish them!

...

Python is allowed to merge distinct equal values of builtin immutable types whenever it wants to.

True. And so am *I*, the coder, but I have to do it myself, the language no longer has a built-in intern() function to help (and even when it did, it only worked on strings, not ints or floats or fractions or tuples of same).

...

And different interpreters, and even different CPython versions, may do that in different cases. That means any code that relies on the result of is on two equal immutable values is wrong.

No. That means any code that relies on the *interpreter* interning values in a particular way is wrong. If the code itself does its own interning, then it controls what gets interned and when, using whatever strategy makes sense for its own use. Why would you want to? Well, we already have at least one std lib memoisation decorator, `functools.lru_cache`, and that's sort of a kind of interning, so the idea is clearly not that preposterous. Whether it would be useful in practice is, as I already acknowledged, rather speculative.

...

You could try to optimize your code by interning a bunch of your strings and then using `a is b or a == b` instead of just `a == b`, but this will almost always make it slower, not faster.

I don't believe that assertion without evidence:

1. A lot of collections define element equality using an identity test first as an optimization (even if that means that they do the wrong thing when NANs are involved). So that's prima facie evidence that using `is` will be faster.

2. That also includes strings. Being able to do an `is` comparison is a major speed-up for large strings:

    $ ./python -m timeit -s "s = 'abcde'*1000000" -s "t = s" "s == t"
    1000000 loops, best of 5: 313 nsec per loop
    $ ./python -m timeit -s "s = 'abcde'*1000000" -s "t = s[0] + s[1:]" "s == t"
    20 loops, best of 5: 15.6 msec per loop

3. `is` is a pointer comparison handled by the interpreter as a single opcode; `==` is an operator which has to look up the object's class, look up its `__eq__` method, and call it. The overhead is much higher.

But in any case, the purpose of interning is not to encourage the coder to use `is`. Generally it is to save the time required to construct new instances (if possible), or at least save the memory required to hold lots of equal immutable instances.
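The command-line measurement in point 2 can be repeated from a script along these lines. This is a sketch with a smaller string and an arbitrary repetition count; the absolute numbers depend on the machine and the interpreter build, so none are claimed here.

```python
import timeit

s = "abcde" * 200_000
same = s               # the very same object
equal = s[0] + s[1:]   # equal value, separate object

# Lambdas add call overhead to both measurements equally.
t_same = timeit.timeit(lambda: s == same, number=200)
t_equal = timeit.timeit(lambda: s == equal, number=200)

print(f"identical object: {t_same:.6f} s for 200 comparisons")
print(f"equal copy:       {t_equal:.6f} s for 200 comparisons")
```

Comparing against the identical object can return as soon as the identity check succeeds, while the equal copy has to be compared character by character.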

...

...
The Python interpreter interns at least two kinds of objects: ints and strings, or rather, *some* ints and strings.

This is of course the CPython interpreter; different interpreters will be different.

Yes, you are correct, mea culpa. Anyway, I think I've said enough about interning. Without a good way to experiment, it's hard to say whether the idea would go anywhere or not, or whether it offers anything that lru_cache doesn't offer. -- Steven

Andrew Barnert

7:31 a.m.

On Dec 3, 2019, at 20:45, Steven D'Aprano <steve@pearwood.info> wrote:

...

1. A lot of collections define element equality using an identity test first as an optimization (even if that means that they do the wrong thing when NANs are involved). So that's prima facie evidence that using `is` will be faster.

No, that’s the point. If the type already does an identity test first, doing an extra identity test first will probably be slower. Certainly it will if they aren’t identical, but even if they are, there’s the cost of the extra “or” opcodes (at least if the type is a builtin).
