[Tutor] Re: When was interned strings dropped (and why)?

Fri Jan 30 20:44:02 EST 2004

On Fri, 30 Jan 2004, Isr Gish wrote:

> I find that only int's and string's have the same id when they are
> identical. But list's, tuple's and dictionary's have different id's even
> if theybare identical.

Hi everyone,

[Warning: somewhat fuzzy discussion about identity/equality/intern
follows.  Ask questions if it gets confusing.]

I think people are getting confused between "identity" and "equality".
"Identity" is a much more strict definition than "equality".  Equality
means that two things look the same when we repr() them out.

In Python, when we talk about "identity" between two objects, we mean that
those two objects live in the same place in memory.

Let's say that we have two strings:

###
>>> s1 = "hello world"
>>> s2 = "hello" + " " + "world"
###

Now it's very true that s1 and s2 look the same, so we'd consider them
"equal":

###
>>> s1
'hello world'
>>> s2
'hello world'
>>> s1 == s2
1
###

Despite this, they live in different locations in our computer's memory.

###
>>> id(s1)
135629440
>>> id(s2)
135629920
###

And because of this, 's1' and 's2' are not "identical", even though they
are "equal".  We can use the 'is' operator to check this strict notion of
'identity'.

###
>>> s1 is s2
0
###

[Note: if you have Java experience, be careful not to get Java and Python
confused.  Python's '==' operation is equivalent to the
java.lang.Object.equals() method, and Python's 'is' operator is equivalent
to Java's '==' operator.  So the concepts are the same, but the syntax
used in the two languages is slightly flipped.  *grin*]

So when is it possible for two things to be 'identical'?  One way is to
make a binding to an existing thing.  For example, we can make another
binding to the thing that s1 is pointed to:

###
>>> s1_rebound = s1
>>> id(s1_rebound)
135629440
###

Let's look at this as a picture with names and arrows:

                       s1_rebound

                           |
                           |
                           V

     s1 -----------> "hello world" (at location #135629440)

     s2 -----------> "hello world" (at location #135629920)

Here, 's1' and 's1_rebound' are directed at the same thing, so they are
both 'equal' and 'identical'.

###
>>> s1_rebound is s1
1
###

(If two things are identical, they have to be equal.  This point is
important a few paragraphs below.)

So that's one way to get two names to point to the same thing.  A
variation on this is to keep a table of existing objects in some
container.  Whenever we're about to make a new thing, we can check our
container, and see if it's already been made before.  What this allows us
to do is to squash any duplicates.

Here's an example that shows the technique:

###
>>> cache = {}
>>> def my_intern(s):
...     if s not in cache:
...         cache[s] = s
...     return cache[s]
...
>>> s1_interned = my_intern(s1)
>>> s2_interned = my_intern(s2)
>>> s1_interned, s2_interned
('hello world', 'hello world')
>>> s1_interned is s2_interned
1
###

This is a homebrewed version of what intern() does for us: it keeps a
global cache of strings that it has already seen, so that we can collapse
duplicate strings away by calling intern() on any string that comes at us.

But why do we care about doing intern() on strings at all?  One strong
reason is because if two strings are identical, that implies that they are
equal.  And comparing two objects for 'identity' is very cheap: we just
look at where the two objects live.

###
>>> id(s1_interned)
135629440
>>> id(s2_interned)
135629440
###

If they are in the same physical space, they have to be equal.  And if
they're not in the same physical space, they might be equal... But if we
guarantee that all strings have passed through intern(), then equality
checking is very cheap.

"Equality" checking, on the other hand, might be expensive.  Think of what
happens if we give Python two strings:

###
>>> "abracadabra" == "abracadabru"
0
###

Of course we know that these strings aren't equal, but how does Python
know this?  If Python sees something like:

    "abracadabra" == "abracadabru"

then what does it need to do to check this?  It might actually need to
do more work than we'd expect, something like:

    ('a' == 'a' and
     'b' == 'b' and
     'r' == 'r' and
     'a' == 'a' and
     'c' == 'c' and
     'a' == 'a' and
     'd' == 'd' and
     'a' == 'a' and
     'b' == 'b' and
     'r' == 'r' and
     'a' == 'u')
###

And that's one reason why people will sometimes embrace all this
complexity of identity vs equality and intern(): for the sake of making
equality comparison fast, to avoid the character-by-character comparisons.

This being said, I've seldom done anything with intern().  *grin* intern()
adds enough complexity to a program that it's often not worth worrying
about it unless we know that we need it.  In the casual case, string
comparison isn't that expensive.

And it's all too easy to mess things up by intern()ing every string we
see: remember that intern() has to keep a container of every unique string
that it has ever seen.  There are a heck of a lot of unique strings out
there.  *grin*

Hope this helps!