Interning own classes like strings for speed and size?

Mon Dec 27 18:19:00 EST 2010

On 27 December 2010 22:05, Ulrich Eckhardt <doomster at knuut.de> wrote:

> What I'm now considering is to only allow a single instance of these
> objects
> for each set of values, similar to interned strings. What I would gain is
> that I could safely compare objects for identity instead of equality. What
> I'm not yet sure is how much overhead that would give me and/or how to keep
> it low. The idea is to store each instance in a set and after creating a
> new
> object I would first look up an equal object in the global set and return
> that instead, otherwise add the new one.
>
> The problem I foresee is that if I define equality as identity, this lookup
> when creating will never eliminate duplicates. If I only fall back to
> equality comparison for non-identical objects, I would probably sacrifice
> most of the gain. If I build a dict mapping between the values and the
> actual objects, I would have doubled the required memory and uselessly
> store
> the same values twice there.
>

The first thing to deal with the equality check. The way this is generally
done is to first do an identity check, then if that fails fall back to an
equality check. This gives you a fast path for the normal case, but still
gives full equality checks on a slow path.

Your assumption of double storage for a dict is somewhat flawed if I
understand you correctly. The mapping:

(value1, value2, ...) => my_object(value1, value2, ...)

*could* result in value1, value2, ... being created and stored twice (hence
the possibility of double storage) and the mapping tuple being stored + your
object. However, if the key and value are the same object, there is only a
single additional reference being stored (within the dict structure of
course).

The way you should probably deal with this is to always create one of your
objects for doing the lookup. Then your algorithm is:

new_object = my_object(value1, value2, ...)

try:
    canonical = canonical_dict[new_object]
except KeyError:
    canonical = canonical_dict[new_object] = new_object

You'd have to structure your __new__ appropriately to do it there, but it is
possible assuming that everything you need for equality testing is done in
__new__.

If you further want to reduce storage (if it's an issue) you could also
canonicalise the values themselves using a similar technique. You could even
use the same canonicalisation dictionary so long as you could ensure that
none of the different types compare equal (e.g. floats and integers). Note
that as an implementation detail the integers -5...256 are already interned,
but you can't rely on that (the range has changed over time).

Tim Delaney
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20101228/5f62762b/attachment.html>