[Tutor] Why is an instance smaller than the sum of its components?

Wed Feb 4 00:18:16 CET 2015

On Tue, Feb 03, 2015 at 10:12:09PM +0100, Jugurtha Hadjar wrote:
> Hello,
> 
> I was writing something and thought: Since the class had some 
> 'constants', and multiple instances would be created, I assume that each 
> instance would have its own data. So this would mean duplication of the 
> same constants?

Not necessarily. Consider:

class A(object):
    spam = 23
    def __init__(self):
        self.eggs = 42

In this case, the "spam" attribute is on the class, not the instance, 
and so it doesn't matter how many A instances you have, there is only 
one reference to 23 and a single copy of 23.

The "eggs" attribute is on the instance. That means that each instance 
has its own separate reference to 42. 
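
One way to check where each attribute lives is to look at vars(). This 
is only a small sketch that reuses the class A from above:

```python
class A(object):
    spam = 23              # stored once, on the class

    def __init__(self):
        self.eggs = 42     # stored on each instance


a = A()
b = A()

# "spam" is found via the class, so it is absent from the instance dicts.
print(vars(a))             # only the instance attributes
print("spam" in vars(A))   # the class namespace holds spam

# Both instances reach the very same object through the class.
print(a.spam is b.spam)
```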

Does that mean a separate copy of 42? Maybe, maybe not. In general, yes: 
if eggs were a mutable object like a list, or a dict, say:

        self.eggs = []

then naturally it would need to be a separate list for each instance. 
(If you wanted a single list shared between all instances, put it on the 
class.) But with immutable objects like ints, strings and floats, there 
is an optimization available to the Python compiler: it could reuse the 
same object. There would be a separate reference to that object per 
instance, but only one copy of the object itself.
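
The difference between a per-instance list and a list on the class can 
be sketched like this. The class names PerInstance and Shared are made 
up for the illustration:

```python
class PerInstance(object):
    def __init__(self):
        self.eggs = []     # a new list every time __init__ runs


class Shared(object):
    eggs = []              # one list, created when the class is defined


p1, p2 = PerInstance(), PerInstance()
s1, s2 = Shared(), Shared()

p1.eggs.append("x")
s1.eggs.append("x")

print(p2.eggs)   # p2 has its own, still empty, list
print(s2.eggs)   # s2 sees the append made through s1
```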

Think of references as being rather like C pointers. References are 
cheap, while objects themselves could be arbitrarily large.

With current versions of CPython, small integers (-5 through 256) are 
cached and re-used, and strings which look like identifiers are interned 
("alpha" looks like an identifier, "hello world!" does not). But that is 
subject to change: it is not a language promise, it is an implementation 
optimization.
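
If you want to experiment, something like the following may help. It 
assumes Python 3, where the function is sys.intern. Whether s1 and s2 
start out as the same object is up to the implementation, so no result 
is promised for that line:

```python
import sys

# Build two equal strings at run time so the compiler cannot merge them.
s1 = "".join(["hello", " world!"])
s2 = "".join(["hello", " world!"])

print(s1 == s2)    # equal values
print(s1 is s2)    # identity here is implementation-dependent

# sys.intern() returns one canonical object per string value.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
print(i1 is i2)
```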

However, starting with Python 3.3 (see PEP 412), Python optimizes even 
more! Instances of the same class can share the key table of their 
dictionaries, which saves even more memory. Each instance has a dict, 
which points to a hash table of (key, value) records:

<instance a of A>
 __dict__ ----> [ UNUSED UNUSED (ptr to key, ptr to value) UNUSED ... ]

<instance b of A>
 __dict__ ----> [ UNUSED UNUSED (ptr to key, ptr to value) UNUSED ... ]

For most classes, the instances a and b will have the same set of keys, 
even though the values will be different. That means the pointers to 
keys are all the same. So the new implementation of dict will optimize 
that case to save memory and speed up dictionary access.
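
You can get a rough feel for this by comparing an instance's __dict__ 
with an ordinary dict holding the same items. This is a sketch only: the 
Point class is invented for the example, and the numbers depend entirely 
on your Python version and platform.

```python
import sys


class Point(object):          # hypothetical example class
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
plain = {"x": 1, "y": 2}      # an ordinary dict with the same contents

instance_dict_size = sys.getsizeof(vars(p))
plain_dict_size = sys.getsizeof(plain)

# With key-sharing dicts the first number may be smaller,
# but neither value is guaranteed.
print(instance_dict_size, plain_dict_size)
```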

> If so, I thought why not put the constants in memory 
> once, for every instance to access (to reduce memory usage).
> 
> Correct me if I'm wrong in my assumptions (i.e: If instances share stuff).

In general, Python will share stuff if it can, although maybe not 
*everything* it can.

> So I investigated further..
> 
> >>> import sys
> >>> sys.getsizeof(5)
> 12
> 
> 
> So an integer on my machine is 12 bytes.

A *small* integer is 12 bytes. A large integer can be more:

py> sys.getsizeof(2**100)
26
py> sys.getsizeof(2**10000)
1346
py> sys.getsizeof(2**10000000)
1333346
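
To watch the size grow with the magnitude of the number, you could try 
a loop like this one. The exact byte counts will differ between 32-bit 
and 64-bit builds, so none are shown:

```python
import sys

for exponent in (0, 100, 10000):
    n = 2 ** exponent
    # bit_length() gives the number of bits needed to represent n.
    print(exponent, n.bit_length(), sys.getsizeof(n))
```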

> Now:
> 
> >>> class foo(object):
> ...	def __init__(self):
> ...		pass
> 
> >>> sys.getsizeof(foo)
> 448		
> 
> >>> sys.getsizeof(foo())
> 28
> 
> >>> foo
> <class '__main__.foo'>
> >>> foo()
> <__main__.foo object at 0xXXXXXXX>

The *class* Foo is a fairly large object. It has space for a name, a 
dictionary of methods and attributes, a tuple of base classes, a 
table of weak references, a docstring, and more:

py> class Foo(object):
...     pass
...
py> dir(Foo)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', 
'__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', 
'__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', 
'__qualname__', '__reduce__', '__reduce_ex__', '__repr__', 
'__setattr__', '__sizeof__', '__str__', '__subclasshook__', 
'__weakref__']
py> vars(Foo)
mappingproxy({'__qualname__': 'Foo', '__module__': '__main__', 
'__doc__': None, '__weakref__': <attribute '__weakref__' of 'Foo' 
objects>, '__dict__': <attribute '__dict__' of 'Foo' objects>})
py> Foo.__base__
<class 'object'>
py> Foo.__bases__
(<class 'object'>,)

The instance may be quite small, but of course that depends on how many 
attributes it has. Typically, all the methods live in the class, and are 
shared, while data attributes are per-instance.
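
Here is a small sketch of that split between shared methods and 
per-instance data. Greeter is a made-up class for the illustration:

```python
class Greeter(object):            # hypothetical example class
    def __init__(self, name):
        self.name = name          # data attribute, stored per instance

    def greet(self):              # method, stored once on the class
        return "Hello, " + self.name


g1 = Greeter("Ann")
g2 = Greeter("Bob")

print(vars(g1))                   # just the data attributes
print("greet" in vars(Greeter))   # the method is in the class namespace
print(g1.greet.__func__ is g2.greet.__func__)   # same underlying function
```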

> - Second weird thing:
> 
> >>> class bar(object):
> ...	def __init__(self):
> ...		self.w = 5
> ...		self.x = 6
> ...		self.y = 7
> ...		self.z = 8
> 
> >>> sys.getsizeof(bar)
> 448
> >>> sys.getsizeof(foo)
> 448

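Nothing weird here. sys.getsizeof reports only the class object itself, 
and both your foo and bar classes contain the same attributes. The only 
difference is that the foo.__init__ method does nothing, while 
bar.__init__ has some code in it, and that code lives in a separate 
function object.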

If you call

sys.getsizeof(foo.__init__.__code__)

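and compare it to the same for bar, you may see a difference, depending 
on the Python version. Comparing len(foo.__init__.__code__.co_code) with 
the same for bar always shows one, since that is the compiled bytecode 
itself.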
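
Spelled out in full, the comparison might look like this. It redefines 
your two classes so that it runs on its own, and also prints the length 
of the bytecode. The reported sizes depend on the Python version, so no 
output is shown:

```python
import sys


class foo(object):
    def __init__(self):
        pass


class bar(object):
    def __init__(self):
        self.w = 5
        self.x = 6
        self.y = 7
        self.z = 8


for cls in (foo, bar):
    code = cls.__init__.__code__
    # co_code is the compiled bytecode; its length grows with the method body.
    print(cls.__name__, sys.getsizeof(code), len(code.co_code))
```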

> >>> sys.getsizeof(bar())
> 28
> >>> sys.getsizeof(foo())
> 28

In this case, the foo and bar instances both have the same size. They 
both have a __dict__, and the foo instance's __dict__ is empty, while 
the bar instance's __dict__ has 4 items. But the __dict__ is a separate 
object: sys.getsizeof on the instance counts only the instance itself, 
which just holds a reference to its __dict__. Print:

print(foo().__dict__)
print(bar().__dict__)

to see the difference. Even the two dicts may turn out to be the same 
size: with only 4 items, bar's items will fit in the default sized hash 
table, so no resize is triggered. Run this little snippet of code to see 
what happens as a dict grows:

import sys

d = {}
for c in "abcdefghijklm":
    # Show the number of items and the dict's size before each insertion.
    print(len(d), sys.getsizeof(d))
    d[c] = None

> Summary questions:
> 
> 1 - Why are foo's and bar's class sizes the same? (foo's just a nop)

Foo is a class, it certainly isn't a NOP. Just because you haven't given 
it state or behaviour doesn't mean it doesn't have any. It has the 
default state and behaviour that all classes start off with.

> 2 - Why are foo() and bar() the same size, even with bar()'s 4 integers?

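Because sys.getsizeof(bar()) does not include the instance's __dict__, 
only the reference to it. And the dict itself has room to spare: hash 
tables (dicts) contain empty slots, and in CPython a resize is triggered 
only once the table becomes about two-thirds full.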
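
To look at the instance and its __dict__ separately, you could run 
something like this. It repeats your two class definitions so that it is 
self-contained, and the numbers will vary by version and platform:

```python
import sys


class foo(object):
    def __init__(self):
        pass


class bar(object):
    def __init__(self):
        self.w = 5
        self.x = 6
        self.y = 7
        self.z = 8


for obj in (foo(), bar()):
    # The first number is the instance alone; the second is its __dict__.
    print(type(obj).__name__, sys.getsizeof(obj), sys.getsizeof(vars(obj)))
```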

> 3 - Why's bar()'s size smaller than the sum of the sizes of 4 integers?

Because sys.getsizeof tells you the size of the object, not the objects 
referred to by the object. Here is a recipe for a recursive getsizeof:

http://code.activestate.com/recipes/577504
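
If you just want the general idea, here is a much simplified sketch. It 
is not the recipe above: total_size is a name I made up, it only follows 
dicts, lists, tuples, sets and instance __dict__s, and shared objects 
(such as small ints) are counted once per call.

```python
import sys


def total_size(obj, seen=None):
    """Approximate size of obj plus the objects it refers to."""
    if seen is None:
        seen = set()
    if id(obj) in seen:           # count each object only once
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += total_size(key, seen) + total_size(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += total_size(item, seen)
    if hasattr(obj, "__dict__"):  # follow instance attributes too
        size += total_size(vars(obj), seen)
    return size


class Bar(object):
    def __init__(self):
        self.w = 5
        self.x = 6
        self.y = 7
        self.z = 8


b = Bar()
print(sys.getsizeof(b), total_size(b))
```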

-- 
Steve