[Tutor] object representation

Thu Mar 4 15:22:52 CET 2010

spir wrote:
> Hello,
>
> In python like in most languages, I guess, objects (at least composite ones -- I don't know about ints, for instance -- someone knows?) are internally represented as associative arrays. Python associative arrays are dicts, which in turn are implemented as hash tables. Correct?
> Does this mean that the associative arrays representing objects are implemented like python dicts, thus hash tables?
>
> I was wondering about the question because I guess the constraints are quite different:
> * Dict keys are of any type, including heterogeneous (mixed). Object keys are names, ie a subset of strings.
> * Object keys are very stable, typically they hardly change; usually only values change. Dicts often are created empty and fed in a loop.
> * Objects are small mappings: entries are explicitely written in code (*). Dicts can be of any size, only limited by memory; they are often fed by computation.
> * In addition, dict keys can be variables, while object keys rarely are: they are literal constants (*).
>
> So, I guess the best implementations for objects and dicts may be quite different. i wonder about alternatives for objects, in particuliar trie and variants: http://en.wikipedia.org/wiki/Trie, because they are specialised for associative arrays which keys are strings.
>
> denis
>
> PS: Would someone point me to typical hash funcs for string keys, and the one used in python?
>
>   
http://effbot.org/zone/python-hash.htm
> (*) Except for rare cases using setattr(obj,k,v) or obj.__dict__[k]=v.
>   
Speaking without knowledge of the actual code implementing CPython (or 
for that matter any of the other dozen implementations), I can comment 
on my *model* of how Python "works."  Sometimes it's best not to know 
(or at least not to make use of) the details of a particular 
implementation, as your code is more likely to port readily to the next 
architecture, or even next Python version.

I figure every object has exactly three items in it:  a ref count, a 
implementation pointer, and a payload. The payload may vary between 4 
bytes for an int object, and about 4 megabytes for a list of size a 
million.  (And of course arbitrarily large for larger objects).

You can see that these add up to 12 bytes (in version 2.6.2 of CPython) 
for an int by using sys.getsizeof(92).  Note that if the payload refers 
to other objects, those sizes are not included in the getsizeof() 
function result.  So getsizeof(a list of strings) will not show the 
sizes of the strings, but only the list itself.

The payload for a simple object will contain just the raw data of the 
object.  So for a string, it'd contain the count and the bytes.  For 
compound objects that can change in size, it'd contain a pointer to a 
malloc'ed buffer that contains the variable-length data.  The object 
stays put, but the malloc'ed buffer may move as it size grows and 
shrinks.  getsizeof() is smart enough to report not only the object 
itself, but the buffer it references. Note that buffer is referenced by 
only one object, so its lifetime is intimately tied up with the object's.

The bytes in the payload are meaningless without the implementation 
pointer.   That implementation pointer will be the same for all 
instances of a particular type.  It points to a structure that defines a 
particular type (or class).  That structure for an empty class happens 
to be 452 bytes, but that doesn't matter much, as it only appears once 
per class.  The instance of an empty class is only 32 bytes.  Now, even 
that might seem a bit big, so Python offers the notion of slots, which 
reduces the size of each instance, at the cost of a little performance 
and a lot of flexibility.  Still, slots are important, because I suspect 
that's how built-ins are structured, to make the objects so small.

Now, some objects, probably most of the built-ins, are not extensible.  
You can't add new methods, or alter the behavior much.  Other objects, 
such as instances of a class you write, are totally and very flexible.  
I won't get into inheritance here, except to say that it can be tricky 
to derive new classes from built-in types.

So where do associative arrays come in?  One of the builtin types is a 
dictionary, and that is core to much of the workings of Python.  There 
are dictionaries in each class implementation (that 452 bytes I 
mentioned).  And there may be dictionaries in the instances themselves.  
There are two syntaxes to directly access these dictionaries, the "dot" 
notation and the bracket [] notation.  The former is a simple 
indirection through a special member called __dict__.

So the behavior of an object depends on its implementation pointer, 
which points to a  structure.  And parts of that structure ultimately 
point to C code whch does all the actual work.  But most of the work is 
some double- or triple-indirection which ultimately calls code.