[Tutor] object representation
davea at ieee.org
Thu Mar 4 15:22:52 CET 2010
> In python like in most languages, I guess, objects (at least composite ones -- I don't know about ints, for instance -- someone knows?) are internally represented as associative arrays. Python associative arrays are dicts, which in turn are implemented as hash tables. Correct?
> Does this mean that the associative arrays representing objects are implemented like python dicts, thus hash tables?
> I was wondering about the question because I guess the constraints are quite different:
> * Dict keys are of any type, including heterogeneous (mixed). Object keys are names, ie a subset of strings.
> * Object keys are very stable, typically they hardly change; usually only values change. Dicts often are created empty and fed in a loop.
> * Objects are small mappings: entries are explicitely written in code (*). Dicts can be of any size, only limited by memory; they are often fed by computation.
> * In addition, dict keys can be variables, while object keys rarely are: they are literal constants (*).
> So, I guess the best implementations for objects and dicts may be quite different. i wonder about alternatives for objects, in particuliar trie and variants: http://en.wikipedia.org/wiki/Trie, because they are specialised for associative arrays which keys are strings.
> PS: Would someone point me to typical hash funcs for string keys, and the one used in python?
> (*) Except for rare cases using setattr(obj,k,v) or obj.__dict__[k]=v.
Speaking without knowledge of the actual code implementing CPython (or
for that matter any of the other dozen implementations), I can comment
on my *model* of how Python "works." Sometimes it's best not to know
(or at least not to make use of) the details of a particular
implementation, as your code is more likely to port readily to the next
architecture, or even next Python version.
I figure every object has exactly three items in it: a ref count, a
implementation pointer, and a payload. The payload may vary between 4
bytes for an int object, and about 4 megabytes for a list of size a
million. (And of course arbitrarily large for larger objects).
You can see that these add up to 12 bytes (in version 2.6.2 of CPython)
for an int by using sys.getsizeof(92). Note that if the payload refers
to other objects, those sizes are not included in the getsizeof()
function result. So getsizeof(a list of strings) will not show the
sizes of the strings, but only the list itself.
The payload for a simple object will contain just the raw data of the
object. So for a string, it'd contain the count and the bytes. For
compound objects that can change in size, it'd contain a pointer to a
malloc'ed buffer that contains the variable-length data. The object
stays put, but the malloc'ed buffer may move as it size grows and
shrinks. getsizeof() is smart enough to report not only the object
itself, but the buffer it references. Note that buffer is referenced by
only one object, so its lifetime is intimately tied up with the object's.
The bytes in the payload are meaningless without the implementation
pointer. That implementation pointer will be the same for all
instances of a particular type. It points to a structure that defines a
particular type (or class). That structure for an empty class happens
to be 452 bytes, but that doesn't matter much, as it only appears once
per class. The instance of an empty class is only 32 bytes. Now, even
that might seem a bit big, so Python offers the notion of slots, which
reduces the size of each instance, at the cost of a little performance
and a lot of flexibility. Still, slots are important, because I suspect
that's how built-ins are structured, to make the objects so small.
Now, some objects, probably most of the built-ins, are not extensible.
You can't add new methods, or alter the behavior much. Other objects,
such as instances of a class you write, are totally and very flexible.
I won't get into inheritance here, except to say that it can be tricky
to derive new classes from built-in types.
So where do associative arrays come in? One of the builtin types is a
dictionary, and that is core to much of the workings of Python. There
are dictionaries in each class implementation (that 452 bytes I
mentioned). And there may be dictionaries in the instances themselves.
There are two syntaxes to directly access these dictionaries, the "dot"
notation and the bracket  notation. The former is a simple
indirection through a special member called __dict__.
So the behavior of an object depends on its implementation pointer,
which points to a structure. And parts of that structure ultimately
point to C code whch does all the actual work. But most of the work is
some double- or triple-indirection which ultimately calls code.
More information about the Tutor