What other languages use the same data model as Python?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon May 9 10:29:04 EDT 2011


On Mon, 09 May 2011 12:52:27 +1200, Gregory Ewing wrote:

> Steven D'Aprano wrote:
> 
>> Since you haven't explained what you think is happening, I can only
>> guess.
> 
> Let me save you from guessing. I'm thinking of a piece of paper with a
> little box on it and the name 'a' written beside it. There is an arrow
> from that box to a bigger box.
> 
>                             +-------------+
>       +---+                 |             |
>     a | --+---------------->|             |
>       +---+                 |             |
>                             +-------------+
> 
> There is another little box labelled 'b'. After executing 'a = b', both
> little boxes have arrows pointing to the same big box.
[...]
> In this model, a "reference" is an arrow. Manipulating references
> consists of rubbing out the arrows and redrawing them differently.

All very good, but that's not what takes place at the level of Python 
code. It's all implementation. I think Hans Georg Schaathun made a good 
objection to the idea that "Python has references":

    In Pascal a pointer is a distinct data type, and you can 
    have variables of a given type or of type pointer to that 
    given type. That makes the pointer a concrete concept 
    defined by the language.

The same can't be said of "references" in Python. It's not part of Python 
the language, although it might be part of Python's implementation.
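To make that concrete: in Pascal you can declare a variable of type "pointer to integer", but nothing in Python ever exposes a reference as a value with its own type. This is just an illustrative sketch, not anything deep:

```python
a = 42
b = a
# Inspecting either name yields the type of the object itself; there is
# no "reference to int" type anywhere in the language.
print(type(a) is int)   # True
print(type(b) is int)   # True
# Nor is there an address-of or dereference operator. The closest thing
# is id(), which reports an implementation-defined identity:
print(a is b)           # True -- both names are bound to the same object
```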



> Also
> in this model, a "variable" is a little box. It's *not* the same thing
> as a name; a name is a label for a variable, not the variable itself.

That's essentially the same model used when dealing with pointers. I've 
used it myself, programming in Pascal. The "little box" named a or b is 
the pointer variable, and the "big box" is the data that the pointer 
points to.

It's not an awful model for Python: a name binding a = obj is equivalent 
to sticking a reference (a pointer?) in box a that points to obj. 
Certainly there are advantages to it.

But one problem is, the model is ambiguous with b = a. You've drawn 
little boxes a and b both pointing to the big box (which I deleted for 
brevity). But surely, if a = 1234 creates a reference from a to the big 
box 1234, then b = a should create a reference from b to the box a?

                             +-------------+
       +---+                 |             |
     a | --+---------------->|             |
       +---+                 |             |
         ^                   +-------------+
         |
       +-|-+
     b | | |
       +---+

which is the reference (pointer) model as most people would recognise it. 
That's how it works in C and Pascal (well, at least with the appropriate 
type declarations). To get b pointing to the big box, you would need an 
explicit dereference: "b = whatever a points to" rather than "b = a".
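A short session shows which of the two readings Python actually uses: `b = a` behaves like the *dereferenced* assignment, so b ends up bound to the big box, and later rebinding a leaves b alone:

```python
a = [1, 2, 3]
b = a              # in the pointer model this reads "b = whatever a points to"
print(b is a)      # True -- both names are bound to the one list

a = [4, 5, 6]      # rebinding a rubs out a's arrow and redraws it...
print(b)           # [1, 2, 3] -- ...but b's arrow is untouched
```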

Of course, both of these concepts are models, which is another word for 
"lies" *wink*. Your model is closer to what the CPython implementation 
actually does, using actual C pointers, except of course you do need to 
dereference the pointers appropriately. One of my objections to it is not 
that it is wrong (all models are wrong) but that it will mislead some 
people to reason incorrectly about Python's behaviour, e.g. that b now 
points to the little box a, and therefore if you change what a points to, 
b will follow along. The whole "call by reference" thing. I suppose you 
might argue that you're not responsible for the misunderstandings of 
blinkered C coders *wink*, and there's something to that.
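For the record, here is the misunderstanding in executable form, with hypothetical function names chosen just for this sketch:

```python
def rebind(x):
    # Assigning to the parameter rebinds the *local* name x only;
    # it does not reach back and rebind the caller's name.
    x = "changed"

a = "original"
rebind(a)
print(a)           # original -- a does not "follow along"

def mutate(lst):
    # Mutating the object a name is bound to is a different story:
    # both names see the change, because there is only one object.
    lst.append(99)

stuff = [1, 2]
mutate(stuff)
print(stuff)       # [1, 2, 99]
```

That is, Python gives you neither call by value (no copy is made) nor call by reference (the caller's name can't be rebound); it passes the object itself, by whatever name you like.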

But there's another objection... take, say, the line of Python code:

n = len('hello world')

I can identify the little box "n", which ends up pointing to the big box 
holding int 11; another little box "len", which points to a big box 
holding a function; and a third big box holding the string 'hello world'. 
But where is its little box?

If len were written in pure Python, then *inside* len's namespace there 
would be a local little box for the argument. I expect that there is an 
analogous local little box for built-in functions too. But I don't care 
what happens inside len. What about outside len? Where's the little box 
pointing to 'hello world'?

So it seems your model fails to deal with sufficiently anonymous objects. 
I say "sufficiently", because of course your model deals fine with 
objects without names inside, say, lists: the little box there is the 
list slot rather than a named entry in a namespace. 

It's not just literals that your model fails to deal with, it's any 
expression that isn't bound to a little box:

n = len('hello world') + func(y)

func(y) produces a new object, a big box. Where is the little box 
pointing to it?

If we drop down an abstraction layer, we can see where objects live:

>>> code = compile("n = len('hello world') + func(y)", '', 'single')
>>> import dis
>>> dis.dis(code)
  1           0 LOAD_NAME                0 (len)
              3 LOAD_CONST               0 ('hello world')
              6 CALL_FUNCTION            1
              9 LOAD_NAME                1 (func)
             12 LOAD_NAME                2 (y)
             15 CALL_FUNCTION            1
             18 BINARY_ADD
             19 STORE_NAME               3 (n)
             22 LOAD_CONST               1 (None)
             25 RETURN_VALUE


Both the call to len and the call to func push their results onto the 
stack. There's no little box pointing to the result. There's no little 
box pointing to len, or y, or any of the others: there are just names and 
abstract operations LOAD_NAME and friends.

Again, this is just an abstraction layer. Python objects can be huge, 
potentially hundreds of megabytes or gigabytes. No way are they being 
pushed and popped onto a stack, even if the virtual machine gives the 
illusion that they are. For practical reasons, there must be some sort of 
indirection. But that's implementation and not the VM's model.
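You can see that indirection from Python code, in a rough-and-ready way: pass a sizeable object through a function and observe that the very same object comes back, untouched and uncopied. (A sketch only; id() is implementation-defined.)

```python
def passthrough(obj):
    # Calling a function pushes a reference to obj onto the VM stack,
    # not a copy of obj itself.
    return obj

big = list(range(1000000))      # a sizeable object
print(passthrough(big) is big)  # True -- the object itself never moved
print(id(passthrough(big)) == id(big))  # True -- same identity throughout
```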



> It seems that you would prefer to eliminate the little boxes and arrows
> and write the names directly beside the objects:
> 
>                             +-------------+
>                           a |             |
>                             |             |
>                           b |             |
>                             +-------------+
> 
>                                  +-------------+
>                                c |             |
>                                  |             |
>                                  |             |
>                                  +-------------+


That's not a bad model. But again, it's incomplete, because it would 
suggest that the big box should be able to read its own name tags, which 
of course is impossible in Python. But I suppose one might say, if the 
tag is on the outside of the object, the object can't use introspection 
to see it, can it?

But note that this is really your model in disguise: if you imagine the 
name tags are stuck on with a little piece of blutack, and you carefully 
pull on the name and stretch it away, you get a name sitting over here 
with a tiny thread of blutack attaching it to the big box over there, 
just like in your model.

I actually prefer to keep the nature of the mapping between name and 
object abstract: names map to objects. Objects float around in space, 
wherever the interpreter happens to put them, and you can optionally give 
them names. Objects are dumb and don't know their own name, but the 
Python interpreter knows the names. Names are not the only way to ask the 
interpreter for an object: e.g. you can put them in a list, and ask for 
them by position.
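For instance (a toy class, just for illustration):

```python
class Thing(object):
    pass

t = Thing()          # one object, reachable by name...
box = [t]            # ...and now also reachable by position
print(box[0] is t)   # True -- same object, two ways to ask for it

t.label = "hello"    # mutate via the name...
print(box[0].label)  # hello -- ...and the change shows via the slot
```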

If people then ask, how does the interpreter know the names?, I can add 
more detail: names are actually strings in a namespace, which is usually 
nothing more than a dict. Oh, and inside functions, it's a bit more 
complicated still. And so on.
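At module level you can even catch the interpreter in the act (this sketch assumes it runs at module scope, where globals() is the namespace in question):

```python
x = 23
ns = globals()   # the module namespace: a plain dict
print(ns['x'])   # 23 -- the name is literally a string key in a dict

ns['y'] = 42     # binding a name by poking the dict directly
print(y)         # 42 -- indistinguishable from having written "y = 42"
```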

There is a problem with my model of free-floating objects in space: it 
relies on objects being able to be in two places at once, even *inside* 
themselves (you can append a list to itself). If you hate that concept, 
you'll hate my model. But if you're a science fiction fan from way back, 
then you won't have any problem with the idea that objects can be inside 
themselves:

http://www.youtube.com/watch?v=51JtuEa_OPc

Remember: it's just a model, and all models are lies. Abstractions all 
leak. You can only choose where and how badly they break down.

Now, that's a good challenge for your model. Little boxes only point to 
big boxes. So how do you model cycles, including lists that contain 
themselves?
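In case anyone doubts that Python really permits this:

```python
L = [1, 2]
L.append(L)          # the list now contains itself
print(L[2] is L)     # True -- the third slot refers back to L
print(L[2][2] is L)  # True -- and so on, all the way down

# repr() knows how to cope, printing '[...]' where the cycle recurs:
print(L)             # [1, 2, [...]]
```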



> But what would you do about lists? With little boxes and arrows, you can
> draw a diagram like this:
> 
>       +---+      +---+
>     a | --+----->|   |      +-------------+
>       +---+      +---+      |             |
>                  | --+----->|             |
>                  +---+      |             |
>                  |   |      +-------------+
>                  +---+
>
> (Here, the list is represented as a collection of variables. That's why
> variables and names are not the same thing -- the elements of the list
> don't have textual names.)

But that's wrong! Names (little boxes) can't point to *slots in a list*, 
any more than they can point to other names! This doesn't work:


>>> L = [None, 42, None]
>>> a = L[0]
>>> L[0] = 23
>>> print(a)  # a was bound to the object in the slot, not to the slot
None


It's a pity that Python doesn't actually have references. Imagine how 
much time we'd save: all the unproductive time we spend arguing about 
whether Python has references, we could be fixing bugs caused by the use 
of references instead...


> But without any little boxes or arrows, you can't represent the list
> itself as a coherent object. You would have to go around and label
> various objects with 'a[0]', 'a[1]', etc.
> 
>                             +-------------+
>                        a[0] |             |
>                             |             |
>                             |             |
>                             +-------------+
> 
>                                  +-------------+
>                             a[1] |             |
>                                  |             |
>                                  |             |
>                                  +-------------+
> 
> This is not very satisfactory. If the binding of 'a' changes, you have
> to hunt for all your a[i] labels, rub them out and rewrite them next to
> different objects. It's hardly conducive to imparting a clear
> understanding of what is going on, whereas the boxes-and-arrows model
> makes it instantly obvious.

But I wouldn't do it like that. I'd do it like this:

      0        1        2        3        4
    +--------+--------+--------+--------+--------+
  a |        |        |        |        |        |
    |        |        |        |        |        |
    |        |        |        |        |        |
    +--------+--------+--------+--------+--------+

which conveniently happens to be the way Python lists actually are set 
up. More or less.
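And note that those slots, like the little boxes, hold references, which is why one object can happily fill several of them at once:

```python
a = [[]] * 3       # three slots, but only ONE empty list object
print(a[0] is a[1] is a[2])  # True -- each slot refers to that one list

a[0].append('x')   # mutate through one slot...
print(a)           # [['x'], ['x'], ['x']] -- ...and all three slots "change"
```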



[...]
> Finally, there's another benefit of considering a reference to be a
> distinct entity in the data model. If you think of the little boxes as
> being of a fixed size, just big enough to hold a reference, then it's
> obvious that you can only bind it to *one* object at a time. Otherwise
> it might appear that you could draw more than one arrow coming out of a
> name, or write the same name next to more than one object.

But that's pretty much an arbitrary restriction. Why are the boxes so 
small? Just because. Why can't you carefully tease the thread of blutack 
apart, into a bifurcated Y shaped thread? Just because.

If objects can be in two places at once, why can't names? Just because.

(Actually, because Guido decrees it so. Also because it would be kinda 
crazy to do otherwise. I can't think of any language that has a many:many 
mapping between names and values.)



-- 
Steven
