[Tutor] How Python handles data (was guess-my-number programme)

Tue Sep 27 17:56:48 CEST 2011

Wayne Werner wrote:

> When you do something like this in C:
> 
> int x = 0;
> int y = 0;
> 
> What you have actually done behind the scenes is allocated two bytes of
> memory(IIRC that's in the C spec, but I'm not 100% sure that it's guaranteed
> to be two bytes). Perhaps they are near each other, say at addresses
> 0xab0fcd and 0xab0fce. And in each of these locations the value of 0 is
> stored.

The amount of memory will depend on the type of the variable. In C, you 
have to declare what type the variable will be. The compiler then knows 
how much space to allocate for it.

> When you create a variable, memory is allocated, and you refer to that
> location by the variable name, and that variable name always references that
> address, at least until it goes out of scope. So if you did something like
> this:
> 
> x = 4;
> y = x;
> 
> Then x and y contain the same value, but they don't point to the same
> address.

Correct, at least for languages like C or Pascal that have the "memory 
location" model for variables.

> In Python, things are a little bit more ambiguous because everything is an
> object.  So if you do this:

No. There is nothing ambiguous about it, it is merely different from C. 
The rules are completely straightforward and defined exactly.

Also, the fact that Python is object oriented is irrelevant to this 
question. You could have objects stored and referenced at memory 
locations, like in C, if the language designer wanted it that way.

> x = 4
> y = x
> 
> Then it's /possible/ (not guaranteed) that y and x point to the same memory
> location. You can test this out by using the 'is' operator, which tells you
> if the variables reference the same object:

The second half of your sentence is correct: you can test it with the 
'is' operator. But the first half is wrong. Given the two assignments 
shown, x=4 and y=x, it *is* guaranteed that x and y will both reference 
the same object. That is a language promise made by Python: assignment 
never makes a copy. So if you have

x = 4

and then you do

y = x

the language *promises* that x and y now are names for the same object. 
That is, "x is y" will return True, or id(x) == id(y).
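
The promise is easiest to see with a mutable object, since a change made 
through one name is then visible through the other. This is only a small 
sketch, and the list and the names are made up for illustration:

```python
# Assignment binds a second name to the same object; nothing is copied.
x = [1, 2, 3]
y = x

same_object = x is y          # both names refer to one list
same_id = id(x) == id(y)      # the same check, spelled with id()
print(same_object, same_id)   # True True

y.append(4)                   # mutate the list through the name y
print(x)                      # [1, 2, 3, 4]
```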

However, what is not promised is the behaviour of this:

x = 4
y = 4

In this case, you are doing two separate assignments where the right 
hand side is given by a literal which merely happens to be the same. The 
compiler is free to either create two separate objects, both with value 
4, or just one. In CPython's case, it reuses some small numbers, but not 
larger ones:

 >>> x = 4
 >>> y = 4
 >>> x is y
True
 >>> x = 40000
 >>> y = 40000
 >>> x is y
False

CPython caches a small range of integers (from -5 to 256 in recent 
versions), although the exact range depends on which version of CPython 
you are using.

The reason for caching small integers is that it is faster to look them 
up in the cache than to create a new object each time; but the reason 
for only caching a handful of them is that the cache uses memory, and 
you wouldn't want billions of integers being saved for a rainy day.
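
If you want to poke at this yourself, here is a hedged sketch. It builds 
the integers with int() at run time, on the assumption that this stops 
the compiler from merging equal literals. Only the equality results are 
guaranteed; the identity results are whatever your implementation 
happens to do, so I have not written them down.

```python
# Build the integers at run time rather than from literals.
small_a = int("4")
small_b = int("4")
large_a = int("40000")
large_b = int("40000")

# Equality is guaranteed by the language.
print(small_a == small_b, large_a == large_b)   # True True

# Identity is an implementation detail: do not rely on either result.
print(small_a is small_b)
print(large_a is large_b)
```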

>>>> x = 4
>>>> y = x
>>>> x is y
> True
> 
> But this is not guaranteed behavior - this particular time, python happened
> to cache the value 4 and set x and y to both reference that location.

As I've said, this is guaranteed behaviour. Furthermore, you shouldn't 
think about objects (or "variables") in Python as having locations. 
Of course, in reality they do, since it would be impossible -- or at 
least amazingly difficult -- to design a programming language without 
the concept of memory location. But as far as *Python* is concerned, 
rather than the underlying engine that makes Python go, variables don't 
have locations in any meaningful sense.

Think of objects in Python as floating in space, rather than lined up in 
nice rows with memory addresses. From Python code, you generally can't 
tell what address an object is at, and even where you can, you can't do 
anything with the knowledge.

Some implementations, such as CPython, expose the address of an object 
as the id(). But you can't do anything with it; it's just a number. And 
other implementations, such as Jython and IronPython, don't do that. 
Instead, every object gets a unique number, starting from 1 and counting 
up. If an object is deleted, the id doesn't get reused in Jython and 
IronPython (unlike CPython).
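
What you *can* rely on, whatever the implementation, is sketched below: 
the id is an integer, two objects that exist at the same time have 
different ids, and comparing ids agrees with 'is'. The names are just 
placeholders.

```python
a = object()
b = object()
alias = a

# Two objects that are alive at the same time never share an id.
print(id(a) != id(b))          # True

# "is" and comparing ids give the same answer for live objects.
print(alias is a)              # True
print(id(alias) == id(a))      # True

# The id is just an int; what it means is up to the implementation.
print(type(id(a)).__name__)    # int
```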

Unlike the C "memory address" model, Python's model is of "name 
binding". Every object can have zero, one, or more names:

print([])  # Here, the list has no name.
x = []  # Here, the list has a single name, "x"
x = y = []  # Here, the list has two names, "x" and "y".

In practice, Python uses a dictionary to map names to objects. That 
dictionary is exposed to the user using the function globals().
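
Here is a rough sketch of that mapping. Rather than rummaging through 
the real globals(), it runs an assignment inside a fresh dictionary 
using exec(); the dictionary name "namespace" is my own choice, and 
globals() returns the same sort of dictionary for the current module.

```python
# Run an assignment inside an explicit namespace dictionary.
namespace = {}
exec("x = y = []", namespace)

# exec() also adds a "__builtins__" entry, so filter that out.
names = sorted(n for n in namespace if not n.startswith("__"))
print(names)                              # ['x', 'y']

# Both names map to the very same list object.
print(namespace["x"] is namespace["y"])   # True

# Binding another name is just adding another key to the dictionary.
namespace["z"] = namespace["x"]
namespace["z"].append(1)
print(namespace["x"])                     # [1]
```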

The main differences between "memory location" variables and "name 
binding" variables are:

(1) Memory locations are known by the compiler at compile-time, but only 
at run-time for name binding languages. In C-like languages, if I say:

x = 42
print x

the compiler knows to store 42 into location 123456 (say), and then have 
the print command look at location 123456. But with name-binding, the 
compiler doesn't know what location 42 will actually end up at until 
run-time. It might be anything.

(2) Memory location variables must have a fixed size, while name-binding 
can allow variables to change size.

(3) Memory location variables must copy on assignment: x = 4; y = x 
makes a copy of x to store in y, since x and y are different variables 
and therefore different locations. Name-binding though, gives the 
language designer a choice to copy or not.
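
Points (2) and (3) can be sketched in Python itself. This is only an 
illustration with made-up values: the name x is rebound to objects of 
quite different sizes, assignment does not copy, and a copy has to be 
asked for explicitly (here with copy.copy from the standard library).

```python
import copy

# (2) A name is not tied to a size or type: it can be rebound freely.
x = 42
x = "a much longer value than the integer was"
x = [1, 2, 3]

# (3) Python's designers chose not to copy on assignment.
y = x
print(y is x)            # True

# If you want a copy, you ask for one explicitly.
z = copy.copy(x)
print(z == x, z is x)    # True False
```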

[...]
> One thing that is important to note is that in each of these examples, the
> data types are immutable. In C++ if you have a string and you add to the end
> of that string, that string is still stored in the same location. In Python
> there's this magical string space that contains all the possible strings in
> existence[1] and when you "modify" a string using addition, what you're
> actually doing is telling the interpreter that you want to point to the
> string that is the result of addition, like 'hi' + '!'. Sometimes Python
> stores these as the same object, other times they're stored as different
> objects.

A better way of thinking about this is to say that when you concatenate 
two strings:

a = "hello"
b = "world"
text = a + b

Python will build a new string on the spot and then bind the name text 
to this new string.

The same thing happens even if you concatenate a string to an existing 
string, like this:

text = "hello"
text = text + "world"

Python looks at the lengths of the two existing strings, 5 and 5, 
allocates enough space for 10 letters, then copies letter-by-letter into 
the new string.
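
A small sketch makes the rebinding visible. The extra name "original" 
is my own addition, there only to keep hold of the first string so we 
can look at it afterwards:

```python
text = "hello"
original = text              # a second name for the same string

text = text + "world"        # builds a new string and rebinds "text"

print(text)                  # helloworld
print(original)              # hello
print(len(text) == len("hello") + len("world"))   # True
print(text is original)      # False
```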

However, this can be slow for big strings, so CPython (but not Jython 
and IronPython) has an optimization that can *sometimes* apply. If 
there is only one reference to "hello", and you are concatenating to the 
end, then CPython can sneakily re-use the space already there by 
expanding the first string, then copying into the end of it. But this is 
an implementation-dependent trick, and not something you can rely on.
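
If you are building a large string from many pieces and don't want to 
depend on that trick, one common approach is to collect the pieces in a 
list and join them at the end. A minimal sketch, with the list name 
"pieces" and the words chosen purely for illustration:

```python
# Collect the pieces first instead of concatenating one at a time.
pieces = []
for word in ["hello", "world", "again"]:
    pieces.append(word)

# join() builds the result once, rather than once per concatenation.
text = "".join(pieces)
print(text)   # helloworldagain
```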

-- 
Steven