[Tutor] How Python handles data (was guess-my-number programme)
Steven D'Aprano
steve at pearwood.info
Tue Sep 27 17:56:48 CEST 2011
Wayne Werner wrote:
> When you do something like this in C:
>
> int x = 0;
> int y = 0;
>
> What you have actually done behind the scenes is allocated two bytes of
> memory(IIRC that's in the C spec, but I'm not 100% sure that it's guaranteed
> to be two bytes). Perhaps they are near each other, say at addresses
> 0xab0fcd and 0xab0fce. And in each of these locations the value of 0 is
> stored.
The amount of memory will depend on the type of the variable. In C, you
have to declare what type the variable will be. The compiler then knows
how much space to allocate for it.
> When you create a variable, memory is allocated, and you refer to that
> location by the variable name, and that variable name always references that
> address, at least until it goes out of scope. So if you did something like
> this:
>
> x = 4;
> y = x;
>
> Then x and y contain the same value, but they don't point to the same
> address.
Correct, at least for languages like C or Pascal that have the "memory
location" model for variables.
> In Python, things are a little bit more ambiguous because everything is an
> object. So if you do this:
No. There is nothing ambiguous about it; it is merely different from C.
The rules are completely straightforward and defined exactly.
Also, the fact that Python is object oriented is irrelevant to this
question. You could have objects stored and referenced at memory
locations, like in C, if the language designer wanted it that way.
> x = 4
> y = x
>
> Then it's /possible/ (not guaranteed) that y and x point to the same memory
> location. You can test this out by using the 'is' operator, which tells you
> if the variables reference the same object:
The second half of your sentence is correct: you can test it with the
'is' operator. But the first half is wrong: given the two assignments
shown, x=4 and y=x, it *is* guaranteed that x and y will both reference
the same object. That is a language promise made by Python: assignment
never makes a copy. So if you have
x = 4
and then you do
y = x
the language *promises* that x and y now are names for the same object.
That is, "x is y" will return True, or id(x) == id(y).
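Here is a small sketch of that promise in action. The names "items" and "alias" are only for illustration; the list is included because sharing is easier to observe with an object you can change:

```python
# Assignment binds a second name to the same object; nothing is copied.
x = 4
y = x
print(x is y)            # True
print(id(x) == id(y))    # True

# The same holds for mutable objects, where the sharing is visible:
items = [1, 2, 3]
alias = items
alias.append(4)          # change the object through one name...
print(items)             # [1, 2, 3, 4]  ...and the other name sees it
```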
However, what is not promised is the behaviour of this:
x = 4
y = 4
In this case, you are doing two separate assignments where the right
hand side is given by a literal which merely happens to be the same. The
compiler is free to either create two separate objects, both with value
4, or just one. In CPython's case, it reuses some small numbers, but not
larger ones:
>>> x = 4
>>> y = 4
>>> x is y
True
>>> x = 40000
>>> y = 40000
>>> x is y
False
CPython caches a range of small integers (from -5 to 256 in the versions
I know of), although the exact range will depend on which version of
CPython you are using.
The reason for caching small integers is that it is faster to look them
up in the cache than to create a new object each time; but the reason
for only caching a handful of them is that the cache uses memory, and
you wouldn't want billions of integers being saved for a rainy day.
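If you want to poke at the cache yourself, something like the following sketch may help. The helper name same_object is made up for this example. It builds the integers from strings at run-time, because when two equal literals appear in the same piece of compiled code the compiler may share a single object between them, which would hide the effect. Whether the two objects turn out to be identical is an implementation detail, so no particular result is promised here:

```python
def same_object(text):
    """Build two ints from the same digits at run-time.

    Returns (equal, identical). Equality is guaranteed by the language;
    identity depends on the implementation and version.
    """
    a = int(text)
    b = int(text)
    return a == b, a is b

for text in ("4", "256", "257", "40000"):
    equal, identical = same_object(text)
    # equal is always True; identical may be True or False
    print(text, equal, identical)
```

The point to take away is that you should compare numbers with ==, never with "is".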
>>>> x = 4
>>>> y = x
>>>> x is y
> True
>
> But this is not guaranteed behavior - this particular time, python happened
> to cache the value 4 and set x and y to both reference that location.
As I've said, this is guaranteed behaviour, but furthermore, you
shouldn't think about objects ("variables") in Python having locations.
Of course, in reality they do, since it would be impossible -- or at
least amazingly difficult -- to design a programming language without
the concept of memory location. But as far as *Python* is concerned,
rather than the underlying engine that makes Python go, variables don't
have locations in any meaningful sense.
Think of objects in Python as floating in space, rather than lined up in
nice rows with memory addresses. From Python code, you can't tell what
address an object is at, and even if you could, you couldn't do anything
with the knowledge.
Some implementations, such as CPython, expose the address of an object
as the id(). But you can't do anything with it; it's just a number.
Other implementations, such as Jython and IronPython, don't do that.
There, every object gets a unique number, starting from 1 and counting
up, and if an object is deleted, its id doesn't get reused (unlike in
CPython).
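What you can rely on, in any implementation, is that id() gives the same number for the same object and different numbers for two objects that exist at the same time. A minimal sketch, using a throwaway class called Thing that exists only for this example:

```python
class Thing(object):
    """An empty class, used only to create distinct objects."""
    pass

a = Thing()
b = a          # second name for the same object
c = Thing()    # a different object

print(id(a))                          # some integer; the value means nothing
print(id(a) == id(b))                 # True: same object, same id
print(id(a) != id(c))                 # True: two live objects never share an id
print((a is b) == (id(a) == id(b)))   # True: "is" and id() agree
```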
Unlike the C "memory address" model, Python's model is of "name
binding". Every object can have zero, one, or more names:
    print([])    # Here, the list has no name.
    x = []       # Here, the list has a single name, "x".
    x = y = []   # Here, the list has two names, "x" and "y".
In practice, Python uses a dictionary to map names to objects. That
dictionary is exposed to the user using the function globals().
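To see the "dictionary of names" idea without cluttering your own globals, here is a sketch that runs a line of code against a plain dict. The dict called namespace is an assumption of this example; exec() uses it as the global namespace for the code it runs, the same kind of dictionary that globals() returns:

```python
namespace = {}

# Run an assignment with our dict acting as the global namespace.
exec("x = y = []", namespace)

# exec() also adds a "__builtins__" entry, so filter that out.
names = sorted(name for name in namespace if not name.startswith("__"))
print(names)                              # ['x', 'y']

# Both names map to one and the same list object.
print(namespace["x"] is namespace["y"])   # True
```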
The main differences between "memory location" variables and "name
binding" variables are:
(1) In memory-location languages, the compiler knows each variable's
location at compile-time; in name-binding languages, locations are only
known at run-time. In C-like languages, if I say:
x = 42
print x
the compiler knows to store 42 into location 123456 (say), and then have
the print command look at location 123456. But with name-binding, the
compiler doesn't know what location 42 will actually end up at until
run-time. It might be anything.
(2) Memory location variables must be fixed sizes, while name-binding
can allow variables to change size.
(3) Memory location variables must copy on assignment: x = 4; y = x
makes a copy of x to store in y, since x and y are different variables
and therefore different locations. Name-binding though, gives the
language designer a choice to copy or not.
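Points (2) and (3) can be seen from Python itself. This is only an illustration with made-up names; Python's choice for (3) is not to copy, so a copy has to be asked for explicitly:

```python
# (2) A name can be rebound to objects of any size or type.
value = 42
value = "a much longer piece of text"
value = list(range(1000))

# (3) Assignment does not copy; it adds another name for the same object.
original = [1, 2, 3]
same = original              # another name for the same list
duplicate = list(original)   # an explicit copy is a new object

same.append(4)
print(original)              # [1, 2, 3, 4]
print(duplicate)             # [1, 2, 3]
print(same is original)      # True
print(duplicate is original) # False
```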
[...]
> One thing that is important to note is that in each of these examples, the
> data types are immutable. In C++ if you have a string and you add to the end
> of that string, that string is still stored in the same location. In Python
> there's this magical string space that contains all the possible strings in
> existence[1] and when you "modify" a string using addition, what you're
> actually doing is telling the interpreter that you want to point to the
> string that is the result of addition, like 'hi' + '!'. Sometimes Python
> stores these as the same object, other times they're stored as different
> objects.
A better way of thinking about this is to say that when you concatenate
two strings:
a = "hello"
b = "world"
text = a + b
Python will build a new string on the spot and then bind the name text
to this new string.
The same thing happens even if you concatenate a string to an existing
string, like this:
text = "hello"
text = text + "world"
Python looks at the lengths of the two existing strings, 5 and 5,
allocates enough space for 10 letters, then copies letter-by-letter into
the new string.
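You can confirm that the original string is left alone by keeping a second name bound to it. In this sketch the name "before" is just for illustration:

```python
text = "hello"
before = text            # a second name for the original string
text = text + "world"    # builds a new string and rebinds the name text

print(text)              # helloworld
print(before)            # hello
print(len(text))         # 10
print(text is before)    # False: text now names a different object
```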
However, this can be slow for big strings, so CPython (but not Jython
and IronPython) has an optimization that can *sometimes* apply. If
there is only one reference to "hello", and you are concatenating to the
end, then CPython can sneakily re-use the space already there by
expanding the first string, then copying into the end of it. But this is
an implementation-dependent trick, and not something you can rely on.
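Since you can't rely on it, a common idiom when building a big string out of many pieces is to collect the pieces in a list and join them once at the end. A minimal sketch, with a made-up list of words:

```python
words = ["hello", "world", "again"]

# Repeated concatenation: may build a brand new string on every pass.
text = ""
for word in words:
    text = text + word

# Collect the pieces and join once; this does not depend on the trick.
joined = "".join(words)

print(text == joined)    # True
print(joined)            # helloworldagain
```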
--
Steven