Please help with MemoryError

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Thu Feb 11 20:50:26 EST 2010


On Thu, 11 Feb 2010 15:39:09 -0800, Jeremy wrote:

> My Python program now consumes over 2 GB of memory and then I get a
> MemoryError.  I know I am reading lots of files into memory, but not 2GB
> worth.

Are you sure?

Keep in mind that Python has a comparatively high overhead due to its 
object-oriented nature. If you have a list of characters:

['a', 'b', 'c', 'd']

there is the (small) overhead of the list structure itself, but each 
individual character is not a single byte, but a relatively large object:

>>> import sys
>>> sys.getsizeof('a')
32

So if you read (say) a 500MB file into a single giant string, you will 
have 500MB plus the overhead of a single string object (which is 
negligible). But if you read it into a list of 500 million single 
characters, you will have the overhead of a single list, plus 500 million 
strings, and that's *not* negligible: 32 bytes each instead of 1.

So try to avoid breaking a single huge string into vast numbers of tiny 
strings all at once.
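
You can get a rough feel for the difference with sys.getsizeof. A 
back-of-the-envelope sketch (the exact figures vary by platform and 
Python build):

import sys

n = 10**6
one_big_string = sys.getsizeof('x' * n)    # about n bytes plus one small header
per_char_objects = n * sys.getsizeof('x')  # about 32*n bytes of headers alone
print one_big_string, per_char_objects

And that second figure doesn't even count the list needed to hold the 
million references.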



> I thought I didn't have to worry about memory allocation in
> Python because of the garbage collector.

You don't have to worry about explicitly allocating memory, and you 
almost never have to worry about explicitly freeing memory (unless you 
are making objects that, directly or indirectly, contain themselves -- 
see below); but unless you have an infinite amount of RAM available, you 
can of course run out of memory if you use it all up :)


> On this note I have a few
> questions.  FYI I am using Python 2.6.4 on my Mac.
> 
> 1.    When I pass a variable to the constructor of a class does it copy
> that variable or is it just a reference/pointer?  I was under the
> impression that it was just a pointer to the data. 

Python's calling model is the same whether you pass to a class 
constructor or any other function or method:

x = ["some", "data"]
obj = f(x)

The function f (which might be a class constructor) sees the exact same 
list as you assigned to x -- the list is not copied first. However, 
there's no promise made about what f does with that list -- it might copy 
the list, or make one or more additional lists:

def f(a_list):
    another_copy = a_list[:]         # a shallow copy of the caller's list
    another_list = map(int, a_list)  # a brand-new list built from it
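
One way to see for yourself that the function receives the caller's 
actual list, not a copy, is to mutate it inside the function and look at 
the result afterwards (a quick sketch):

def add_marker(a_list):
    a_list.append('seen by the function')

x = ['some', 'data']
add_marker(x)
print x  # ['some', 'data', 'seen by the function'] -- the caller's list changed

If the function had been given a copy, the caller's list would be 
untouched.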


> 2.    When do I need
> to manually allocate/deallocate memory and when can I trust Python to
> take care of it? 

You never need to manually allocate memory.

You *may* need to deallocate memory if you make "reference loops", where 
an object refers to itself, directly or indirectly:

l = []  # make an empty list
l.append(l)  # add the list l to itself

Python's cycle detector can break such reference loops itself, even 
indirect ones like the following, but (in Python 2) any loop that passes 
through an object with a __del__ method is never collected automatically, 
and you may need to break it yourself:

a = []
b = {2: a}       # b refers to a
c = (None, b)    # c refers to b
d = [1, 'z', c]  # d refers to c
a.append(d)      # and a refers to d: a -> d -> c -> b -> a, a reference loop
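
If you do end up with such a loop, either break one of the links by hand 
(say, del b[2]) or ask the cycle detector to run with gc.collect. A 
sketch:

import gc

a = []
b = {2: a}
c = (None, b)
d = [1, 'z', c]
a.append(d)     # a -> d -> c -> b -> a

del a, b, c, d  # the names are gone, but the loop keeps the objects alive
gc.collect()    # the cycle detector finds and reclaims the whole loop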

Python will deallocate objects when they are no longer in use. They are 
always considered in use any time you have them assigned to a name, or in 
a list or dict or other structure which is in use.

You can explicitly remove a name with the del command. For example:

x = ['my', 'data']
del x

After deleting the name x, the list object itself is no longer in use 
anywhere and Python will deallocate it. But consider:

x = ['my', 'data']
y = x  # y now refers to THE SAME list object
del x

Although you have deleted the name x, the list object is still bound to 
the name y, and so Python will *not* deallocate the list.

Likewise:

x = ['my', 'data']
y = [None, 1, x, 'hello world']
del x

Although now the list isn't bound to a name, it is inside another list, 
and so Python will not deallocate it.
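
You can watch these references directly with sys.getrefcount. A sketch 
(the count is always one higher than you might expect, because passing 
the object to getrefcount itself creates a temporary reference; the 
numbers shown are what CPython typically prints):

import sys

x = ['my', 'data']
print sys.getrefcount(x)     # 2: the name x, plus the temporary argument

y = [None, 1, x, 'hello world']
print sys.getrefcount(x)     # 3: x, the slot inside y, and the argument

del x
print sys.getrefcount(y[2])  # 2: the slot inside y, plus the argument

When the count would drop to zero, Python deallocates the object.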



> 3.    Any good practice suggestions?

Write small functions. Any temporary objects created by the function will 
be automatically deallocated when the function returns.
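
For example (a sketch; the point is only that the temporary lives no 
longer than the call):

def squares_total(numbers):
    squares = [n*n for n in numbers]  # temporary list, local to this function
    return sum(squares)

# once squares_total() returns, the name 'squares' is gone and the
# list is deallocated automatically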

Avoid global variables. They are an easy way to inadvertently keep data, 
or multiple copies of it, alive long after you have finished with it.

Try to keep data in one big piece rather than lots of little pieces.

But contradicting the above, if the one big piece is too big, it will be 
hard for the operating system to swap it in and out of virtual memory, 
causing thrashing, which is *really* slow. So aim for big, but not huge.

(By "big" I mean megabyte-sized; by "huge" I mean hundreds of megabytes.)

If possible, avoid reading the entire file in at once, and instead 
process it line-by-line.
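
In Python 2.6 you can iterate over the file object directly, which reads 
one line at a time (a sketch; process_line is a stand-in for whatever 
work you do on each line):

with open('huge_file.txt') as f:
    for line in f:            # only one line held in memory at a time
        process_line(line)    # hypothetical per-line work

(Strictly speaking the file object buffers a block at a time, but only a 
block -- never the whole file.)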


Hope this helps,



-- 
Steven


