[Tutor] Memory error - how to manage large data sets?

John Fouhy john at fouhy.net
Tue Jul 29 01:48:11 CEST 2008


On 29/07/2008, Daniel Sarmiento <dsarmientos at gmail.com> wrote:
>  I tried to run your code and checked (with top) the memory ussage and
>  it uses more than 2 Gb of memory.
>
>  I tried to modify the code a little bit to use less memory and came up
>  with this:
>
>  fib = {0:0,1:1}
>
> even = []
>
>  def fibonacci(x,y):
>    return x+y
>
> for j in xrange (2,1000000):
>     i = fib[j-1] + fib[j-2]
>     if i % 2 == 0:
>         even.append(i)
>     fib = {j-1:fib[j-1], j:i}
>
>
>  total = reduce(fibonacci,even)
>  print total
>
>  It looks like the progam still hangs and I did not notice any memory
>  imrovements when running it with 1 000 000

Well, let's see.  You're still storing all the even fibonacci numbers
that you compute.  By the looks of things, one third of fibonacci
numbers are even, so that's about 333,333 numbers that we're storing.

How big do these numbers get?  There's a closed-form expression for
Fibonacci numbers; it's: fib(n) = (phi**n - (1-phi)**n)/sqrt(5), where
phi is the golden ratio (about 1.6).  1-phi is -0.6, so when n is
large, (1-phi)**n is practically zero.  So fib(n) is roughly
phi**n/sqrt(5).  These numbers will quickly get beyond the size of
normal integers, and into long integers.  I don't know how many bytes
a long integer takes up, but I guess we can estimate it by looking at
the log to base 2.

So let's consider the millionth Fibonacci number.  fib(1000000) ~=
phi**1000000/sqrt(5).
So log(fib(1000000)) ~= log(phi**1000000/sqrt(5)) = 1000000*log(phi) -
log(sqrt(5)).

Taking logs to base 2, we get:

>>> 1000000*log(phi, 2) - log(sqrt(5), 2)
694240.75266657001

In other words, the best possible representation of the millionth
Fibonacci number will take almost 700,000 bits, or around 85
kilobytes.  I don't know how Python long integers actually work; it
may take up more space than that.

Of course, that's just the biggest one we're working out.  Let's try
to work out how much space the first million will take up.  This is:

sum_{i=1}^{1000000} i*log(phi, 2) - log(sqrt(5), 2)

== -1000000*log(sqrt(5), 2) + log(phi, 2)*sum_{i=1}^{1000000} i

Remembering the formula for summing integers (n(n+1)/2), this is:

>>> -1000000*log(sqrt(5), 2) + log(phi, 2)*(1000000*1000001/2)
347120142972.21808

This is a number of bits; divide by 8 to get bytes; divide by 8*1024
to get kilobytes, etc:

>>> _/(8*1024*1024*1024)
40.410103156722393

So ... that's about 40 gigabytes worth of numbers in this list you're
trying to build.  Well, actually you're only storing a third of the
Fibonacci numbers (the even ones), so we can cut that down to thirteen
gigabytes.  Assuming my maths is correct and there's not too much
overhead in Python long integers :-)

(the solution, of course, is to avoid storing all those numbers in the
first place)

-- 
John.


More information about the Tutor mailing list