[Tutor] Python performance on Windows system

Sun, 28 Oct 2001 12:32:48 +0100

Hi Joel,

The problem is certainly the disk read/writes.

As you found out, trying to manipulate a large list of integers
consumes a lot of memory. A python integer (an integer object) takes
about 20 bytes (IIRC), so 100M of memory for a list of 5 milion integers
sounds about right. However the integer value itself is only 4 bytes.
So, if you use the array module to store the integers in an array, you
already slash your memory consumption to 20M.

However, there is a nice trick that can reduce the memory consumption a
lot more. It's a variation on the trick you could have used with your
file too.

Note that your file lists all the integers from 1 to end, using 12 bytes
per number (10 chars+'\r'+'\n'). However, to see if number n
is a prime you compute an offset (n*12) and read the line to see if
the number is n (then it's a prime) or 0 then it's not. But, since you
already know the number you're interested in, why store it. You could
just as easily have made a file with either a 1 or a 0 on each line
(costing only three bytes) use n to compute the offset (n*3) and get
either a 0 (not prime) or 1 (prime). Or, just store it as one long
string of 0s and 1s. The cost would have been one byte per number and
the offset for number n would have been n. But, 0 or 1 that's not a
byte, that's a bit. It should be possible to store the result for eight
numbers in 1 byte. Reducing the memory/disk cost from 20/12 bytes per
number to 1 byte per 8 numbers. That's a factor of a hundred.

Now, that's a significant factor. With your current method you'll start
getting into trouble again if you try to do 80M numbers, since you'll
then get in the 1GB limit on your file size. However, if you could
reduce the cost to 1 byte per 8 number, Those same 80M numbers would fit
comfortably in about 10M of memory without any need to use disk access
at all.

Without further ado, here's a class (bitset) that packs the information
you're interested in. It uses the fact that an integer contains 32 bits.
So I can store n bits of information in an array of n/32 integers. I
also include a version of your sieve algorithm that uses the bitset to
store its result.

import math, array, time

class bitset:
    '''A class to hold a bitset of arbitrary length
    The bitset is stored in chunks of 32 bits.'''

    def __init__(self, size):
        self.size = size
        self.data = array.array('l', [0]*int(math.ceil(size/32.0)))

    def setbit(self, index):
        '''set bit at index to true'''
        arrindex, bitindex = divmod(index, 32)
        self.data[arrindex] |= (1<<bitindex)

    def flipbit(self, index):
        '''flip the value at index'''
        arrindex, bitindex = divmod(index, 32)
        self.data[arrindex] ^= (1<<bitindex)

    def unsetbit(self, index):
        '''set bit at index to false'''
        arrindex, bitindex = divmod(index, 32)
        self.data[arrindex] &= ~(1<<bitindex)

    def getbit(self, index):
        '''get the value of bit at index'''
        arrindex, bitindex = divmod(index, 32)
        if self.data[arrindex] & (1<<bitindex):
            return 1
        else:
            return 0

# And your sieve algorithm becomes:

def find_primes(end):
    '''return a bitset of size end

    The bit at index i is true if i is not a prime and false
    if i is a prime
    '''
    bs = bitset(end)  
    endsqrt = math.floor(math.sqrt(end))
    x = 2
    while x <= endsqrt:
        for y in xrange(2*x, end, x):
            bs.setbit(y)
        while 1:
            x += 1
            if not bs.getbit(x):
                break
    return bs

if __name__ == '__main__':
    end = 10000L
    bs = find_primes(end)
    # print the primes
    for i in xrange(1, end):
        if not bs.getbit(i):
            print i,

--

As a final note, I didn't really think all this out myself. The method I
descibe here is almost straight from 'Programming Pearls'
by Jon Bentley.

Hope this helps,

Roeland

> Joel Ricker wrote:
> 
> Hi all,
> 
> I've been away from Python for  a little while and so to stretch my
> programming legs a little bit, I've been working on a script to
> generate prime numbers.  I've used a sieve algorithm and originally a
> list of numbers in an array and have reworked it to use a text file in
> the same way. I've noticed when working with big numbers, say finding
> all prime numbers between 1 and 5 million, my system becomes a little
> less responsive.  It was real bad when I used a large array since
> sometimes the memory used by python would run into the 100 meg range
> but since I've switched to using a file it has improved alot but its
> still there.
> 
>  Basically what happens is while the script is running, other windows
> are a little slow to appear and starting new programs or closing
> windows takes several seconds to hapen.  I'm running Windows 2000,
> with a 750 mhz processor and 128megs of memory.  What is odd though is
> that using Task Manager, I see that Python isn't using that much CPU
> -- only 1 or 2 percent at the most and very little memory -- about
> 520k.
> 
> Is it all the disk writes that is slowing things down?  Anything I can
> do to my code to help things along?  Or anything I can do to the
> python interpreter itself?  Below is the main part of the code that is
> doing most of the work in the script.  primein.txt is a text file
> containing a list of numbers between 0 and the max number to search
> to.
> 
> Thanks
> Joel
> 
>     f = open('primein.txt','r+')
>     x = 2
>     endsqrt = math.floor(math.sqrt(end))
>     while x <= endsqrt:
>         print x
>         for y in xrange(2, end/x+1):
>             f.seek(x * y * 12)
>             f.write('%10d\n' % 0)
>         f.flush()
>         f.seek(x * 12 + 12)
>         while 1:
>             x = f.readline()
>             if int(x) > 0:
>                 x = int(x)
>                 break

-- 
r.b.rigilink@chello.nl

"Half of what I say is nonsense. Unfortunately I don't know which half"