Prime number module

Thu Oct 2 14:29:47 EDT 2003

On Wed, 01 Oct 2003 02:35:31 -0400, Lulu of the Lotus-Eaters <mertz at gnosis.cx> wrote:

>|> I was disappointed that gzip only reduces the size by 41%.  I would have
>|> guessed better, given the predominance of 0 bits.
>
>Juha Autero <Juha.Autero at iki.fi> wrote previously:
>|patterns and it's hardly a suprise that bitmap of prime numbers
>|doesn't have lots of repeating bit patterns.
>
>Sure it does.  The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of
>times :-).  Certainly much more than any other pattern.  After all, my
>10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be
>a lot of runs of zeros in there :-).
>
>The problem, I think, is that these bit patterns don't fall on byte
>borders as neatly as I would like.  I am pretty sure that I could write
>a custom LZ-style compressor that did better by being bit-oriented
>rather than byte-oriented.  But I'm not going to do that work, 3.8 megs
>is small enough for me.

Hm, a bit sequence 001010010010000100100001001000010000001001 ... could be
encoded in a series of symbols '001', '01', '001', '001', '00001', '001', ...
which are essentially just different string representations for prime first differences,
e.g., (faking in the starting 0)

 >>> prime = [0,2,3,5,7,11,13,17,19,23,29,31]
 >>> ['0'*(prime[i+1]-prime[i])+'1' for i in range(len(prime)-1)]
 ['001', '01', '001', '001', '00001', '001', '00001', '001', '00001', '0000001', '001']

Or, using different concrete symbols,
 >>> [chr(prime[i+1]-prime[i]+ord('a')) for i in range(len(prime)-1)]
 ['c', 'b', 'c', 'c', 'e', 'c', 'e', 'c', 'e', 'g', 'c']

Maybe it would be best to analyze the frequency and assign huffman codes and
pack those, for maximal compression? I guess that is about what zip would do to
the latter (byte-oriented, as you mentioned) symbols.

But actually, I was looking at a modulo 2*3*5*7... mask as a pre-compression step, which
might pack things even tighter, because it eliminates whole structured patterns
from the whole. One could do a bigger modulo too. If I get ambitious, I might
try it. I have an inkling of a multi-step modulo, multi-bitmap check that could go
fast in smaller space, but I have to meet someone now...

[Back next day, didn't post that]

The idea alluded to above is to make a mask using a selected part of the front
of the big bitmap, and noting that its zeroes are guaranteed to repeat in forward
repeats of that mask, so those zeroes don't need actual representation going forward.

I.e., if pfact(p) is the "factorial" product of all primes from p on down, then
if mlen=pfact(p),  [bitmap[i:i+mlen] for i in range(1,len(bitmap):mlen)] will be
(untested) a list of bitmap chunks with guaranteed zeroes everywhere bitmap[1:mlen+1]
had them. I downloaded your bitmap, so it should be easy to confirm...

(Ok, but I reconstituted the theoretical bitmap by including factors of 2 and the
primes 1 & 2 in the following)

===< primask.py >====================================================
def getbitmasks(fpath, nb):
    chunksize = (nb/2+7)//8 # nb/2 since even bits are not in bitfile
    f = file(fpath,'rb')
    bits = []
    def bufbits():
        if len(bits)>=nb: return
        bitchunk = f.read(chunksize) # bigendian bytes w/ bits for odd no's 1,3,5...
        for byte in bitchunk:
            ibyte = ord(byte)
            for b in xrange(8):
                bits.append('.1'[(0x80>>b)&ibyte>0])
                bits.append('.') # add in evens

    bufbits()
    bits[0:2] = ['1','1'] # account for primes 1 & 2 not in bit file at front
    while 1:
        bufbits() # make sure
        if not bits: break
        yield ''.join(bits[:nb])
        bits = bits[nb:]

def primask(p, nlines=5):
    pf = {2:2, 3:6, 5:30, 7:210, 11:2310, 13:30030, 17:510510}[p] # should suffice
    bgen = getbitmasks('primes.bitarray', pf)
    print 'Prime for mask = %s, mask length = %s' % (p, pf)
    for n in xrange(nlines):
        s = bgen.next()
        nz = s.count('.')
        if not n: print 'Precompression possible: (%s-%s zeroes)/%s => %.1f %% of orig' %(
                                pf,nz,pf,100.*(pf-nz)/pf)
        print '%4s: %s' %(nz, len(s)<80 and s or s[:80]+' etc.')

if __name__ == '__main__':
    import sys
    try:
        args = map(int, sys.argv[1:])
        p = args.pop(0)
        nl = args and args.pop(0) or 5
        primask(p, nl)
    except: raise; SystemExit("""
        Usage: primask prime [number_of_lines]
            where prime determines size of mask as product of primes at and below this
            and number of lines is number to print.""")
=====================================================================

This will print segments of the bit map (reconstituted to include all numbers starting with 1,2,3,...).
The segments are the length of a possible mask taken from the front whose non-prime slots are bound
to repeat through the rest, since corresponding slots will be multiples of the primes in the first set.
Since those zeroes repeat, one could remove them from the rest of the bit file, as you did with
even numbers. I print the possible compression based on the number of zeroes in the first segment.
So unless I goofed, ...

(To read primes from masks start with 1,2,3,(.),5,(.) etc)

[11:00] C:\pywk\clp\prime>primask.py 3
Prime for mask = 3, mask length = 6
Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
   2: 111.1.
      123 5    <<-- (primes represented, etc)
   4: 1...1.
      7   +--11
   4: 1...1.
      |   +--17
      +------13
   4: 1...1.
   5: ....1.

[11:00] C:\pywk\clp\prime>primask.py 5
Prime for mask = 5, mask length = 30
Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
  19: 111.1.1...1.1...1.1...1.....1.
  23: 1.....1...1.1...1.....1.....1.
  23: 1.....1...1.1.....1...1.....1.
  24: ......1...1.1...1.1...1.......
  25: ......1...1.....1.1.........1.

[11:00] C:\pywk\clp\prime>primask.py 7
Prime for mask = 7, mask length = 210
Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
 163: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
 175: 1...........1...1.1...1.....1.1.........1.....1.....1.....1.1.....1...1.1....... etc.
 177: 1.........1.1.....1...1.....1.......1...1.1...1...........1.......1...1.......1. etc.
 178: 1.........1.1...1.....1.....1.1...........1...1.....1.......1.........1.......1. etc.
 180: ............1...1.1...1.............1...1.1...1...................1...1.......1. etc.

[11:00] C:\pywk\clp\prime>primask.py 11
Prime for mask = 11, mask length = 2310
Precompression possible: (2310-1966 zeroes)/2310 => 14.9 % of orig
1966: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
2030: 1.....................1.....1.1.....1...1.....1.............1.....1...1.1.....1. etc.
2043: 1...............1.1...1.....1.1.....1.....1.........1.....1...........1......... etc.
2055: ................1.1.........1.1.....1...1.....1.....1.......1.....1...1......... etc.
2064: 1...............1...................1...1.1.........1.................1.......1. etc.

[11:00] C:\pywk\clp\prime>primask.py 13
Prime for mask = 13, mask length = 30030
Precompression possible: (30030-26781 zeroes)/30030 => 10.8 % of orig
26781: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
27216: ................1...........1...........1.................1.1.....1.....1.....1. etc.
27366: ................1.....1.....1.1.........1.1...1...................1.....1.....1. etc.
27444: ................1.............1.....1.....................1.............1....... etc.
27481: 1...................................1.....1...1.............1...........1.....1. etc.

[11:00] C:\pywk\clp\prime>primask.py 17
Prime for mask = 17, mask length = 510510
Precompression possible: (510510-468178 zeroes)/510510 => 8.3 % of orig
468178: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
472790: ..................1.....................1.1...............1...........1.1.....1. etc.
474186: ......................1.......................1.....1.......1.....1...1.1....... etc.
475041: ..................1...........1.....1.......................1................... etc.
475727: ..................1.................1.....1.......................1...1......... etc.

So it appears we could compress that file (imagining it as containing evens also) quite a bit,
by using a (510510+7)//8 => 63814 byte mask as the pattern for squishing zeroes out of the rest.

BTW, I thought primes ran about 10% of all numbers, so 8.3 % seems a little suspicious.
Did I goof, or what is the right percentage? I guess I'll make a dict of all the gap sizes
with their respective counts in the bit map and see, but have to leave this for now ...

>
>FWIW, bzip2 doesn't do much better either.  I'd be kinda surprised at
>this point if another representation of the primes under 10^8 actually
>did better[**].  I proposed listing the gaps as a strategy; but given
>that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
>Maybe encoding gaps in nibbles-with-escaping would work.  But now I'm
>fond of the speed of the bit array.
>
>Yours, Lulu...
>
>[*] I'm missing two primes versus some other claims.  My odd numbers
>don't include 2; and I called 1 a non-prime.  I don't feel strongly
>about the second matter though.
>
>[**] Well, a MUCH shorter representation is a Python-coded algorithm for
>generating the primes in the first place.  But somehow that seems like
>cheating :-).

Oops, better post this and go...

Regards,
Bengt Richter