[Python-checkins] python/nondist/sandbox/spambayes rebal.py,NONE,1.1

tim_one@users.sourceforge.net tim_one@users.sourceforge.net
Sat, 31 Aug 2002 11:52:40 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30297

Added Files:
	rebal.py 
Log Message:
A little script I use to rebalance the ham corpora after deleting what
turns out to be spam.  I have another Ham/reservoir directory with a
few thousand randomly selected msgs from the presumably-good archive.
These aren't used in scoring or training.  This script marches over all
the ham corpora directories that are used, and if any have gotten too
big (this never happens anymore) deletes msgs at random from them, and
if any have gotten too small plugs the holes by moving in random
msgs from the reservoir.
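
The trim-or-top-up logic described above can be sketched without touching the disk. This is only an illustration, written for Python 3 rather than the Python 2 of the checked-in script: the function name `rebalance`, the `accept` callback (a stand-in for the manual "good enough?" prompt) and the returned action tuples are all hypothetical and do not appear in rebal.py.

```python
import random


def rebalance(corpora, reservoir, target, accept=lambda name: True, rng=random):
    """Trim or top up each list in `corpora` so it holds `target` names.

    corpora:   dict mapping a directory name to its list of message names
    reservoir: list of spare message names, consumed as they are offered
    accept:    hypothetical stand-in for the manual "good enough?" check
    Returns a list of (action, directory, name) tuples; nothing is deleted
    or moved on disk.
    """
    actions = []
    for directory, names in corpora.items():
        # Too big: drop messages chosen at random.
        while len(names) > target:
            victim = rng.choice(names)
            names.remove(victim)
            actions.append(("delete", directory, victim))
        # Too small: offer random reservoir messages until the hole is
        # plugged or the reservoir runs dry. A rejected candidate is not
        # offered again.
        while len(names) < target and reservoir:
            candidate = rng.choice(reservoir)
            reservoir.remove(candidate)
            if accept(candidate):
                names.append(candidate)
                actions.append(("move", directory, candidate))
    return actions


if __name__ == "__main__":
    corpora = {"Set1": ["a", "b", "c", "d"], "Set2": ["e"]}
    spare = ["r1", "r2", "r3"]
    for action in rebalance(corpora, spare, target=3, rng=random.Random(0)):
        print(action)
```

Returning a list of actions instead of calling os.unlink keeps the sketch safe to run against made-up names; a real run would carry out each action on the files.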


--- NEW FILE: rebal.py ---
import os
import sys
import random

# Disabled one-off cleanup: it sits inside a string literal, so it never runs.
'''
dead = """
Data/Ham/Set1/62902.txt
Data/Ham/Set3/17667.txt
Data/Ham/Set5/129688.txt"""

for f in dead.split():
    os.unlink(f)

sys.exit(0)
'''

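# Target number of messages in each ham directory.
NPERDIR = 4000
# Spare ham messages that are not used in scoring or training.
RESDIR = 'Data/Ham/reservoir'
res = os.listdir(RESDIR)

# Collect (directory, file list) pairs for Data/Ham/Set1 .. Set5.
stuff = []
for i in range(1, 6):
    dir = 'Data/Ham/Set%d' % i
    fs = os.listdir(dir)
    stuff.append((dir, fs))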

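# Work on each directory until it holds exactly NPERDIR messages.
while stuff:
    dir, fs = stuff.pop()
    if len(fs) > NPERDIR:
        # Too big: delete one message at random, then requeue the directory.
        f = random.choice(fs)
        fs.remove(f)
        print "deleting", f, "from", dir
        os.unlink(dir + "/" + f)
        stuff.append((dir, fs))
    elif len(fs) < NPERDIR:
        # Too small: offer a random reservoir message for manual approval.
        print "need a new one for", dir
        if not res:
            # random.choice() raises IndexError on an empty list.
            print "reservoir is empty; giving up"
            break
        f = random.choice(res)
        print "How about", f
        res.remove(f)

        fp = open(RESDIR + "/" + f, 'rb')
        guts = fp.read()
        fp.close()

        print guts
        ok = raw_input('good enough? ')
        if ok.startswith('y'):
            # Approved: copy it into the directory and drop the reservoir copy.
            fp = open(dir + "/" + f, 'wb')
            fp.write(guts)
            fp.close()
            os.unlink(RESDIR + "/" + f)
            fs.append(f)
        # Requeue whether or not the message was accepted, so that a
        # rejection does not leave this directory short.
        stuff.append((dir, fs))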