[Spambayes] Possible to merge Spambayes databases?

Skip Montanaro skip at pobox.com
Tue Mar 23 16:19:54 EST 2004


    Roar> Is it possible to merge the Spambayes databases so I don’t have to
    Roar> classify the same things as spam at home that I already have
    Roar> classified as spam at work and vice versa?

Yeah, with a little effort.  Do you already write programs in Python?

Let's set one simple ground rule: Use the new (in CVS) version of
sb_dbexpimp.py.  That means the interchange format will be a csv file, which
will start with something like:

    436,518
    received:kulnet.kuleuven.ac.be,0,1
    ,0,1
    0.18x),1,0

That is, the first row has just two fields (the total numbers of spam and ham
messages trained) and the remaining rows have three fields (feature, number
of spam containing it, number of ham containing it).
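
To make that layout concrete, here is a minimal sketch of a reader for it,
written for Python 3.  The function name read_dump and the SAMPLE string are
made up for illustration (the rows are just the ones quoted above), and the
column order is assumed to be the one described in this message.

```python
import csv
import io

# The example rows quoted above, used as stand-in input.
SAMPLE = (
    "436,518\n"
    "received:kulnet.kuleuven.ac.be,0,1\n"
    ",0,1\n"
    "0.18x),1,0\n"
)

def read_dump(stream):
    """Parse a dump: return (totals, {feature: (nspam, nham)}).

    Hypothetical helper, not part of Spambayes.
    """
    reader = csv.reader(stream)
    # First row: exactly two integer totals.
    totals = tuple(int(field) for field in next(reader))
    if len(totals) != 2:
        raise ValueError("expected two totals in the first row")
    features = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        feature, ns, nh = row  # every other row has three fields
        features[feature] = (int(ns), int(nh))
    return totals, features

totals, features = read_dump(io.StringIO(SAMPLE))
```

Note that a feature can be the empty string (the ",0,1" row), so it is
safer to let the csv module split the rows than to do it by hand.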

Now, assuming you've dumped both databases to csv files (call 'em home.csv
and work.csv), this should approximate what you're after:

    import csv
    home = csv.reader(file("home.csv", "rb"))
    work = csv.reader(file("work.csv", "rb"))

    home_spam, home_ham = map(int, home.next())
    work_spam, work_ham = map(int, work.next())

    # sum the feature information in the two csv files
    features = {}
    for reader in home, work:
        for (feature, ns, nh) in reader:
            s, h = features.get(feature, (0,0))
            s += int(ns)
            h += int(nh)
            features[feature] = (s,h)

    # write out a merged csv file
    merged = csv.writer(file("merged.csv", "wb"))
    merged.writerow((home_spam+work_spam, home_ham+work_ham))
    for feature in features:
        ns, nh = features[feature]
        merged.writerow((feature,ns,nh))
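
If you are on Python 3, where file() and reader.next() no longer exist, the
same merge might look roughly like the sketch below.  The function name
merge_dumps and the sample tokens and counts are invented for illustration,
and it works on any number of open text streams rather than two fixed
filenames.

```python
import csv
import io

def merge_dumps(inputs, output):
    """Sum the totals and per-feature counts of several csv dumps.

    Hypothetical helper: `inputs` is an iterable of open text streams,
    `output` is a writable text stream.
    """
    total_spam = total_ham = 0
    features = {}
    for stream in inputs:
        reader = csv.reader(stream)
        # First row of each dump holds the two totals.
        spam, ham = map(int, next(reader))
        total_spam += spam
        total_ham += ham
        # Remaining rows: feature, spam count, ham count.
        for feature, ns, nh in reader:
            s, h = features.get(feature, (0, 0))
            features[feature] = (s + int(ns), h + int(nh))

    writer = csv.writer(output)
    writer.writerow((total_spam, total_ham))
    for feature, (ns, nh) in features.items():
        writer.writerow((feature, ns, nh))

# Made-up sample data standing in for home.csv and work.csv.
home = io.StringIO("436,518\nviagra,10,0\nmeeting,0,7\n")
work = io.StringIO("100,200\nviagra,3,1\npython,0,12\n")
merged = io.StringIO()
merge_dumps([home, work], merged)
merged_rows = list(csv.reader(merged.getvalue().splitlines()))
```

With real files, open them with open(name, newline="") for both reading and
writing, as the csv module documentation recommends.  The counts are summed
column by column, so the merge does not care which column is spam and which
is ham, as long as both dumps use the same order.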

Skip


