[Spambayes] Possible to merge Spambayes databases?
Skip Montanaro
skip at pobox.com
Tue Mar 23 16:19:54 EST 2004
Roar> Is it possible to merge the Spambayes databases so I don't have to
Roar> classify the same things as spam at home that I already have
Roar> classified as spam at work and vice versa?
Yeah, with a little effort. Do you already write programs in Python?
Let's set one simple ground rule: Use the new (in CVS) version of
sb_dbexpimp.py. This implies the interchange format will be a csv file. It
will start with something like
436,518
received:kulnet.kuleuven.ac.be,0,1
,0,1
0.18x),1,0
That is, the first row has just two fields (total numbers of spam and ham)
and the remaining rows have three fields (feature, number of spam, number of
ham).
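To make that layout concrete, here is a minimal sketch of reading such a dump with the csv module. It assumes Python 3 and parses an in-memory sample rather than a real export; the first two sample rows are the ones shown above, and the third (subject:example) is made up for illustration.

```python
import csv
import io

# Sample in the dump format described above. The last row is hypothetical,
# added only so the loop has more than one feature to read.
sample = io.StringIO(
    "436,518\n"
    "received:kulnet.kuleuven.ac.be,0,1\n"
    "subject:example,3,0\n"
)

reader = csv.reader(sample)

# First row: total number of spam and ham messages trained on.
nspam, nham = map(int, next(reader))

# Remaining rows: feature, spam count, ham count.
counts = {}
for feature, ns, nh in reader:
    counts[feature] = (int(ns), int(nh))

print(nspam, nham)
print(counts["received:kulnet.kuleuven.ac.be"])
```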
Now, assuming you've dumped both databases to csv files (call 'em home.csv
and work.csv), this should approximate what you're after:
import csv
home = csv.reader(file("home.csv", "rb"))
work = csv.reader(file("work.csv", "rb"))
home_spam, home_ham = map(int, home.next())
work_spam, work_ham = map(int, work.next())
# sum the feature information in the two csv files
features = {}
for reader in home, work:
    for (feature, ns, nh) in reader:
        s, h = features.get(feature, (0,0))
        s += int(ns)
        h += int(nh)
        features[feature] = (s,h)
# write out a merged csv file
merged = csv.writer(file("merged.csv", "wb"))
merged.writerow((home_spam+work_spam, home_ham+work_ham))
for feature in features:
    ns, nh = features[feature]
    merged.writerow((feature,ns,nh))
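The script above relies on file() and reader.next(), which exist only in Python 2. For anyone on Python 3, the same summing approach might look roughly like the sketch below. The function names merge_dumps and write_dump are hypothetical, and the sample data is made up so the example runs without real dump files.

```python
import csv
import io


def merge_dumps(*streams):
    """Sum the message totals and per-feature counts of several csv dumps."""
    total_spam = total_ham = 0
    features = {}
    for stream in streams:
        reader = csv.reader(stream)
        # First row of each dump: total spam, total ham.
        ns, nh = map(int, next(reader))
        total_spam += ns
        total_ham += nh
        # Remaining rows: feature, spam count, ham count.
        for feature, ns, nh in reader:
            s, h = features.get(feature, (0, 0))
            features[feature] = (s + int(ns), h + int(nh))
    return (total_spam, total_ham), features


def write_dump(stream, totals, features):
    """Write the merged data back out in the same csv layout."""
    writer = csv.writer(stream)
    writer.writerow(totals)
    for feature, (ns, nh) in features.items():
        writer.writerow((feature, ns, nh))


# Made-up stand-ins for home.csv and work.csv.
home = io.StringIO("2,3\nviagra,2,0\nmeeting,0,3\n")
work = io.StringIO("1,4\nviagra,1,0\nbudget,0,2\n")

totals, features = merge_dumps(home, work)

out = io.StringIO()
write_dump(out, totals, features)
print(out.getvalue())
```

With real files, the streams would be opened with open("home.csv", newline="") and open("merged.csv", "w", newline=""), since the Python 3 csv module expects text-mode files opened with newline="".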
Skip