[Spambayes] Ditching WordInfo
Neale Pickett
neale@woozle.org
06 Sep 2002 16:50:19 -0700
I hacked up something to turn WordInfo into a tuple before pickling, and
then turn the tuple back into WordInfo right after unpickling. Without
this hack, my database was 21549056 bytes. After, it's 9945088 bytes.
That's a savings of about 54%, not a bad optimization.
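
To see where the savings come from without building a database: a pickled instance carries a reference to its class alongside its state, while a pickled tuple carries only the state. Here is a minimal sketch in current Python 3 (the hack itself used Python 2's cPickle). The WordInfo below is a stand-in whose slot names are assumptions, not a copy of classifier.WordInfo, and the real on-disk difference also depends on how the dbm backend stores records.

```python
import pickle


class WordInfo:
    # Stand-in for classifier.WordInfo; the slot names are assumptions.
    __slots__ = ("atime", "spamcount", "hamcount", "killcount", "spamprob")

    def __init__(self, atime):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = None

    def __getstate__(self):
        # The whole record as a plain tuple.
        return (self.atime, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.atime, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state


w = WordInfo(0)
w.spamcount = 3

as_instance = pickle.dumps(w)              # class reference + state
as_tuple = pickle.dumps(w.__getstate__())  # state only

# Compare the two sizes; the instance pickle is the larger one.
print(len(as_instance), len(as_tuple))
```

The per-record overhead is small, but it is paid once for every word in the database.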
So my question is: would it be too painful to ditch WordInfo in favor of
a plain tuple? (Or a list if you'd rather, although making it a
tuple has the nice side effect of forcing you to play nice with my
DBDict class.)
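
One reading of "play nice", sketched in current Python 3: a dictionary that stores only pickles hands back a fresh copy on every lookup, so an in-place change to a mutable value is silently lost unless the value is assigned back. A tuple cannot be changed in place, so the assign-back is the only option. PickleDict is a hypothetical stand-in for DBDict, backed by an ordinary dict instead of a dbm file, and the three-field records are made up for illustration.

```python
import pickle


class PickleDict:
    """Hypothetical stand-in for DBDict: values exist only as pickled bytes."""

    def __init__(self):
        self._store = {}

    def __getitem__(self, key):
        # Every lookup unpickles a brand-new object.
        return pickle.loads(self._store[key])

    def __setitem__(self, key, val):
        self._store[key] = pickle.dumps(val)


d = PickleDict()

# With a list, an in-place update changes a temporary copy and is lost.
d["free"] = [0, 3, 1]
d["free"][1] += 1
list_after = d["free"]      # still [0, 3, 1]

# With a tuple, the same attempt fails loudly instead of silently.
d["money"] = (0, 3, 1)
try:
    d["money"][1] += 1
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# The working pattern: read, build a new tuple, assign it back.
atime, spamcount, hamcount = d["money"]
d["money"] = (atime, spamcount + 1, hamcount)
tuple_after = d["money"]    # (0, 4, 1)
```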
I hope this sort of optimization isn't too far from the
goal of this project, even though README.txt says it is :)
Diff attached. I'm not comfortable checking this in, since I don't
really like how it works (I'd rather just get rid of WordInfo). But I
guess it proves the point :)
Neale
---8<---
? classifier.pyc
? d
? ham.db
? ham.pickle
? ham.spamoracle
? hammie.pyc
? timtoken.pyc
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.5
diff -u -r1.5 hammie.py
--- hammie.py 6 Sep 2002 20:48:29 -0000 1.5
+++ hammie.py 6 Sep 2002 23:48:34 -0000
@@ -1,7 +1,8 @@
#! /usr/bin/env python
# A driver for the classifier module. Currently mostly a wrapper around
-# existing stuff.
+# existing stuff. Neale Pickett <neale@woozle.org> is the person to
+# blame for this.
"""Usage: %(program)s [options]
@@ -36,6 +37,7 @@
import errno
import anydbm
import cPickle as pickle
+from types import *
program = sys.argv[0]
@@ -69,11 +71,24 @@
     def __getitem__(self, key):
         if self.hash.has_key(key):
-            return pickle.loads(self.hash[key])
+            val = pickle.loads(self.hash[key])
+            # XXX: kludge kludge kludge. There's a more elegant
+            # solution, but this proves the concept for the time being.
+            if type(val) == TupleType \
+               and len(val) == len(classifier.WordInfo.__slots__):
+                # How does pickle pull this off?
+                w = classifier.WordInfo(0)
+                w.__setstate__(val)
+                val = w
+            return val
         else:
             raise KeyError(key)
-    def __setitem__(self, key, val):
+    def __setitem__(self, key, val):
+        # XXX: This has got to go when the __getitem__ kludge is cleaned
+        # up
+        if isinstance(val, classifier.WordInfo):
+            val = val.__getstate__()
         v = pickle.dumps(val, 1)
         self.hash[key] = v
@@ -84,7 +99,7 @@
         k = self.hash.first()
         while k != None:
             key = k[0]
-            val = pickle.loads(k[1])
+            val = self.__getitem__(key)
             if key not in self.iterskip:
                 if fn:
                     yield fn((key, val))
---8<---
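
On the "How does pickle pull this off?" comment in the diff: as far as I know, pickle builds the empty object through the class's __new__ and never runs __init__, which is why it needs no dummy constructor argument. Below is a sketch of the same two conversions done that way, in current Python 3 rather than the Python 2 of the patch. The helper names to_stored and from_stored are hypothetical, and WordInfo is a stand-in with assumed slot names.

```python
import pickle


class WordInfo:
    # Stand-in for classifier.WordInfo; the slot names are assumptions.
    __slots__ = ("atime", "spamcount", "hamcount", "killcount", "spamprob")

    def __init__(self, atime):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = None

    def __getstate__(self):
        return (self.atime, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.atime, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state


def to_stored(val):
    """Hypothetical helper: flatten a WordInfo to a tuple before pickling."""
    if isinstance(val, WordInfo):
        return val.__getstate__()
    return val


def from_stored(val):
    """Hypothetical helper: rebuild a WordInfo from a tuple after unpickling."""
    if isinstance(val, tuple) and len(val) == len(WordInfo.__slots__):
        w = WordInfo.__new__(WordInfo)  # allocate without calling __init__
        w.__setstate__(val)
        return w
    return val


original = WordInfo(7)
original.hamcount = 2

blob = pickle.dumps(to_stored(original))   # only the tuple is pickled
restored = from_stored(pickle.loads(blob))
```

Like the diff, this treats any tuple of the right length as a WordInfo, so it is only safe while nothing else of that shape is stored in the database.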