[Spambayes] Ditching WordInfo

Neale Pickett neale@woozle.org
06 Sep 2002 16:50:19 -0700


I hacked up something to turn WordInfo into a tuple before pickling, and
then turn the tuple back into WordInfo right after unpickling.  Without
this hack, my database was 21549056 bytes.  After, it's 9945088 bytes.
That's a savings of about 54%, not a bad optimization.
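
A minimal sketch of the convert-on-the-way-in, convert-on-the-way-out idea described above, written for Python 3 rather than the Python 2 of the attached diff. The WordInfo class here is a hypothetical stand-in (its field names are assumed, not taken from classifier.py), and TupleStore is an invented name that uses a plain dict in place of the anydbm-backed hash.

```python
import pickle


class WordInfo:
    """Hypothetical stand-in for classifier.WordInfo; field names are assumed."""
    __slots__ = ("atime", "spamcount", "hamcount", "killcount", "spamprob")

    def __init__(self, atime):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = None

    def __getstate__(self):
        # Flatten the record into a plain tuple, in __slots__ order.
        return tuple(getattr(self, name) for name in self.__slots__)

    def __setstate__(self, state):
        # Restore the fields from a tuple produced by __getstate__.
        for name, value in zip(self.__slots__, state):
            setattr(self, name, value)


class TupleStore:
    """Hypothetical dict-like store that pickles WordInfo as a bare tuple."""

    def __init__(self):
        self.hash = {}  # a plain dict standing in for the on-disk database

    def __setitem__(self, key, val):
        # Store only the tuple, so no class reference ends up in the pickle.
        if isinstance(val, WordInfo):
            val = val.__getstate__()
        self.hash[key] = pickle.dumps(val, 1)

    def __getitem__(self, key):
        val = pickle.loads(self.hash[key])
        # Same length-based guess as the diff: a tuple of the right size
        # is assumed to be a flattened WordInfo.
        if isinstance(val, tuple) and len(val) == len(WordInfo.__slots__):
            w = WordInfo(0)
            w.__setstate__(val)
            val = w
        return val


store = TupleStore()
info = WordInfo(1031356219)
info.spamcount = 3
store["viagra"] = info
restored = store["viagra"]
```

Note that the length check cannot tell a flattened WordInfo from any other tuple of the same size, which is one reason this is only a proof of concept.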

So my question is: would it be too painful to ditch WordInfo in favor of
a plain tuple?  (Or a list if you'd rather, although making it a
tuple has the nice side effect of forcing you to play nice with my
DBDict class.)
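
One possible reading of the "play nice" remark, sketched in Python 3: a store that unpickles on every lookup hands back a fresh copy, so an in-place change to that copy would never reach the database. With a tuple the in-place change fails outright, and the update has to be written back explicitly. The field order, the index constants, and the put/get helpers are all assumptions made for illustration; a plain dict stands in for DBDict.

```python
import pickle

# Assumed field order; the real WordInfo layout may differ.
ATIME, SPAMCOUNT, HAMCOUNT, KILLCOUNT, SPAMPROB = range(5)

db = {}  # a plain dict standing in for a DBDict of pickled values


def put(key, record):
    """Pickle the record tuple and store it under key."""
    db[key] = pickle.dumps(record, 1)


def get(key):
    """Return a fresh unpickled copy of the record stored under key."""
    return pickle.loads(db[key])


put("viagra", (0, 3, 0, 0, None))

record = get("viagra")
try:
    record[SPAMCOUNT] += 1  # tuples are immutable, so this raises TypeError
    mutated = True
except TypeError:
    mutated = False

# Build a new tuple with the bumped count and write it back explicitly.
updated = record[:SPAMCOUNT] + (record[SPAMCOUNT] + 1,) + record[SPAMCOUNT + 1:]
put("viagra", updated)
```

A list would accept the in-place increment without complaint, and the change would be lost unless the caller remembered to store the list again.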

I hope doing this sort of optimization isn't too far distant from the
goal of this project, even though README.txt says it is :)

Diff attached.  I'm not comfortable checking this in, since I don't
really like how it works (I'd rather just get rid of WordInfo).  But I
guess it proves the point :)

Neale

---8<---
? classifier.pyc
? d
? ham.db
? ham.pickle
? ham.spamoracle
? hammie.pyc
? timtoken.pyc
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.5
diff -u -r1.5 hammie.py
--- hammie.py	6 Sep 2002 20:48:29 -0000	1.5
+++ hammie.py	6 Sep 2002 23:48:34 -0000
@@ -1,7 +1,8 @@
 #! /usr/bin/env python
 
 # A driver for the classifier module.  Currently mostly a wrapper around
-# existing stuff.
+# existing stuff.  Neale Pickett <neale@woozle.org> is the person to
+# blame for this.
 
 """Usage: %(program)s [options]
 
@@ -36,6 +37,7 @@
 import errno
 import anydbm
 import cPickle as pickle
+from types import *
 
 program = sys.argv[0]
 
@@ -69,11 +71,24 @@
 
     def __getitem__(self, key):
         if self.hash.has_key(key):
-            return pickle.loads(self.hash[key])
+            val = pickle.loads(self.hash[key])
+            # XXX: kludge kludge kludge.  There's a more elegant
+            # solution, but this proves the concept for the time being.
+            if type(val) == TupleType \
+                   and len(val) == len(classifier.WordInfo.__slots__):
+                # How does pickle pull this off?
+                w = classifier.WordInfo(0)
+                w.__setstate__(val)
+                val = w
+            return val
         else:
             raise KeyError(key)
 
-    def __setitem__(self, key, val): 
+    def __setitem__(self, key, val):
+        # XXX: This has got to go when the __getitem__ kludge is cleaned
+        # up
+        if isinstance(val, classifier.WordInfo):
+            val = val.__getstate__()
         v = pickle.dumps(val, 1)
         self.hash[key] = v
 
@@ -84,7 +99,7 @@
         k = self.hash.first()
         while k != None:
             key = k[0]
-            val = pickle.loads(k[1])
+            val = self.__getitem__(key)
             if key not in self.iterskip:
                 if fn:
                     yield fn((key, val))

---8<---