06 Sep 2002 12:58:33 -0700
So then, Tim Peters <email@example.com> is all like:
> > ...
> > I don't know how big that pickle would be, maybe loading it each time
> > is fine. Or maybe marshalling.)
> My tests train on about 7,000 msgs, and a binary pickle of the database is
> approaching 10 million bytes.
My paltry 3000-message training set makes a 6.3MB (where 1MB=1e6 bytes)
pickle. hammie.py, which I just checked in, will optionally let you
write the database out to a dbm file. With that same message base, the dbm
file weighs in at a hefty 21.4MB. It also takes longer to write:
Using a database:
Using a pickle:
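(To make the comparison concrete, here's a rough sketch of the two
storage strategies -- this is not hammie.py itself, and the toy
database and filenames are made up. A dbm file stores each token as
its own key/value record, which adds per-record overhead; that's one
reason it comes out so much bigger than a single pickle of the dict.)

```python
import dbm
import pickle

# Toy word database: token -> (spamcount, hamcount).  Illustrative only.
db = {"spam": (42, 7), "ham": (3, 99)}

# Strategy 1: one binary pickle of the whole dict.
with open("counts.pik", "wb") as f:
    pickle.dump(db, f, protocol=pickle.HIGHEST_PROTOCOL)

# Strategy 2: a dbm file, one record per token.  Keys and values must
# be bytes, so each count tuple is pickled individually.
with dbm.open("counts.db", "n") as d:
    for token, counts in db.items():
        d[token.encode()] = pickle.dumps(counts)
```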
This is on a PIII at 551.257MHz (I don't know what it's *supposed* to
be; 551.257 is what /proc/cpuinfo says).
For comparison, SpamOracle (currently the gold standard in my mind, at
least for speed) on the same data blazes along:
Its data file, which appears to be a marshalled hash, is 448KB.
However, it's compiled O'Caml, and it uses a much simpler tokenizing
algorithm written with a lexical analyzer (ocamllex), so we'll never be
able to outperform it. It's something to keep in mind, though.
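(Python's marshal module can serialize a plain dict of core types too,
and for simple data it tends to be faster and more compact than pickle
-- roughly what SpamOracle's data file looks like to me. A sketch; the
dict here is a stand-in, not SpamOracle's actual format:)

```python
import marshal
import pickle

# Toy token-count hash.  marshal only handles core types
# (dict/str/int/tuple/...), which is exactly what we have here.
counts = {"spam": 42, "ham": 3}

m = marshal.dumps(counts)
p = pickle.dumps(counts, protocol=pickle.HIGHEST_PROTOCOL)

# Both round-trip losslessly; marshal skips pickle's class machinery.
assert marshal.loads(m) == counts
assert pickle.loads(p) == counts
```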
I don't have statistics yet for scanning unknown messages. (Actually, I
do, and the database blows the pickle out of the water, but it scores
every word with 0.00, so I'm not sure that's a fair test. ;) In any
case, 21MB per user is probably too large, and 10MB is questionable.
On the other hand, my pickle compressed very well with gzip, shrinking
down to 1.8MB.
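(And gzipping doesn't have to be a separate step -- pickle is happy to
write through a gzip file object, so the on-disk copy can be compressed
on the way out and decompressed on the way back in. A sketch, with a
made-up filename:)

```python
import gzip
import pickle

db = {"spam": (42, 7), "ham": (3, 99)}   # toy database

# Write the pickle through gzip so it's compressed on disk.
with gzip.open("counts.pik.gz", "wb") as f:
    pickle.dump(db, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reading it back is symmetric.
with gzip.open("counts.pik.gz", "rb") as f:
    restored = pickle.load(f)
```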