[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Charles Cazabon python@discworld.dyndns.org
Thu, 22 Aug 2002 10:14:32 -0600


This brings up a couple of questions, one related to the theory behind this
Bayesian spam filtering, and one about Python optimization ... apologies in
advance for the long post.

Tim Peters <tim.one@comcast.net> wrote:
> 
> I want to raise a caution here.  Graham pulled his formulas out of thin air,
> and one part of the scoring step is quite dubious.  This requires detail to
> understand.

[detail deleted]
 
>                P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
> 
> This isn't what Graham computes, though:  the P(S) and P(not-S) terms are
> missing in his formulation.  Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
> 
>            P(S|X)*P(S|Y)
> -------------------------------------
> P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
> 
> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
> of the naive Bayesian result.

Is there an easy fix to this problem?  I implemented this in Python after
reading about it on the weekend, and it might explain why my results are not
quite as fabulous as the author reported (I'm getting more false positives
than he claimed to get).  Note that I'm not so good with the above notation;
I'm more at home with plain algebraic stuff :).
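
If I'm reading Tim's formulas right, the fix is just to put the P(S) and
P(not-S) terms back in, with the prior estimated from the training corpus.
A minimal sketch of the two-word case (p_spam is the assumed prior; the
name is mine, not Graham's):

    # Combine two per-word spam probabilities the full Bayesian way.
    # p_x, p_y are P(S|X) and P(S|Y); p_spam is the prior P(S), e.g.
    # (number of spam messages) / (total messages) in the corpus.
    def combine(p_x, p_y, p_spam):
        num = p_x * p_y / p_spam
        den = num + (1 - p_x) * (1 - p_y) / (1 - p_spam)
        return num / den

    # Graham's formula is the special case combine(p_x, p_y, 0.5);
    # any other prior gives a different (undistorted) score.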

But the more interesting Python question:  I'm running into some performance
problems with my implementation.  Details:

The analysis stage of my implementation (I'll refer to it as "spamalyzer" for
now) stores the "mail corpus" and term list on disk.  The mail corpus is two
dictionaries (one for spam, one for good mail), each of which contains two
further dictionaries -- one maps the filenames of analyzed messages to a
dummy value of 0 (only the keys matter), and the other maps terms to their
number of occurrences.  The terms list is a single dictionary mapping each
term to a pair of floats (its probability of indicating spam, and its
distance from 0.5).
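
Concretely, the on-disk layout looks roughly like this (illustrative names
and values only, not the real data):

    corpus = {
        'good': {
            'files': {'msg-0001': 0, 'msg-0002': 0},  # keys matter, values don't
            'terms': {'python': 42, 'cheap': 1},      # term -> occurrence count
        },
        'spam': {
            'files': {'spam-0001': 0},
            'terms': {'cheap': 35, 'python': 2},
        },
    }

    # term -> (probability of spam, distance from 0.5)
    terms = {'cheap': (0.97, 0.47), 'python': (0.03, 0.47)}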

My first try at this used cPickle to store these items, but loading them back
in was excruciatingly slow.  On a lightly loaded P3-500/128MB box running
Linux 2.2.x, each of the following is a separate run of a benchmarking Python
script:

Loading corpus
--------------

pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 65.190000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 64.790000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 65.010000000000 seconds.

Loading terms
-------------

pickle method: got 12986 terms in 3.460000000000 seconds.
pickle method: got 12986 terms in 3.470000000000 seconds.
pickle method: got 12986 terms in 3.450000000000 seconds.
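
The benchmark script itself is nothing fancy -- each run boils down to
something like this (a sketch, not my exact code):

    import time
    import cPickle

    def time_pickle_load(path):
        # Time a single cPickle.load() of the data from disk.
        start = time.time()
        f = open(path, 'rb')
        data = cPickle.load(f)
        f.close()
        print 'pickle method: loaded in %.2f seconds' % (time.time() - start)
        return data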

For a lark, I decided to try an alternative way of storing the data (and no, I
haven't tried the marshal module directly).  I wrote a function to write the
contents of the dictionary to a text file in the form of Python source, so
that you can re-load the data with a simple "import" command.  To my surprise,
this was significantly faster!  The first import, of course, takes a while, as
the interpreter compiles the .py file to .pyc format, but subsequent runs are
an order of magnitude faster than cPickle.load():

Loading corpus
--------------

[charlesc@charon spamalyzer]$ rm mail_corpus.pyc 

custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 194.210000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 3.500000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 3.260000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 3.260000000000 seconds.

Loading terms
-------------

[charlesc@charon spamalyzer]$ rm terms.pyc

custom method: got 12986 terms in 3.110000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
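
For reference, the dump/load pair is essentially this (a simplified sketch
of the idea, not my actual code):

    # Write a dictionary out as an importable Python module.
    def dump_as_module(obj, name):
        f = open(name + '.py', 'w')
        # repr() of nested dicts of strings/numbers is valid Python source.
        f.write('data = %s\n' % repr(obj))
        f.close()

    # Re-load it; the first import compiles name.py to name.pyc, and
    # subsequent imports (in fresh processes) just read the .pyc.
    def load_module(name):
        module = __import__(name, {}, {}, [])
        return module.data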


So the big question is:  why is my naive "o = __import__ (f, {}, {}, [])" so
much faster than the more obvious "o = cPickle.load (f)"?  And what can I do
to make it faster? :)
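
One thing I should try before anything exotic (an untested guess on my
part): cPickle defaults to its text format, and dump() takes an optional
third argument that selects the binary format, which is supposed to load
much faster.  My guess at the "why", too: a .pyc is basically a marshalled
code object, so the import route is effectively going through marshal,
which is far leaner than a text-mode pickle.

    import cPickle

    def save_binary(obj, path):
        f = open(path, 'wb')
        cPickle.dump(obj, f, 1)   # third arg true = binary pickle format
        f.close()

    def load_binary(path):
        f = open(path, 'rb')
        obj = cPickle.load(f)
        f.close()
        return obj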

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                           <python@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------