[Spambayes] proposed changes to hammie & co.

Fri Nov 22 15:07:38 2002

11/22/2002 2:58:06 AM, Rob Hooft <rob@hooft.net> wrote:

>T. Alexander Popiel wrote:
>> In message:  <w53y97nxxof.fsf@woozle.org>
>>              Neale Pickett <neale@woozle.org> writes:
>> 
>>>So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:
>>>
>>>
>>>>In message:  <w53d6ozzhyt.fsf@woozle.org>
>>>>             Neale Pickett <neale@woozle.org> writes:
>>>>
>>>>>I'm currently entwined with mucking the heck out of WordInfo.  I've got
>>>>>a neato scheme based on Alex's patch and comments where the WordInfo
>>>>>classes still compute their own probabilities, but also keep a revision
>>>>>number which is compared against a MetaInfo class.
>>>>
>>>>Eww, do we gotta?  I thought I was trying to make the DB smaller. ;-)
>>>
>>>Ah, but the only thing *stored* is (spamcount, hamcount).  The
>>>probability is calculated the first time you ask for it.  If you don't
>>>update nspam or nham, the next time you ask for it it gives the cached
>>>value.  So the database is small, but you still get the in-memory
>>>probability caching if you're using a pickle or ZODB.
>> 
>> 
>> Sounds like there is no caching benefit for one-message-per-invocation
>> situations like running out of procmail, then.  
>
>Is this calculation for the few words in one message really 
>time-determining? There is another way of caching: Make a dictionary 
>that maps count-tuples to spam probabilities.
>
>  (1,0) -> 0.155
>  (0,1) -> 0.844
>etc.
>
Yeah, this is an interesting idea.  Caching is the right way to do this, not 
pre-calculating, because the number of possible count tuples is combinatorially 
large and open-ended.  But... once you've calculated the probability for a 
given tuple, you shouldn't have to do it again.  The tuple:prob cache *could* 
be persistent, but I doubt there's much to be gained by that.

- TimS

>I definitely wouldn't move the calculation into the wordinfo class. It 
>is a different task, so it "should" (design) be a separate class....
>
>Rob

*****module probability*****
class ProbabilityCache:
    def __init__(self):
        # maps nham -> {nspam: probability}
        self.probcache = {}

    def prob(self, nham, nspam):
        try:
            prob = self.probcache[nham][nspam]
        except KeyError:
            prob = calcprob(nham, nspam)
            # setdefault creates the inner dictionary the first time
            # a given nham is seen
            self.probcache.setdefault(nham, {})[nspam] = prob

        return prob

def calcprob(nham, nspam):
    # code moved here from _update_probability in WordInfo class
    raise NotImplementedError
***************************

....or something of that nature.  Maybe Adam Huff's NumPy vectorization stuff 
might play well into something like this.
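
To show how this would be used, here is a small runnable sketch of the same 
cache.  The calcprob below is a made-up placeholder (a smoothed spam fraction), 
NOT the real calculation from WordInfo; the calc argument and the misses 
counter are also my own additions, there only so the effect of the cache can 
be seen.

```python
# Sketch only: calcprob is a hypothetical stand-in, not the Spambayes formula.

class ProbabilityCache:
    """Remember the probability computed for each (nham, nspam) pair."""

    def __init__(self, calc):
        self.calc = calc        # function (nham, nspam) -> probability
        self.probcache = {}     # nham -> {nspam: probability}
        self.misses = 0         # how many times calc was actually called

    def prob(self, nham, nspam):
        inner = self.probcache.setdefault(nham, {})
        try:
            return inner[nspam]
        except KeyError:
            self.misses += 1
            p = inner[nspam] = self.calc(nham, nspam)
            return p


def calcprob(nham, nspam):
    # Placeholder: smoothed fraction of spam occurrences.
    return (nspam + 0.5) / (nham + nspam + 1.0)


cache = ProbabilityCache(calcprob)
for counts in [(1, 0), (0, 1), (1, 0), (3, 7), (0, 1)]:
    cache.prob(*counts)

print("5 lookups, %d calculations" % cache.misses)
```

Only the distinct count pairs trigger a calculation; repeats come straight 
out of the dictionary.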

Incidentally, in a quick test a dictionary of dictionaries has faster lookup 
than a dictionary keyed by a constructed tuple (roughly 3.4 s versus 4.3 s for 
1.25 million lookups on my machine):

import time

# Test 1: dictionary of dictionaries, looked up as x[i][j]
x = {}
for i in range(500):
    x[i] = {}
    for j in range(500):
        x[i][j] = 1
t1s = time.time()
for k in range(5):
    for i in range(500):
        for j in range(500):
            a = x[i][j]
t1e = time.time()

# Test 2: one dictionary keyed by a constructed tuple, looked up as x[(i,j)]
x = {}
for i in range(500):
    for j in range(500):
        x[(i,j)] = 1
t2s = time.time()
for k in range(5):
    for i in range(500):
        for j in range(500):
            a = x[(i,j)]
t2e = time.time()

print('test 1 time = %s' % (t1e - t1s))
print('test 2 time = %s' % (t2e - t2s))
*****
Four executions:
test 1 time = 3.41499996185
test 2 time = 4.41600000858

test 1 time = 3.375
test 2 time = 4.28600001335

test 1 time = 3.41500008106
test 2 time = 4.18599998951

test 1 time = 3.46500003338
test 2 time = 4.23699998856
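
The numbers above are from one machine and one interpreter, so treat them as 
a rough indication only.  A variant of the same comparison using the standard 
timeit module might look like the sketch below; the size N, the function 
names, and the repeat counts are my own choices, and the outcome may well 
differ between Python versions.

```python
import timeit

N = 200  # smaller than the 500 used above, just to keep the run short

# Build both structures with identical contents.
nested = {}
flat = {}
for i in range(N):
    nested[i] = {}
    for j in range(N):
        nested[i][j] = 1
        flat[(i, j)] = 1


def lookup_nested():
    # Two dictionary lookups per element: nested[i], then [j].
    total = 0
    for i in range(N):
        for j in range(N):
            total += nested[i][j]
    return total


def lookup_flat():
    # One tuple construction plus one dictionary lookup per element.
    total = 0
    for i in range(N):
        for j in range(N):
            total += flat[(i, j)]
    return total


# Take the best of several repeats to reduce noise from other processes.
t_nested = min(timeit.repeat(lookup_nested, number=5, repeat=3))
t_flat = min(timeit.repeat(lookup_flat, number=5, repeat=3))

print("nested dicts: %.4f s" % t_nested)
print("tuple keys:   %.4f s" % t_flat)
```

Both functions visit the same N*N entries, so any difference in the two times 
comes from the lookup style rather than from the data.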

- TimS

>
>
>-- 
>Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/
>
- Tim
www.fourstonesExpressions.com