[Spambayes] proposed changes to hammie & co.
Tim Stone - Four Stones Expressions
tim@fourstonesExpressions.com
Fri Nov 22 15:07:38 2002
11/22/2002 2:58:06 AM, Rob Hooft <rob@hooft.net> wrote:
>T. Alexander Popiel wrote:
>> In message: <w53y97nxxof.fsf@woozle.org>
>> Neale Pickett <neale@woozle.org> writes:
>>
>>>So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:
>>>
>>>
>>>>In message: <w53d6ozzhyt.fsf@woozle.org>
>>>> Neale Pickett <neale@woozle.org> writes:
>>>>
>>>>>I'm currently entwined with mucking the heck out of WordInfo. I've got
>>>>>a neato scheme based on Alex's patch and comments where the WordInfo
>>>>>classes still compute their own probabilities, but also keep a revision
>>>>>number which is compared against a MetaInfo class.
>>>>
>>>>Eww, do we gotta? I thought I was trying to make the DB smaller. ;-)
>>>
>>>Ah, but the only thing *stored* is (spamcount, hamcount). The
>>>probability is calculated the first time you ask for it. If you don't
>>>update nspam or nham, the next time you ask for it it gives the cached
>>>value. So the database is small, but you still get the in-memory
>>>probability caching if you're using a pickle or ZODB.
>>
>>
>> Sounds like there is no caching benefit for one-message-per-invocation
>> situations like running out of procmail, then.
>
>Is this calculation for the few words in one message really
>time-determining? There is another way of caching: Make a dictionary
>that maps count-tuples to spam probabilities.
>
> (1,0) -> 0.155
> (0,1) -> 0.844
>etc.
>
Yeah, this is an interesting idea. Caching is the right way to do this, not
pre-calculating, because the number of possible count tuples is combinatorially
large and open-ended. But... once you've calculated the probability for a given
tuple, you shouldn't have to do it again. The tuple:prob cache *could* be
persistent, but I doubt there's much to be gained by that.
- TimS
>I definitely wouldn't move the calculation into the wordinfo class. It
>is a different task, so it "should" (design) be a separate class....
>
>Rob
*****module probability*****
# probcache is created in __init__; calcprob is a module-level function
class ProbabilityCache:
    def __init__(self):
        # maps nham -> {nspam -> probability}
        self.probcache = {}

    def prob(self, nham, nspam):
        try:
            prob = self.probcache[nham][nspam]
        except KeyError:
            prob = calcprob(nham, nspam)
            # setdefault creates the inner dict the first time nham is seen
            self.probcache.setdefault(nham, {})[nspam] = prob
        return prob

def calcprob(nham, nspam):
    # code moved here from _update_probability in WordInfo class
    raise NotImplementedError
***************************
....or something of that nature. Maybe Adam Huff's NumPy vectorization stuff
might play well into something like this.
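
For what it's worth, here is a rough, self-contained sketch of how a cache
like that could be exercised. It assumes the calculation is handed in as a
plain function, which keeps it out of WordInfo as Rob suggests. toy_calc is a
placeholder and not the Spambayes formula, and the misses counter is a
hypothetical addition, there only to show how often the calculation really
runs.

```python
class ProbabilityCache(object):
    """Nested-dict cache: nham -> {nspam -> probability}."""

    def __init__(self, calc):
        self.calc = calc        # any function (nham, nspam) -> float
        self.probcache = {}
        self.misses = 0         # number of times calc actually ran

    def prob(self, nham, nspam):
        try:
            return self.probcache[nham][nspam]
        except KeyError:
            self.misses += 1
            p = self.calc(nham, nspam)
            # The inner dict may not exist yet, so create it on demand.
            self.probcache.setdefault(nham, {})[nspam] = p
            return p


def toy_calc(nham, nspam):
    # Placeholder only, not the Spambayes formula.
    return (nspam + 0.5) / (nham + nspam + 1.0)


cache = ProbabilityCache(toy_calc)
for pair in [(1, 0), (0, 1), (1, 0), (1, 2), (0, 1)]:
    cache.prob(*pair)
print(cache.misses)   # 3: one calculation per distinct count pair
```

Note that the KeyError can come from either level of the lookup, so the
handler has to cope with a missing inner dictionary as well as a missing
nspam entry.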
Incidentally, a dictionary of dictionaries has faster lookup than a dictionary
keyed by a constructed tuple, at least in this quick timing test:
*****
import time

# Test 1: dictionary of dictionaries, x[i][j]
x = {}
for i in range(500):
    x[i] = {}
    for j in range(500):
        x[i][j] = 1

t1s = time.time()
for k in range(5):
    for i in range(500):
        for j in range(500):
            a = x[i][j]
t1e = time.time()

# Test 2: one dictionary keyed by a tuple built on every lookup, x[(i, j)]
x = {}
for i in range(500):
    for j in range(500):
        x[(i, j)] = 1

t2s = time.time()
for k in range(5):
    for i in range(500):
        for j in range(500):
            a = x[(i, j)]
t2e = time.time()

print('test 1 time = %s' % (t1e - t1s))
print('test 2 time = %s' % (t2e - t2s))
*****
Four executions:
test 1 time = 3.41499996185
test 2 time = 4.41600000858
test 1 time = 3.375
test 2 time = 4.28600001335
test 1 time = 3.41500008106
test 2 time = 4.18599998951
test 1 time = 3.46500003338
test 2 time = 4.23699998856
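
If anyone wants to repeat the comparison, the standard timeit module is
probably a steadier way to measure it than bracketing the loops with
time.time(). This is only a sketch: N is cut down to 100 to keep the run
short, the names nested, flat, read_nested and read_flat are made up for the
example, and the numbers will depend on the machine and the Python version.

```python
import timeit

N = 100   # smaller than the 500 used above, to keep the run short

# Dictionary of dictionaries.
nested = {}
for i in range(N):
    nested[i] = {}
    for j in range(N):
        nested[i][j] = 1

# Single dictionary keyed by (i, j) tuples.
flat = {}
for i in range(N):
    for j in range(N):
        flat[(i, j)] = 1


def read_nested():
    for i in range(N):
        for j in range(N):
            a = nested[i][j]


def read_flat():
    for i in range(N):
        for j in range(N):
            a = flat[(i, j)]


# Take the best of three runs, each making five passes over all N*N keys.
t_nested = min(timeit.repeat(read_nested, number=5, repeat=3))
t_flat = min(timeit.repeat(read_flat, number=5, repeat=3))
print('nested dicts: %.4f s' % t_nested)
print('tuple keys:   %.4f s' % t_flat)
```

Both loops carry the same range() overhead, so only the difference between
the two figures says anything about the lookups themselves.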
- TimS
>
>
>--
>Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
>
- Tim
www.fourstonesExpressions.com