[Spambayes] Using mxBeeBase as hammie DB

M.-A. Lemburg mal@lemburg.com
Thu Oct 17 16:42:24 2002


Brad Clements wrote:
> On 17 Oct 2002 at 16:19, M.-A. Lemburg wrote:
> 
> 
>>>What operating system, and how much RAM do you have?
>>
>>SuSE Linux 8 on 1GB RAM. But why would that matter? The process
>>size is only 4.8MB.
> 
> Two thoughts:
> 
> 1. you ran the test at least once before timing it, so Python and other stuff was probably 
> "still in ram"   Not exactly sure how Linux pages things, but on Windows this statement 
> would most likely be true.

The times come directly from the system's time command and
are user + system times (not wall clock). And yes, things were
most probably still in memory since I always run the tests
a few times and then take the numbers from the last test.

> 2. with less ram, you're more likely to need to throw out something to load Python and 
> stuff (especially on Windows OS).

True.

> I just found the "load time" to be extremely low for a typical office worker box. You don't 
> appear to have a typical box.

Hmm, this is a standard SuSE installation and not even an up-to-date
machine (1.2GHz is only half the speed of today's boxes). I am running
Reiser FS if that makes any difference.

> If your box is typical, is your company hiring?  ;-)

Unfortunately, not. Bad times these days...

> Note I'm not slighting Python, since the load time is a given no matter what. Just 
> wanted to know how you achieved the low load time.

Could be that the file system is using some smart caching
technique which makes the dozens of stat calls at Python
startup time rather fast.

> Regarding the 23 megabytes... well, to run this on an IMAP server supporting 100 
> users. That's a lot of disk space. I realize the context switching from one "user" to the 
> next wouldn't be so bad using a database. If you were using a pickle, argh!

I suppose that you can easily create and use multiple spam
databases, e.g. have a central one for the whole company
which only masks standard spam and then use smaller ones per user
which override the settings in the main one if needed. Sort
of like:

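# pseudocode: open() stands for whatever opens the dictionary database
md = open(maindict)    # central, company-wide database
ud = open(userdict)    # small per-user database with overrides
value = ud.get(key)
if value is None:
    # no per-user entry: fall back to the main database
    value = md[key]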
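
A runnable version of the same idea, as a minimal sketch only: plain
Python dicts stand in for the two databases, and the lookup() helper,
its default argument and the sample words and values are all made up
for illustration (none of this is hammie or mxBeeBase API).

```python
def lookup(key, user_db, main_db, default=None):
    """Return the per-user value for key, falling back to the shared database."""
    value = user_db.get(key)
    if value is None:
        # No per-user entry: use the company-wide one (or the default).
        value = main_db.get(key, default)
    return value


# Hypothetical sample data: word -> spam probability.
main_db = {"viagra": 0.99, "python": 0.01, "meeting": 0.20}
user_db = {"meeting": 0.05}  # this user overrides one shared entry

result_override = lookup("meeting", user_db, main_db)
result_shared = lookup("viagra", user_db, main_db)
result_missing = lookup("unknown", user_db, main_db, default=0.5)
```

With this layout the per-user databases only need to hold the entries
that differ from the central one, which should keep them small.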

The database size only increases as more words find their
way into it. I'm not sure, but perhaps it's possible to filter
the entries and remove meaningless ones (those with ~50%
spam level).

No idea. This time I'm a user, not a developer ;-)
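
For what it's worth, the filtering idea above could look roughly like
this. It is only a sketch under assumptions: the database is taken to
be a plain mapping of word -> spam probability, and prune_neutral()
and its band parameter are hypothetical names, not something hammie
provides. Whether dropping these entries hurts the classifier is
untested.

```python
def prune_neutral(db, band=0.1):
    """Return a copy of db without entries within `band` of a 0.5 spam level."""
    return {word: p for word, p in db.items() if abs(p - 0.5) > band}


# Hypothetical sample data: word -> spam probability.
db = {"viagra": 0.99, "the": 0.51, "python": 0.01, "today": 0.45}
pruned = prune_neutral(db)
```

Here "the" and "today" are dropped because they sit close to 50%,
while the strongly spammy and strongly hammy words are kept.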

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/