[spambayes-dev] SpamBayes core_server.py and related bits merged to CVS HEAD

Thu Jun 14 13:53:08 CEST 2007

Hello,

The 100MB files should not be a big problem, because we could 
truncate them to their headers (or just a part of them) and send them as 
normal features to SB.
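
A minimal sketch of that truncation, assuming the attachment is already 
in memory as bytes. The function name and the 1MB default are 
illustrative only (the limit echoes Skip's "< 1MB?" guess below); this 
is not part of SpamBayes:

```python
def truncate_payload(data: bytes, limit: int = 1024 * 1024) -> bytes:
    """Return at most `limit` leading bytes of an attachment.

    Hypothetical helper: the idea is that the start of a large binary
    file (its headers) is enough to score, so the rest is dropped
    before the score request is sent.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return data[:limit]


if __name__ == "__main__":
    big = b"\x00" * (3 * 1024 * 1024)   # stand-in for a large attachment
    small = truncate_payload(big)
    print(len(big), "->", len(small))
```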

We should use a remote SB classifier only when the network and 
classifier are trusted. When this is not the case we could do the 
parsing locally, hash the tokens, build the feature vector and send it 
to the remote classifier. This way the local application would not 
disclose sensitive information.
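
Roughly what I have in mind for the untrusted case, as a sketch only. 
The tokenizer here is a naive regex and not the real SpamBayes 
tokenizer, and the function name and the key are made up for the 
example. I use a keyed hash (HMAC) because a plain unkeyed hash of 
common words can be reversed with a dictionary:

```python
import hashlib
import hmac
import re


def hashed_features(text: str, key: bytes = b"local-secret") -> list[str]:
    """Tokenize locally and return the sorted set of hashed tokens.

    Hypothetical helper: only these digests would be sent to the
    remote classifier, never the original text. The key stays on the
    local side.
    """
    # Naive tokenizer (assumption): lowercase runs of letters, digits, quotes.
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(
        hmac.new(key, tok.encode("utf-8"), hashlib.sha256).hexdigest()
        for tok in tokens
    )


if __name__ == "__main__":
    for digest in hashed_features("Buy cheap pills, buy now"):
        print(digest)
```

The same key has to be used for training and for scoring, otherwise 
the hashed tokens will not match what the remote classifier has seen.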

Another important problem is related to using a single classifier for 
several applications (possibly with totally different content). IMHO 
the spam for one application might be totally different from the spam 
of another, or to be more exact: the ham/spam features might differ. In 
this case the result of the classification might not be relevant. "My 
SPAM is not your SPAM" :)

M.

skip at pobox.com wrote:
>     skip> For Reimar and Marian (the MoinMoin gurus), I did a very little
>     skip> bit of performance testing.  Roundtrip performance on my laptop
>     skip> (Mac PowerBook G4 - 800MHz) with both the server and client
>     skip> running on the same machine ranged anywhere from 10-50 bytes/ms.
>     skip> When I added a large payload (a MIME encoded JPEG file of 9.5MB)
>     skip> performance in terms of bytes/ms shot way up, but as you would
>     skip> imagine overall time did as well.  Here are some figures:
>
>     skip>     attachment     time          bytes/ms
>     skip>        size
>     skip>     9587824        30.7 sec      312
>     skip>      975978         3.7 sec      259
>     skip>      114794         0.5 sec      252
>     skip>       28675         0.2 sec      142
>
> I probably should have drawn some inferences from this.  First, if you
> really try to score 100MB payloads (Reimar & Marian suggested that some
> people routinely attach 100MB Word (I think) files to wikis), you're going
> to be disappointed.  Second, although attachments of that size would be
> problematic, since SpamBayes doesn't examine the guts of binary data,
> there's probably nothing wrong with trimming the binary file to a reasonable
> size (< 1MB?) and including that trimmed version in the score request.
>
> Also, note that I've really done nothing with non-ASCII data to this point.
> I suspect people more familiar with that will see a clear path to sanity
> fairly easily.
>
> Skip
>
>