[Spambayes] Recently discovered this work.

Tim Peters tim.one@comcast.net
Fri Nov 1 20:24:42 2002


[François Granger]
> Config: MacOS 9.1 MacPython 2.2.2

I don't have experience with either, but others here do.  Keep posting here
until they admit it <wink>.

> I started developing the idea of Bayesian filtering at the end of
> August after reading the article. It took me time (spare time) to
> arrive at the point where I had a set of scripts with most of the
> functionality needed to do it. A few days ago I discovered the work
> you have done. I guess I can stop my development because it can't
> compare to yours.

Unclear:  at least yours worked for you!

> Along the way, I discovered two issues.
>
> The email package is fragile at decoding Eudora messages with
> enclosures, whether I get them by OSA (similar to COM on Windows) or by
> direct access to the mbox files. I went back to using the rfc822
> package instead because it was more robust, if less sophisticated. I
> don't know if this comes from Eudora not conforming to the
> standards.

I don't know either; we would need specific examples.

> I downloaded your software and tried to use the tokenizer on my
> stored mail messages to understand how it works. I can't make
> it work, even after modifying it a little. If anyone is interested, I
> wrote a small script to show the issue, and I can send both the
> script and a mail message on which it hangs.

Hangs?  That's hard to imagine -- there aren't any unbounded loops in the
tokenizer.  It could be that a regexp search is taking a very long time,
although I tried to cut the legs off that possibility too.

> As a side note, I have two more questions.
>
> The current software, as downloaded from SF on Oct 29, seems to be
> difficult to use on MacOS 9. I would be interested in having the
> POP3 proxy version working. The other way of using such a filter
> would be to have "plug-ins" to interact with the various mail
> clients. I implemented this in my development and have three plug-ins:
> for mail stored as files, for Eudora, and for Entourage. They are not
> really nice, but the idea is there.

Sorry, I didn't find a question in there.

> What about a multilingual situation? On average, I think my spam
> splits like this: 80% is English, 12% is French, 5% is Spanish, and
> 3% is German. That's not counting Asian ones, which I easily filter on
> encoding and strange chars.

There appears to be no need to special-case Asian spam with this code.  It
generates a bunch of tokens that are virtually unique to Asian spam, and
they quickly get very high spamprobs upon training.  The non-default option

[Tokenizer]
replace_nonascii_chars: True

accelerates learning for Asian spam, but at the cost of replacing *all*
high-bit chars.
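
To make that cost concrete, here is a minimal sketch of what replacing
high-bit characters amounts to.  The helper name and the "?" placeholder are
assumptions made for illustration; this is not the tokenizer's actual code.

```python
import re

# Hypothetical helper for illustration; not the Spambayes implementation.
# Matches any character outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]")


def replace_nonascii(text, placeholder="?"):
    """Replace every non-ASCII character with a single placeholder."""
    return _NON_ASCII.sub(placeholder, text)


# Asian text collapses to a run of placeholders ...
print(replace_nonascii("\u4e2d\u6587 spam"))
# ... but the accented letters in French text are replaced as well.
print(replace_nonascii("courrier \u00e9lectronique"))
```

The second call shows the trade-off: accented letters in French, Spanish or
German ham lose their identity along with the Asian characters.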

> How would this technique do in such a situation?

Can't say:  you didn't say how much of your ham (non-spam) is English,
French, Spanish and German.  I expect it will work fine, as all those
languages (as opposed to some Asian languages) use whitespace too, and the
tokenizer merely splits on whitespace.  This code is *certainly* better than
I am at distinguishing ham from spam in non-English languages, but that's
not saying much.  Try it!
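
A rough sketch of why whitespace matters here, using only plain string
splitting.  The function is a stand-in written for this example, far simpler
than the real tokenizer, and the sample strings are made up.

```python
def tokenize(text):
    """Split on runs of whitespace; no language knowledge is involved."""
    return text.split()


# English, French and German all yield one token per word.
for sample in ("cheap pills now",
               "les implications politiques",
               "Werkzeuge und Technologien"):
    print(tokenize(sample))

# Text written without spaces comes out as a single long token.
print(tokenize("\u8fd9\u662f\u5783\u573e\u90ae\u4ef6"))
```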

> I started to develop a language discriminator in order to
> automatically sort by main language and then use frequency databases
> for each language. I don't know if this is needed?

I doubt it's necessary, and somewhat doubt it would even be helpful.  The
tokenizer has no concept of semantics, it's just crunching strings, and
doesn't know beans about English as opposed to anything else.  You may need
more training to get comparable results, or maybe not.  Nobody has tested
this yet.

> --
> Le courrier électronique est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies.

My personal email classifier was sure your msg was ham:

Spam Score: 2.82082e-007

'*H*'                          0.999999
'*S*'                          6.70774e-009

but some of the French words in your sig had high spamprobs:

'sur'                          0.908163
'les'                          0.969799
'est'                          0.973373

This reflects that I personally get a lot more French spam than French ham.
Your classifier is very likely to score these differently, and that's a
great strength of the system for personal use; do note that it has no idea
these words *are* French.  It doesn't even know they're words, for that
matter <wink>.
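
As a rough illustration of why two people's classifiers score the same token
differently, here is a simplified per-token estimate based only on how often
a token shows up in each person's trained spam and ham.  This is a sketch
under stated assumptions: the formula is a plain frequency ratio, not the
exact calculation the classifier performs, and all the counts below are
invented.

```python
def spamprob(spam_count, ham_count, nspam, nham):
    """Simplified estimate of how spammy a token is for one user.

    spam_count / ham_count: trained messages of each kind containing the token.
    nspam / nham: total trained spam and ham messages (both assumed > 0).
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    if spam_ratio + ham_ratio == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_ratio / (spam_ratio + ham_ratio)


# Invented counts: someone who sees 'les' almost only in spam ...
print(spamprob(spam_count=60, ham_count=2, nspam=1000, nham=1000))
# ... versus someone whose everyday mail is largely French.
print(spamprob(spam_count=50, ham_count=400, nspam=1000, nham=1000))
```

The same string lands near 1 for the first user and near 0 for the second,
purely because of what each one trained on.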