[Spambayes] Recently discovered this work.

François Granger francois.granger@free.fr
Fri Nov 1 17:47:10 2002


Config: MacOS 9.1 MacPython 2.2.2

I started developing on the idea of Bayesian filtering at the end of
August after reading the article. It took me time (spare time) to
reach the point where I had a set of scripts with most of the
functionality needed to do it. A few days ago I discovered the work
you have done. I guess I can stop my development because it can't
compare to yours.

Along the way, I discovered two issues.

The email package is fragile when decoding Eudora messages with
enclosures, whether I get them by OSA (similar to COM on Windows) or
by direct access to the mbox files. I went back to using the rfc822
module instead because it was more robust, if less sophisticated. I
don't know if this comes from Eudora not conforming to the
standards.
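
A minimal sketch of the fallback idea described above: try full MIME
parsing first, and drop back to header-only parsing if it fails. This
is an illustration, not the code from the original scripts. It uses
the current standard-library email package (the rfc822 module no
longer exists in Python 3), and parse_leniently is a hypothetical
helper name.

```python
import email
from email.parser import HeaderParser


def parse_leniently(raw_text):
    """Return (message, body_text) for a raw message string.

    Hypothetical helper: full MIME parsing is attempted first; if it
    raises, only the headers are parsed and the rest of the message is
    returned undecoded, much as a header-only parser would do.
    """
    try:
        msg = email.message_from_string(raw_text)
        parts = []
        for part in msg.walk():
            # Keep only textual parts; enclosures are skipped.
            if part.get_content_maintype() != "text":
                continue
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            charset = part.get_content_charset() or "latin-1"
            try:
                parts.append(payload.decode(charset, "replace"))
            except LookupError:
                # Unknown charset name: latin-1 can decode any byte.
                parts.append(payload.decode("latin-1", "replace"))
        return msg, "\n".join(parts)
    except Exception:
        # Fallback: headers only, body left as one raw string.
        msg = HeaderParser().parsestr(raw_text)
        return msg, msg.get_payload()
```

The second element is plain text that can be handed to a tokenizer
whichever path was taken.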

I downloaded your software and tried to use the tokenizer on my
stored mail messages to understand how it works. I can't make it
work, even after modifying it a little. I wrote a small script to
show the issue. If anyone is interested, I can send both the script
and a mail message on which it hangs.

As a side note, I have two more questions.

The current software, as downloaded from SF on Oct 29, seems to be
difficult to use on MacOS 9. I would be interested in having the
POP3 proxy version working. The other way of using such a filter
would be to have "plug-ins" to interact with the various mail
clients. I implemented this in my development and have three
plug-ins: for mail stored as files, for Eudora, and for Entourage.
They are not really nice but the idea is there.
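
One way the plug-in idea could be laid out, as a rough sketch only.
The names MailSource, FileSource and classify_all are hypothetical
and are not taken from the original scripts or from Spambayes. Only
the file-based plug-in is shown; Eudora and Entourage plug-ins would
be further subclasses.

```python
import os
from abc import ABC, abstractmethod


class MailSource(ABC):
    """Hypothetical plug-in interface: one subclass per mail client."""

    @abstractmethod
    def messages(self):
        """Yield (identifier, raw_message_text) pairs."""


class FileSource(MailSource):
    """Plug-in for mail stored as one message per file in a directory."""

    def __init__(self, directory):
        self.directory = directory

    def messages(self):
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if os.path.isfile(path):
                # latin-1 never fails to decode, whatever the bytes are.
                with open(path, "r", encoding="latin-1") as handle:
                    yield name, handle.read()


def classify_all(source, classify):
    """Apply a classify(text) callable to every message of a source."""
    return {ident: classify(text) for ident, text in source.messages()}
```

The filter itself only sees the classify callable, so it does not need
to know which mail client the messages came from.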

What about a multilingual situation? On average, I think the spam I
get is split like this: 80% is English, 12% is French, 5% is Spanish
and 3% is German. That is not counting Asian ones, which I easily
filter on encoding and strange characters. How would this technique
do in such a situation? I started to develop a language
discriminator in order to automatically sort by main language and
then use frequency databases for each language. I don't know if this
is needed.
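
A language discriminator of the kind mentioned above can be sketched
by counting very common words. This is an assumed, deliberately
simple technique, not the discriminator from the original scripts;
the word lists and the guess_language name are illustrative only.

```python
import re

# Small, hand-picked lists of very common words (illustrative only).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "you", "for", "with",
                "this", "that"},
    "french": {"le", "les", "et", "est", "pour", "vous", "une", "dans",
               "nous", "avec"},
    "spanish": {"el", "los", "y", "del", "para", "usted", "una", "con",
                "por", "pero"},
    "german": {"der", "das", "und", "ist", "nicht", "sie", "mit", "ein",
               "für", "auf"},
}


def guess_language(text):
    """Return the language whose common words occur most, or None."""
    # Runs of letters only; digits, underscores and punctuation split.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        language: sum(1 for word in words if word in common)
        for language, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The returned name could then select which per-language frequency
database a message is trained on or scored against.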


-- 
Email is a means of communication. People should ask themselves about
the political implications of their choices (or non-choices) of tools
and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html