[Spambayes] Recently discovered this work.
François Granger
francois.granger@free.fr
Fri Nov 1 17:47:10 2002
Config: MacOS 9.1 MacPython 2.2.2
I started developing on the idea of Bayesian filtering at the end of
August, after reading the article. It took me time (spare time) to
arrive at the point where I had a set of scripts with most of the
functionality needed to do it. A few days ago I discovered the work
you have done. I guess I can stop my development, because it can't
compare to yours.
Along the way, I ran into two issues.
The email package is fragile when decoding Eudora messages with
enclosures, whether I get them by OSA (similar to COM on Windows) or
by direct access to the mbox files. I went back to using the rfc822
package instead, because it was more robust, if less sophisticated. I
don't know whether this comes from Eudora not conforming to the
standards.
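
One way to keep the robustness of a header-only parser while still
using the email package when it copes is to try the full parse first
and fall back when the parser reports problems. The sketch below is
only an illustration under stated assumptions, not code from the
scripts described here or from Spambayes: it targets the current
standard library, where the rfc822 module no longer exists, so
email.parser.HeaderParser stands in for it, and parse_tolerantly is a
hypothetical name.

```python
import email
from email.parser import HeaderParser


def parse_tolerantly(raw_text):
    """Return (message, fully_parsed) for one raw message string."""
    try:
        msg = email.message_from_string(raw_text)
        # The parser records structural problems (for example a missing
        # MIME boundary) in .defects instead of raising.
        if not any(part.defects for part in msg.walk()):
            return msg, True
    except Exception:
        # Treat any parser exception the same way as a recorded defect.
        pass
    # Fallback: parse the headers only and keep the body as one raw string.
    return HeaderParser().parsestr(raw_text), False
```

In the fallback case the headers are still available (msg["Subject"]
and so on), and get_payload() returns the undecoded body, which is
roughly what a header-oriented parser would give.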
I downloaded your software and tried to use the tokenizer on my
stored mail messages to understand how it works. I couldn't make it
work, even after modifying it a little. I wrote a small script that
shows the issue. If anyone is interested, I can send both the script
and a mail message on which it hangs.
As a side note, I have two more questions.
The current software, as downloaded from SF on Oct 29, seems to be
difficult to use on MacOS 9. I would be interested in having the
POP3 proxy version working. The other way of using such a filter
would be to have plug-ins that interact with the various mail
clients. I implemented this in my development and have three
plug-ins: for mail stored as files, for Eudora and for Entourage.
They are not really nice, but the idea is there.
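
To make the plug-in idea concrete, here is a minimal sketch of what
such an interface could look like. Everything in it is an assumption
for illustration: MailSource, FileMailSource and classify_all are
hypothetical names, not part of Spambayes or of my scripts, and only
the "mail stored as files" case is shown, since the Eudora and
Entourage cases depend on client-specific scripting.

```python
import os
from abc import ABC, abstractmethod


class MailSource(ABC):
    """Hypothetical plug-in interface: one subclass per mail client."""

    @abstractmethod
    def messages(self):
        """Yield (message_id, raw_text) pairs."""


class FileMailSource(MailSource):
    """Plug-in for mail stored as one message per file in a directory."""

    def __init__(self, directory):
        self.directory = directory

    def messages(self):
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if os.path.isfile(path):
                # latin-1 maps every byte to a character, so odd bytes
                # in a message cannot make the read fail.
                with open(path, encoding="latin-1") as handle:
                    yield name, handle.read()


def classify_all(source, classify):
    """Apply a classify(raw_text) callable to every message of a source."""
    return {msg_id: classify(raw) for msg_id, raw in source.messages()}
```

The filter itself then only sees the messages() method, so adding a
new mail client means writing one more subclass.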
What about multilingual mail? On average, I think the spam I get
splits like this: 80% English, 12% French, 5% Spanish and 3% German,
not counting Asian spam, which I easily filter on encoding and
unusual characters. How would this technique perform in such a
situation? I started to develop a language discriminator in order to
automatically sort messages by main language and then use a separate
frequency database for each language. I don't know whether this is
needed.
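
As a rough illustration of the discriminator idea, the sketch below
guesses the main language by counting common function words and keeps
separate token counts per language. It is a toy under stated
assumptions, not my actual discriminator: the stop-word lists are
tiny hand-picked samples, and guess_language, words and
PerLanguageCounts are hypothetical names.

```python
import re
from collections import Counter, defaultdict

# Tiny hand-picked samples of common function words (an assumption;
# a real discriminator would need much larger lists or letter n-grams).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "you", "for", "with", "this", "that"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "vous", "pour", "dans", "un", "une"},
    "es": {"el", "la", "los", "y", "de", "es", "usted", "para", "con", "en", "un", "una"},
    "de": {"der", "die", "das", "und", "ist", "sie", "für", "mit", "nicht", "ein", "eine"},
}


def words(text):
    """Lower-cased runs of letters, accented letters included."""
    return re.findall(r"[^\W\d_]+", text.lower())


def guess_language(text, default="en"):
    """Pick the language whose stop words occur most often in text."""
    tokens = words(text)
    scores = {lang: sum(1 for w in tokens if w in stop)
              for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


class PerLanguageCounts:
    """One pair of spam/ham token counters per guessed language."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"spam": Counter(), "ham": Counter()})

    def train(self, text, is_spam):
        lang = guess_language(text)
        label = "spam" if is_spam else "ham"
        self.counts[lang][label].update(words(text))
        return lang
```

Whether the split helps is exactly the open question: a single
database may do just as well, since a French word simply becomes one
more token with its own spam probability.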
--
Email is a means of communication. People should ask themselves
about the political implications of the choices (or non-choices) they
make of tools and technologies.
For clean email: http://minilien.com/?IXZneLoID0 -
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html