[Spambayes] How does Spambayes deal with those "random" words in spam?

G. Armour Van Horn vanhorn at whidbey.com
Fri Nov 7 13:15:00 EST 2003


This was tested in the early days, and the conclusion was to ignore them. Words
that appear only once in the corpus are known as "hapax legomena" and are
casually referred to by the testers as "hapaxes". They have no effect at all,
because a hapax will have a score of .5, and the default is to ignore anything
between .4 and .6. The non-random words the spam includes will have scores that
are meaningful.
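
To make the ignore band concrete, here is a rough sketch in Python. It is not
the actual Spambayes code: the names NEUTRAL, MIN_STRENGTH and
significant_clues are made up for illustration, and the word scores are
invented. It only assumes what is described above, namely that a word with no
usable evidence scores .5 and that anything between .4 and .6 is dropped.

```python
NEUTRAL = 0.5        # score assumed for a word with no usable evidence
MIN_STRENGTH = 0.1   # scores closer than this to NEUTRAL (.4 to .6) are ignored


def significant_clues(words, scores):
    """Return sorted (word, score) pairs that fall outside the ignored band."""
    clues = []
    for word in set(words):
        # Words missing from the (hypothetical) score table get the neutral score.
        score = scores.get(word, NEUTRAL)
        if abs(score - NEUTRAL) >= MIN_STRENGTH:
            clues.append((word, score))
    return sorted(clues)


# Invented scores for three known words; the two random strings are unknown.
scores = {"viagra": 0.99, "python": 0.01, "the": 0.55}
message = ["viagra", "xqzvtk", "plorfdim", "the", "python"]
print(significant_clues(message, scores))
```

The random strings and the weak clue "the" drop out, so only "python" and
"viagra" would be left to contribute to the message's score.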

Leaving the hapaxes in the database doubtless increases its size, but there's
nothing that can be done about that easily. I could speculate that a mechanism to
delete hapaxes that have been in the database for some length of time would cure
this, but that would mean timestamping every entry in the database. You can't just
delete all hapaxes immediately, because every word is a hapax the first time it is
seen, so that would prevent any additional words from entering the database at all.
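
The speculative timestamp idea could look something like the following Python
sketch. Nothing like this exists in Spambayes as far as this post is concerned;
the function prune_stale_hapaxes, the in-memory dict, and the record layout
(spam count, ham count, first-seen timestamp) are all assumptions made up for
illustration.

```python
import time


def prune_stale_hapaxes(db, max_age_seconds, now=None):
    """Delete words seen exactly once whose entry is older than max_age_seconds.

    db is a hypothetical dict: word -> (spam_count, ham_count, first_seen),
    where first_seen is a Unix timestamp. Returns the sorted removed words.
    """
    if now is None:
        now = time.time()
    # Collect first, delete afterwards, so the dict is not changed mid-iteration.
    stale = sorted(
        word
        for word, (spam_count, ham_count, first_seen) in db.items()
        if spam_count + ham_count == 1 and now - first_seen > max_age_seconds
    )
    for word in stale:
        del db[word]
    return stale


DAY = 24 * 60 * 60
db = {
    "xqzvtk": (1, 0, 0),          # old hapax: pruned
    "plorfdim": (1, 0, 99 * DAY), # recent hapax: kept, it may still recur
    "viagra": (40, 0, 0),         # old but seen many times: kept
}
removed = prune_stale_hapaxes(db, max_age_seconds=30 * DAY, now=100 * DAY)
print(removed, sorted(db))
```

A new word survives until it is older than the cutoff, which gives it a chance
to be seen a second time. The cost is the extra timestamp stored with every
entry.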

Van

Parzival wrote:

> Lots of spam contains "random" words in the subject and visible or hidden in
> the message body. Since each spam contains different random strings, these
> words are very likely not to re-appear in subsequent spam. Does this reduce
> the effectiveness of the classification?
>
> A human seeing such a message with garbage words would immediately recognize it
> as spam. Could the classifier be extended to assign higher spam ratings to
> messages containing a large amount of "words which have never been seen"?
> Possibly a user could pre-seed the classifier with a dictionary of words in
> his/her language and/or jargon.
>
> -- Parzival





