Tim Peters tim.one at comcast.net
Fri Dec 12 14:48:42 EST 2003

[Robert Coe]
> What's a "hapax"?

Short for "hapax legomenon" <wink>.  See the glossary at


    hapax, hapax legomenon
    A word or form occurring only once in a document or corpus. (plural
    is hapax legomena).

It's a standard term in the literature.

> This is a word that appears in only one ham or spam (i.e. probability
> of 0.5), so we don't really know what to do with it; we need more info
> (i.e. it must appear in more email before saying this is a spam or ham
> word with a given probability of X).

Nope, we don't ignore any words that appear in the training database.  The
by-counting spamprob of a hapax is exactly 0 or exactly 1 (depending on
whether the hapax appeared in a ham or in a spam), but the Bayesian
adjustment drives the by-counting spamprob much closer to 0.5 because the
word has been seen so rarely.  It doesn't drive it *enough* toward 0.5 to
push it into the range of spamprobs we ignore, though.

For example, in the message I'm replying to right now, there was one hapax
(among the significant tokens == those with a spamprob outside 0.4-0.6):

    token             spamprob    #ham   #spam
    -----             --------    ----   -----
    'subject:Watch'   0.844828    0      1

So I've only trained on one msg with "Watch" in the Subject line, and that
happened to be spam.  Because it was seen only once, the by-counting
spamprob was reduced from 1.0 to about 0.84, and that actually left it as
the strongest spam clue in the message.  The overall spam score was
0.000529751, so it didn't have much effect.
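
For anyone who wants to check the arithmetic, here is a minimal sketch of
that adjustment.  It assumes a Robinson-style formula,
(s*x + n*p) / (s + n), with a prior strength s = 0.45 and a prior spamprob
x = 0.5; neither the formula nor those values are stated above, but
together they reproduce the 0.844828 in the table.  The function names are
made up for illustration, and the by-counting step is simplified by
assuming equal amounts of ham and spam training (which makes no difference
for a hapax).

```python
# Illustrative sketch only; s, x and the 0.1 cutoff are assumed values.

def by_counting_spamprob(ham_count, spam_count):
    """Raw spamprob from counts alone (assumes equal ham/spam training sizes)."""
    return spam_count / (ham_count + spam_count)


def adjusted_spamprob(ham_count, spam_count, s=0.45, x=0.5):
    """Pull the by-counting spamprob toward the prior x.

    n is how many times the token was seen in training; the smaller n is,
    the more weight the prior gets.
    """
    n = ham_count + spam_count
    p = by_counting_spamprob(ham_count, spam_count)
    return (s * x + n * p) / (s + n)


def is_significant(prob, min_strength=0.1):
    """True if prob lies outside the ignored 0.4-0.6 band."""
    return abs(prob - 0.5) >= min_strength


spam_hapax = adjusted_spamprob(ham_count=0, spam_count=1)
print(f"{spam_hapax:.6f}")         # 0.844828
print(is_significant(spam_hapax))  # True
```

A hapax seen only in ham comes out at about 0.155 under the same
assumptions, which is also outside the 0.4-0.6 band, so neither kind of
hapax gets ignored.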

