[Spambayes] full o' spaces

Skip Montanaro skip at pobox.com
Fri Mar 7 14:06:07 EST 2003


    Tim> Ya, I noticed that same thing yesterday.  Maybe an "excessive
    Tim> whitespace" clue, or "many single character words" clue, or
    Tim> something like that?

I tried the ratio of spaces to the total number of characters in the message
body, but that is inconclusive:

    >>> import shelve
    >>> db = shelve.open("../hammie.db", "r")
    >>> for k in db.keys():
    ...   if k.startswith("space ratio"):
    ...     print k, db[k]
    ... 
    space ratio: 0.0 (1240, 399)
    space ratio: 0.1 (3950, 6603)
    space ratio: 0.2 (1405, 4562)
    space ratio: 0.3 (289, 231)
    space ratio: 0.4 (85, 51)
    space ratio: 0.5 (15, 16)
    space ratio: 0.6 (2, 2)
    space ratio: 0.8 (3, 0)

(Maybe I should be ignoring whitespace at the beginning of lines?)
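
For anyone who wants to experiment, here is a minimal sketch of how a clue like this might be computed. It is not the actual Spambayes tokenizer code: the function names are hypothetical, and bucketing the ratio to one decimal place is an assumption based on the token names in the output above. The ignore_leading flag tries the idea of skipping whitespace at the beginning of lines.

```python
def space_ratio(body, ignore_leading=False):
    """Return the fraction of characters in body that are spaces.

    Hypothetical helper, not part of Spambayes.
    """
    if ignore_leading:
        # Drop indentation so leading spaces do not count toward the ratio.
        body = "\n".join(line.lstrip(" ") for line in body.splitlines())
    if not body:
        return 0.0
    return body.count(" ") / len(body)


def space_ratio_token(body, ignore_leading=False):
    """Build a token such as 'space ratio: 0.4'.

    Assumption: the ratio is bucketed to one decimal place.
    """
    return "space ratio: %.1f" % space_ratio(body, ignore_leading)
```

Calling space_ratio_token() on a message body gives one synthetic token per message, which could then be trained and scored like any other word.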

The diploma message has a space ratio of right around 0.5.  I haven't yet
looked at other messages with similar ratios to see what they look like.
Maybe the ratio of single-character words to the total number of words
would be better.
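
A rough sketch of that alternative, assuming "words" simply means whitespace-separated chunks of the body. The function name is hypothetical and nothing here comes from the Spambayes code:

```python
def single_char_word_ratio(body):
    """Return the fraction of whitespace-separated words that are one character long.

    Hypothetical helper; splitting on whitespace is an assumption.
    """
    words = body.split()
    if not words:
        return 0.0
    singles = sum(1 for word in words if len(word) == 1)
    return singles / len(words)
```

A spaced-out phrase like "G E T your D I P L O M A" splits into 11 words, 10 of them single characters, so the ratio comes out high, while ordinary prose should stay much lower.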

Skip


