[Spambayes] Tokenizing numbers and money

Rob Hooft rob@hooft.net
Tue Oct 15 14:58:29 2002


I just scanned through my 250k token list and found that a surprising 
number of these are numeric or almost numeric. Here is a random part of 
the list:

prob   nham nspam token
0.1552    1    0 3601.2
0.1552    1    0 3601.5
0.1552    1    0 3603.6
0.1552    1    0 3604.2
0.8448    0    1 3605
0.0918    2    0 3607
0.1552    1    0 3607.2
0.1552    1    0 3613
0.1552    1    0 3617
0.0918    2    0 3618
0.1552    1    0 3620.
0.8448    0    1 3621
0.1552    1    0 3624.2
0.1552    1    0 3626.5
0.1552    1    0 3627.7
0.1552    1    0 3629
0.1552    1    0 3631
[...]
0.9698    0    7 $65.00
0.8448    0    1 $369.00.
0.9698    0    7 $149.00,
0.9698    0    7 $800,000
0.8448    0    1 $30.00)
0.9587    0    5 $205.00
0.8448    0    1 $.19
0.8448    0    1 $24.00
0.9734    0    8 $800
0.9494    0    4 $37).
0.9587    0    5 $1.70
0.8448    0    1 $50,00
0.8448    0    1 $450.00.
0.9082    0    2 $1,000.00!
0.9494    0    4 $663.90
0.8448    0    1 $30...get
0.8448    0    1 $350,000
0.8448    0    1 $.275,
0.9651    0    6 $185.00
0.1552    1    0 $500,-
0.9651    0    6 $349.95.
0.8448    0    1 $2,000-
[...but also...]
0.9803    0   11 $30.00
0.9938    0   36 $319,210.00
0.9921    0   28 $25.00
0.9884    0   19 $100,000.00
0.9979    0  108 $5,000
0.9002   13  119 $500
0.9755    3  128 $50
0.9843    2  139 $25
[...and...]
0.9921    0   28 $25.00
0.8448    0    1 x5=$25.00.
0.9082    0    2 us$25.00
0.9878    0   18 5=$25.00.
0.9082    0    2 $25.00!
0.9941    0   38 $25.00.
0.9348    0    3 $25.00,

Does anyone believe that "3605" is a real spam clue, and "3607" a real 
ham clue? I think collapsing numbers into a few classes might 
significantly reduce the size of the database, and actually help the 
classification. Even though for someone doing fragrances "4711" may be a 
strong ham clue, I think that on the whole keeping every distinct number 
as its own token just adds noise.

How about something like tokens for

    num:float     (e.g. 3624.2)
    num:int       (e.g. 3629)
    num:intpair   (e.g. 439,443)
    num:$1        (for amounts between $0.00 and $9.99)
    num:$10       (for amounts between $10 and $99.99)
    num:$100      (for amounts between $100 and $999.99)
    num:$1000     (for amounts between $1k and $9999.99)
    num:$huge     (for amounts of $10k and up)

Each of these might also carry a "logarithm suffix" giving the order of 
magnitude, as the $ classes already do. Is this unrealistic? 
Currently roughly one in six tokens in my list contains at least 3 
digits in a row!

amigo[197]spambayes%% egrep -c ' .*[0-9][0-9][0-9]' balk.dat
44757
amigo[198]spambayes%% wc -l balk.dat
  255907 balk.dat
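
The token classes proposed above could be sketched in Python roughly as 
follows. This is only an illustration: the function name collapse_number 
and the regular expressions are hypothetical and not part of the 
Spambayes tokenizer. It assumes trailing punctuation should be stripped 
before classifying, that "," in an amount is a thousands separator, and 
that exactly $10,000 counts as "huge".

```python
import re

# Hypothetical patterns, chosen for illustration only.
_TRAILING = re.compile(r"[.,!)\-]+$")            # punctuation stuck to the end
_MONEY = re.compile(r"^(?:us)?\$(\d[\d,]*(?:\.\d+)?|\.\d+)$", re.IGNORECASE)
_INT = re.compile(r"^\d+$")
_FLOAT = re.compile(r"^\d*\.\d+$")
_INTPAIR = re.compile(r"^\d+,\d+$")


def collapse_number(token):
    """Map a numeric-looking token onto one of the proposed classes.

    Tokens that do not look numeric are returned unchanged.
    """
    core = _TRAILING.sub("", token)

    m = _MONEY.match(core)
    if m:
        # Assumption: commas are thousands separators, so "$50,00"
        # (probably a decimal comma) is misread as 5000.
        amount = float(m.group(1).replace(",", ""))
        if amount < 10:
            return "num:$1"
        if amount < 100:
            return "num:$10"
        if amount < 1000:
            return "num:$100"
        if amount < 10000:       # boundary at exactly $10k is an assumption
            return "num:$1000"
        return "num:$huge"

    if _INT.match(core):
        return "num:int"
    if _FLOAT.match(core):
        return "num:float"
    if _INTPAIR.match(core):
        return "num:intpair"
    return token


if __name__ == "__main__":
    for tok in ["3624.2", "3629", "439,443", "$25.00.", "$800,000", "x5=$25.00."]:
        print(tok, "->", collapse_number(tok))
```

With this sketch, "$25.00", "$25.00.", "$25.00!" and "us$25.00" all 
collapse to num:$10, while compound tokens such as "x5=$25.00." are left 
alone. Whether that actually helps classification would have to be tested.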


Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/