[Spambayes] Tokenizing numbers and money
Rob Hooft
rob@hooft.net
Tue Oct 15 14:58:29 2002
I just scanned through my 250k token list and found that a surprising
number of these are numeric or almost numeric. Here is a random part of
the list:
prob nham nspam token
0.1552 1 0 3601.2
0.1552 1 0 3601.5
0.1552 1 0 3603.6
0.1552 1 0 3604.2
0.8448 0 1 3605
0.0918 2 0 3607
0.1552 1 0 3607.2
0.1552 1 0 3613
0.1552 1 0 3617
0.0918 2 0 3618
0.1552 1 0 3620.
0.8448 0 1 3621
0.1552 1 0 3624.2
0.1552 1 0 3626.5
0.1552 1 0 3627.7
0.1552 1 0 3629
0.1552 1 0 3631
[...]
0.9698 0 7 $65.00
0.8448 0 1 $369.00.
0.9698 0 7 $149.00,
0.9698 0 7 $800,000
0.8448 0 1 $30.00)
0.9587 0 5 $205.00
0.8448 0 1 $.19
0.8448 0 1 $24.00
0.9734 0 8 $800
0.9494 0 4 $37).
0.9587 0 5 $1.70
0.8448 0 1 $50,00
0.8448 0 1 $450.00.
0.9082 0 2 $1,000.00!
0.9494 0 4 $663.90
0.8448 0 1 $30...get
0.8448 0 1 $350,000
0.8448 0 1 $.275,
0.9651 0 6 $185.00
0.1552 1 0 $500,-
0.9651 0 6 $349.95.
0.8448 0 1 $2,000-
[...but also...]
0.9803 0 11 $30.00
0.9938 0 36 $319,210.00
0.9921 0 28 $25.00
0.9884 0 19 $100,000.00
0.9979 0 108 $5,000
0.9002 13 119 $500
0.9755 3 128 $50
0.9843 2 139 $25
[...and...]
0.9921 0 28 $25.00
0.8448 0 1 x5=$25.00.
0.9082 0 2 us$25.00
0.9878 0 18 5=$25.00.
0.9082 0 2 $25.00!
0.9941 0 38 $25.00.
0.9348 0 3 $25.00,
Does anyone believe that "3605" is a real spam clue, and "3607" a real
ham clue? I think collapsing numbers into a few classes might
significantly reduce the size of the database, and actually help the
classification. Even though for someone doing fragrances "4711" may be a
strong ham clue, I think that on the whole this is just adding noise.
How about something like tokens for
num:float (e.g. 3624.2)
num:int (e.g. 3629)
num:intpair (e.g. 439,443)
num:$1 (for amounts between $0.00 and $9.99)
num:$10 (for amounts between $10 and $99.99)
num:$100 (for amounts between $100 and $999.99)
num:$1000 (for amounts between $1k and $10k)
num:$huge (for amounts >$10k)
Each of these might have "logarithm suffixes"? Is this unrealistic?
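To make the proposed classes concrete, here is a minimal Python sketch of how such a collapsing step might look. It is not part of the Spambayes tokenizer: the function name collapse_numeric, the regular expressions, and the decision to strip trailing punctuation such as "." or "!" before classifying are all assumptions made for illustration. Tokens that do not look cleanly numeric (e.g. "$30...get" or "$50,00") are passed through unchanged.

```python
import re

# Assumption: trailing punctuation like "$25.00!" or "$37)." is noise
# and can be stripped before the token is classified.
_TRAILING = ".,!)-"

# Dollar amount: optional integer part (plain digits, or digits with
# comma thousands separators) and optional decimal part.
_MONEY = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)?(\.\d+)?")
_FLOAT = re.compile(r"\d*\.\d+")
_INT = re.compile(r"\d+")
_INTPAIR = re.compile(r"\d+,\d+")


def collapse_numeric(token):
    """Map a numeric-looking token to a class token (hypothetical helper).

    Tokens that do not match any class are returned unchanged.
    """
    core = token.rstrip(_TRAILING)
    m = _MONEY.fullmatch(core)
    if m and (m.group(1) or m.group(2)):
        whole = (m.group(1) or "0").replace(",", "")
        amount = float(whole + (m.group(2) or ""))
        if amount < 10:
            return "num:$1"
        if amount < 100:
            return "num:$10"
        if amount < 1000:
            return "num:$100"
        if amount <= 10000:
            return "num:$1000"
        return "num:$huge"
    if _FLOAT.fullmatch(core):
        return "num:float"
    if _INT.fullmatch(core):
        return "num:int"
    if _INTPAIR.fullmatch(core):
        return "num:intpair"
    return token


for tok in ["3624.2", "3629", "439,443", "$.19", "$25.00!", "$5,000"]:
    print(tok, "->", collapse_numeric(tok))
```

With this sketch "$25.00", "$25.00!", "$25.00." and "$25.00," all end up as the single token num:$10, while "us$25.00" and "x5=$25.00." stay as they are.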
Currently roughly one in six tokens in my list contains at least 3
digits in a row!
amigo[197]spambayes% egrep -c ' .*[0-9][0-9][0-9]' balk.dat
44757
amigo[198]spambayes% wc -l balk.dat
255907 balk.dat
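For anyone without egrep at hand, the same count can be sketched in Python. The function name count_digit_lines is made up for this example, and the regular expression simply mirrors the egrep pattern above: a space, followed somewhere later on the line by three digits in a row. Note that the pattern looks at everything after the first space, so a line would also match on a count column of 100 or more, if the file has such columns.

```python
import re

# Same pattern as the egrep call: a space, then later three digits in a row.
_THREE_DIGITS = re.compile(r" .*[0-9][0-9][0-9]")


def count_digit_lines(lines):
    """Return (matching, total) for an iterable of text lines."""
    matching = total = 0
    for line in lines:
        total += 1
        if _THREE_DIGITS.search(line):
            matching += 1
    return matching, total


# Usage on a file would look like:
#     with open("balk.dat") as f:
#         matching, total = count_digit_lines(f)

# The two numbers reported above give the fraction directly.
fraction = 44757 / 255907
print(f"{fraction:.3f}")
```

44757 out of 255907 comes to about 17.5%, so slightly more than one in six.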
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/