[Spambayes] Tokenizing numbers and money
Rob Hooft
rob@hooft.net
Wed Oct 16 20:40:45 2002
Tim Peters wrote:
> You can try it, although it fights the "stupid beats smart" meta-rule. It's
> easy to think of examples in the other direction too. For example, I get an
> electronic order receipt with an order number, and a few days later get a
> shipping confirmation referencing the same number. If I trained on the
> order receipt between times, that "senseless number" is certainly going to
> help the shipping confirmation score low.
>
>
>>How about something like tokens for
>>
>> num:float (e.g. 3624.2)
>> num:int (e.g. 3629)
>> num:intpair (e.g. 439,443)
>> num:$1 (for amounts between $0.00 and $9.99)
>> num:$10 (for amounts between $10 and $99.99)
>> num:$100 (for amounts between $100 and $999.99)
>> num:$1000 (for amounts between $1k and $10k)
>> num:$huge (for amounts >$10k)
>>
>>Each of these might have "logarithm suffixes"? Is this unrealistic?
>
>
> It's realistic to try it, but more expensive than the tokenization we do now
> (we do nothing at all for "words" of under 13 chars now except determine
> their length; the split-on-whitespace business goes at C speed).
More expensive, but I haven't noticed any slowdown yet. First results: it
doesn't make a difference.
cv5: original code
cv8: with "num:XXX" tokens for simple numerics
amigo[109]spambayes% grep -A1 'all runs' cv5.txt
-> <stat> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> <stat> min -1.22125e-13; median 1.3603e-11; max 100
--
-> <stat> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> <stat> min 6.85483e-09; median 100; max 100
amigo[110]spambayes% grep -A1 'all runs' cv8.txt
-> <stat> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
-> <stat> min -1.44329e-13; median 2.66842e-11; max 100
--
-> <stat> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
-> <stat> min 7.69111e-09; median 100; max 100
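For readers who want to try something similar, here is a minimal sketch of how "num:XXX" tokens could be generated. It is not the code used for the cv8 run, which the post does not show. The regular expressions, the function names numeric_token and tokenize, the reading of the numeric suffix as the word's character length, and the meanings of "fracmoney" (amount with cents) and "exclmoney" (amount followed by "!") are all assumptions guessed from the token names.

```python
import re

# Hypothetical patterns: the post does not show the real ones, and the
# meanings of "fracmoney" and "exclmoney" are guessed from their names.
# Order matters: more specific kinds are tried before more general ones.
_PATTERNS = [
    ("intpair",   re.compile(r"\d+,\d+")),                     # 439,443
    ("exclmoney", re.compile(r"\$\d[\d,]*(\.\d+)?!+")),        # $1000!
    ("fracmoney", re.compile(r"\$\d[\d,]*\.\d+")),             # $19.95
    ("money",     re.compile(r"\$\d[\d,]*")),                  # $1,000
    ("expfloat",  re.compile(r"[+-]?\d*\.?\d+[eE][+-]?\d+")),  # 1.5e-3
    ("signfloat", re.compile(r"[+-]\d*\.\d+")),                # -0.25
    ("float",     re.compile(r"\d*\.\d+")),                    # 3624.2
    ("signint",   re.compile(r"[+-]\d+")),                     # -42
    ("int",       re.compile(r"\d+")),                         # 3629
]


def numeric_token(word):
    """Return a 'num:<kind><length>' token for a numeric word, else None.

    Assumption: the suffix is the character length of the whole word,
    and 'intpair' carries no suffix.
    """
    for kind, pattern in _PATTERNS:
        if pattern.fullmatch(word):
            if kind == "intpair":
                return "num:intpair"
            return "num:%s%d" % (kind, len(word))
    return None


def tokenize(text):
    """Split on whitespace, replacing numeric words by their class token."""
    for word in text.split():
        yield numeric_token(word) or word
```

With these assumed patterns, list(tokenize("pay $19.95 for 3629 items")) gives ['pay', 'num:fracmoney6', 'for', 'num:int4', 'items']. Collapsing every number of a given shape and length into one token is what reduces the token count, at the cost of losing exact values such as a repeated order number.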
cv8 now has the following tokens:
prob nham nspam token
0.0082 27 0 num:float8
0.0088 25 0 num:signfloat6
0.0122 18 0 num:signfloat5
0.0137 16 0 num:signfloat4
0.0138 657 9 num:signint3
0.0167 13 0 num:signfloat7
0.0197 11 0 num:int12
0.0266 8 0 num:float10
0.0266 8 0 num:signfloat8
0.0302 7 0 num:signfloat9
0.0302 7 0 num:signint6
0.0413 5 0 num:signfloat10
0.0506 4 0 num:signfloat11
0.0868 265 25 num:signint5
0.0911 12 1 num:float9
0.1539 111 20 num:int10
0.1552 1 0 num:expfloat10
0.1552 1 0 num:float12
0.1552 1 0 num:signint9
0.1566 71 13 num:float7
0.1654 11 2 num:signint4
0.2085 255 67 num:int7
0.2248 4 1 num:float11
0.2656 64 23 num:int9
0.2935 164 68 num:float5
0.3138 431 197 num:float3
0.3196 194 91 num:float4
0.3550 151 83 num:int6
0.3596 1900 1067 num:int4
0.4041 65 44 num:int8
0.4255 1218 902 num:int3
0.4369 687 533 num:int5
0.4399 65 51 num:float6
0.4471 5 4 num:signint11
0.7432 4 12 num:int11
0.7752 1 4 num:signint8
0.8133 127 554 num:intpair
0.8356 1 6 num:signint10
0.9082 0 2 num:money12
0.9180 9 103 num:money5
0.9383 24 368 num:money4
0.9587 0 5 num:exclmoney12
0.9587 0 5 num:money9
0.9599 2 53 num:fracmoney9
0.9700 13 428 num:money3
0.9730 1 44 num:money10
0.9734 0 8 num:fracmoney4
0.9762 0 9 num:exclmoney4
0.9785 0 10 num:exclmoney11
0.9785 0 10 num:exclmoney9
0.9788 4 195 num:money8
0.9794 1 58 num:exclmoney8
0.9796 4 203 num:money6
0.9833 0 13 num:money11
0.9863 0 16 num:exclmoney5
0.9900 4 417 num:fracmoney6
0.9904 0 23 num:exclmoney10
0.9912 2 249 num:fracmoney7
0.9920 2 274 num:fracmoney5
0.9933 0 33 num:fracmoney8
0.9937 0 35 num:exclmoney6
0.9954 1 262 num:money7
0.9956 0 51 num:exclmoney7
0.9974 0 86 num:fracmoney11
0.9975 0 89 num:fracmoney10
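The prob column in the table above can be reproduced from the nham and nspam columns with a Robinson-style adjusted probability. The sketch below is a reconstruction, not code from the post: the constants s = 0.45 and x = 0.5 are assumptions chosen because they match the listed values, and the raw estimate nspam / (nham + nspam) only holds if ham and spam were trained in equal numbers, which the post does not state.

```python
# Assumed constants, chosen because they reproduce the table; the post
# does not state them.
S = 0.45  # weight given to the prior
X = 0.5   # prior spam probability for a token with no evidence


def adjusted_prob(nham, nspam, s=S, x=X):
    """Blend the raw spam fraction of a token with the prior x.

    Assumption: equal numbers of ham and spam messages were trained,
    so the raw estimate is simply nspam / (nham + nspam).
    """
    n = nham + nspam
    p = nspam / n if n else x
    return (s * x + n * p) / (s + n)


# A few rows copied from the table: (token, nham, nspam).
rows = [
    ("num:float8", 27, 0),
    ("num:signint3", 657, 9),
    ("num:intpair", 127, 554),
    ("num:money7", 1, 262),
]
for token, nham, nspam in rows:
    print("%.4f %d %d %s" % (adjusted_prob(nham, nspam), nham, nspam, token))
# prints:
# 0.0082 27 0 num:float8
# 0.0138 657 9 num:signint3
# 0.8133 127 554 num:intpair
# 0.9954 1 262 num:money7
```

Under these assumptions, a token seen only once in ham (such as num:expfloat10) ends up at 0.1552 rather than 0: the prior keeps rare tokens from being treated as strong evidence.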
Dead end? Or is the reduction in the number of tokens significant?
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/