[spambayes-dev] musings on latest enhancement
Tim Peters
tim.one at comcast.net
Tue Jun 17 12:33:57 EDT 2003
[bill parducci]
> i was browsing through the notes on the latest updates in CVS and came
> across this, which gave me pause:
>
> 'Nonsense' HTML tags are stripped rather than replaced with a space
> (e.g. Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion becomes
> "Wrinkle" and "Reduction" rather than "Wr", "inkle", "Reduc" and
> "tion").
>
> does this mean that <stuff> <like> <this> will be ignored?
Yes, by spambayes.
> i wonder if it wouldn't be of value to treat the 'nonsense tags' as
> tokens (e.g. append the list of tokens to the end of the text being
> scored) in addition to 'removing' them?
You can try it. I doubt it will help; it will certainly bloat database size
due to creating more hapaxes in the presence of this junk.
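
For anyone who wants to try it, here is a rough sketch of the three behaviours discussed above: stripping the junk, replacing it with a space, and additionally emitting each junk tag as a token. It is not the spambayes tokenizer. The regex, the function names, and the "nonsense-tag:" prefix are all made up for illustration, and "hapax" is taken here to mean a token seen in only one message. The two sample messages assume the junk differs from message to message.

```python
import re
from collections import Counter

# Hypothetical pattern for 'nonsense' tags: "<!" up to the next ">".
# This is an assumption for illustration, not the regex spambayes uses.
NONSENSE_TAG = re.compile(r"<![^>]*>")

def tokenize(text, replacement=""):
    """Remove nonsense tags, substituting `replacement`, then split on whitespace."""
    return NONSENSE_TAG.sub(replacement, text).split()

def tokenize_with_tag_tokens(text):
    """Strip the tags, and also append each tag's raw text as an extra token."""
    tags = NONSENSE_TAG.findall(text)
    return tokenize(text) + ["nonsense-tag:" + tag for tag in tags]

def count_hapaxes(messages, tokenizer):
    """Count tokens that occur in exactly one message."""
    seen_in = Counter()
    for msg in messages:
        # set() so a token repeated within one message is counted once
        seen_in.update(set(tokenizer(msg)))
    return sum(1 for n in seen_in.values() if n == 1)

sample = "Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion"
stripped = tokenize(sample)        # tags stripped
spaced = tokenize(sample, " ")     # tags replaced with a space

# Two made-up messages whose junk tags differ from message to message.
messages = [
    "Wr<!$AAA >inkle Reduc<!$BBB >tion",
    "Wr<!$CCC >inkle Reduc<!$DDD >tion",
]
plain_hapaxes = count_hapaxes(messages, tokenize)
tagged_hapaxes = count_hapaxes(messages, tokenize_with_tag_tokens)
```

With stripping, both messages yield the same two words, so neither is a hapax. With the tag tokens added, every distinct junk tag becomes a token seen in only one message, which is the database growth in question.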