[Spambayes] idea for tokenizer.crack_filename change

Skip Montanaro skip@pobox.com
Sat, 21 Sep 2002 09:28:00 -0500


    >> It seems to me that base64-encoded, all DOS/Windows executables start
    >> with (reciting from memory, since I've deleted all viruses and
    >> haven't received any new ones in the last 15 minutes or so) "TPqAAA"
    >> or something similar.  Why rely on finding specific file extensions?
    >> They can just change.

    Tim> Well, not often, and the scheme we're working on is supposed to be able to
    Tim> learn when they do <wink>.  Would you like to write some code to tokenize
    Tim> this particular bit of Windows Lore?

I gave it a try, but I'm still suffering with fp/fn rates around 15%, so
anything I see is suspect.  Also, I saw no change in those rates.  It's quite possible I
have a bug, but I've also cleaned out obvious viruses from my corpora.  True
spam may have enough indicators elsewhere that this scheme won't help.
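
For reference, here is a minimal sketch of the kind of tokenizing discussed
above.  It is not the change being proposed here.  It assumes the attachment
payload is available as a base64 string, and the function name
tokenize_exe_header and the token text "exe-header:MZ" are hypothetical.
DOS/Windows executables begin with the two bytes "MZ", which base64-encode to
a string starting with "TV", so the sketch decodes the first four base64
characters instead of matching a literal prefix recited from memory.

```python
import base64
import binascii

# First two bytes of a DOS/Windows executable (the "MZ" magic number).
MZ_MAGIC = b"MZ"


def tokenize_exe_header(payload_b64):
    """Yield one token if a base64 payload decodes to an MZ header.

    Hypothetical helper for illustration; not part of the Spambayes tokenizer.
    """
    # Drop line breaks and other whitespace, then keep the first four base64
    # characters, which decode to at most the first three bytes of the payload.
    head = "".join(payload_b64.split())[:4]
    if len(head) < 4:
        return
    try:
        raw = base64.b64decode(head, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64, so there is nothing to tokenize.
        return
    if raw.startswith(MZ_MAGIC):
        yield "exe-header:MZ"


if __name__ == "__main__":
    sample = base64.b64encode(b"MZ\x90\x00\x03\x00\x00\x00").decode("ascii")
    print(sample[:4], list(tokenize_exe_header(sample)))
```

Decoding the header rather than comparing strings means the check does not
depend on which bytes follow "MZ".  Whether such a token helps at all is a
separate question that only test runs on clean corpora can answer.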

Should I just go ahead and check in my change (it is controlled by a couple
of new options, and is not enabled by default) and let y'all point out my bugs?

Skip