[Spambayes] idea for tokenizer.crack_filename change

Tim Peters tim.one@comcast.net
Fri, 20 Sep 2002 00:30:53 -0400


[Skip Montanaro]
> It seems to me that, once base64-encoded, all DOS/Windows executables start
> with (reciting from memory, since I've deleted all viruses and haven't
> received any new ones in the last 15 minutes or so) "TPqAAA" or
> something similar.  Why rely on finding specific file extensions?  They
> can just change.

Well, not often, and the scheme we're working on is supposed to be able to
learn when they do <wink>.  Would you like to write some code to tokenize
this particular bit of Windows Lore?  We currently ignore 100% of the
*content* of MIME sections that don't have text/* type, although the MIME
metadata is tokenized for all MIME sections via

        # Content-{Type, Disposition} and their params, and charsets.
        for x in msg.walk():
            for w in crack_content_xyz(x):
                yield w

in tokenize_headers().
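
For anyone who wants to take a crack at it, here is a minimal sketch of one
way to do it. It is not Spambayes code. The function name
crack_exe_signature and the token string "exe-signature:MZ" are made up for
illustration, and it assumes the modern Python 3 stdlib email package.
Rather than matching the base64 text itself, it decodes each non-text part
and checks for the two-byte "MZ" magic that opens DOS/Windows executables
(base64 text for such a payload begins with "TV").

```python
import email


def crack_exe_signature(msg):
    """Yield a token for each non-text MIME part that looks like a DOS/Windows executable.

    Hypothetical helper: the name and the token string are illustrative only.
    """
    for part in msg.walk():
        # Multipart containers have no payload of their own, and text/*
        # parts are assumed to be tokenized elsewhere.
        if part.is_multipart() or part.get_content_maintype() == "text":
            continue
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64)
        # and returns bytes.
        payload = part.get_payload(decode=True)
        if payload and payload[:2] == b"MZ":
            yield "exe-signature:MZ"


# A tiny hand-built message: one text part, one base64 part whose decoded
# bytes start with "MZ".
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "TVqQAAMA\n"
    "--XX--\n"
)
tokens = list(crack_exe_signature(email.message_from_string(raw)))
```

Checking the decoded bytes rather than the encoded text means the token
doesn't depend on how the sender wrapped or padded the base64, and, like the
file-extension tokens, it is only a clue for the classifier to learn from,
not a verdict.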