[Spambayes] Table munging defeats SpamBayes

Tim Peters tim.one at comcast.net
Sun Dec 7 15:56:34 EST 2003


[Bill Yerazunis]
>> It's not python-based (i.e. you have to shell out to it) but
>>
>>     lynx -dump -stdin
>>
>> will render from standard input to standard output (essentially HTML
>> --> TXT) and then gracefully exit.
>>
>> It's what I use.  :)
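
For anyone wanting to try the shell-out approach from Python, here is a
minimal sketch, assuming a current Python with the subprocess module and a
lynx binary on the PATH.  The helper name render_html and its cmd and
timeout parameters are made up for illustration; only the lynx command line
itself comes from the message above.

```python
import subprocess

# The command quoted above; any renderer that reads HTML on stdin and
# writes text on stdout could be substituted here.
LYNX_CMD = ["lynx", "-dump", "-stdin"]


def render_html(html, cmd=LYNX_CMD, timeout=30):
    """Pipe HTML to an external renderer and return its stdout as text.

    Hypothetical helper: raises CalledProcessError if the renderer exits
    with a nonzero status, and TimeoutExpired if it hangs.
    """
    result = subprocess.run(
        cmd,
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Making the command a parameter keeps the renderer swappable, which matters
here since no single renderer matches what the spammer's target sees.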

[Mathew Hendry]
> It's a nice idea but lynx doesn't seem to render these tables
> properly. IE, Mozilla and Outlook render it "as intended".
>
> Here's what lynx spits out:
>
> C:\>lynx -stdin -dump < spam.html
>
>    Al
>
>    Co
>    No
>    l
>
> [additional gibberish deleted]

Rendering spam HTML is extraordinarily difficult, especially because spam is a
percentage game and, e.g., spammers don't really care whether it renders as
intended under lynx, or probably even under Mozilla anymore.  If they can
exploit IE/Outlook bugs, they hit the bulk of their potential buyers.  Not
that they *know* they're exploiting bugs -- just like most HTML coders, "if
it shows up right in IE, it must be OK" is all they consider.  Since the
source code for Microsoft's HTML renderers is secret, and HTML has grown so
sprawlingly complex, it's extraordinarily difficult to mimic what IE does in
all cases (and overlooking that "IE" is ambiguous -- there are many versions
of IE with many distinct bugs).

I haven't yet seen enough spam exploiting table tricks to worry about it.
The "white on white" (foreground color close to background color)
text-hiding trick is still a much more common gimmick.  It's not often
effective against this kind of classifier, though, since the spammer can't
cheaply guess words that are hammy to you.  Cases where they stumble into
hammy words by accident still get discussed here as if they were miracles of
directed marketing <0.9 wink>.
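
The "foreground color close to background color" test can be sketched in a
few lines.  This is an illustration only, not what SpamBayes does: the
function names, the restriction to hex colors, and the threshold of 48 per
channel are all assumptions, and real spam also uses named colors, CSS, and
nested elements that this ignores.

```python
import re

# Accepts "#rgb", "#rrggbb", and the same without the leading "#".
_HEX_COLOR = re.compile(r"#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})")


def parse_hex_color(value):
    """Return an (r, g, b) tuple for a hex color string, else None."""
    match = _HEX_COLOR.fullmatch(value.strip())
    if match is None:
        return None
    digits = match.group(1)
    if len(digits) == 3:
        # Expand shorthand: "fa0" -> "ffaa00".
        digits = "".join(ch * 2 for ch in digits)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def is_hidden_text(fg, bg, threshold=48):
    """True if every channel of fg is within `threshold` of bg.

    The threshold is an arbitrary illustrative value.  Colors that
    cannot be parsed are reported as not hidden.
    """
    fg_rgb = parse_hex_color(fg)
    bg_rgb = parse_hex_color(bg)
    if fg_rgb is None or bg_rgb is None:
        return False
    return all(abs(f - b) <= threshold for f, b in zip(fg_rgb, bg_rgb))
```

A tokenizer could treat a positive result as one more clue rather than as
grounds for discarding the hidden text.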

> I've attached the table I'm testing with as a FYI, although I'm not
> sure if the mailing list will accept it.

It didn't accept it, but I'm not sure why.  The list options were set to
discard MIME attachments of type text/html, but that's all.  I've disabled
that, since I don't share the hatred of HTML most list admins seem to
suffer.

BTW, Mailman has an option to convert HTML to plain text (which was also
enabled, and which is why you don't see any HTML msgs in the spambayes
archive).  I don't know how it does it, or how faithful a translation it
produces, or how robust it is against intentional extreme obfuscation (only
spammers do that), or ... but it is coded in Python <wink>.
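
To give a feel for what a pure-Python HTML-to-text conversion involves, here
is a minimal sketch using the standard library's html.parser.  It is not
Mailman's method (which isn't described here); TextExtractor and
html_to_text are hypothetical names, and the sketch does no layout at all,
so it makes no attempt at rendering tables "as intended".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Because tags are simply dropped, text split across adjacent table cells is
glued back together in source order, which may or may not match what a
browser displays.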



