[Spambayes] Proposing to drop ignore_redundant_html

Tim Peters tim.one@comcast.net
Sat Oct 26 03:29:34 2002


Proposing to drop the option

    ignore_redundant_html

This has been False by default for a long time, and there are no known
clients.  I used it early in the project, before we stripped HTML tags;
without it (at the time) there was no way to get any multipart/alternative
msg with a text/html part to score as ham in the c.l.py tests.

Since then,

A. We strip HTML tags by default (and &nbsp; character entities --
   that's a change I made recently I probably didn't announce here,
   although I mentioned it often enough <wink>).

B. We know that sometimes multipart/alternative msgs have different
   content in the text/plain and text/html parts, and in particular
   that some spam can be identified only by staring at the HTML part.

C. We no longer count multiple instances of a word in a msg multiple
   times during training.  So if text/html and text/plain parts are
   in fact redundant, training isn't affected by seeing the content
   twice.  It used to be.
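
Point C can be illustrated with a minimal sketch.  This is not the project's
actual training code: train() and the token lists are hypothetical names made
up for the example, and it assumes only that each distinct word is counted
once per msg.

```python
from collections import Counter

def train(word_counts, tokens):
    """Hypothetical helper: count each distinct token once per msg."""
    for word in set(tokens):          # set() collapses repeated words
        word_counts[word] += 1

plain_part = ["cheap", "pills", "now"]
html_part = ["cheap", "pills", "now"]   # redundant copy of the text/plain part

# Train on the text/plain part alone.
counts_plain_only = Counter()
train(counts_plain_only, plain_part)

# Train on both parts of the same msg.
counts_both_parts = Counter()
train(counts_both_parts, plain_part + html_part)

# The redundant text/html part changes nothing.
assert counts_plain_only == counts_both_parts
```

If the text/html part held words the text/plain part lacked (point B), those
words would be added to the counts, which is the reason to keep looking at it.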

IOW, ignore_redundant_html has nothing going for it anymore.