[Spambayes] defaults vs. chi-square

Tim Peters tim.one@comcast.net
Tue Oct 29 03:58:55 2002


[T. Alexander Popiel [mailto:popiel@wolfskeep.com]
 Sent: Monday, October 14, 2002 6:09 PM]

> It appears to be a systematic error when a mailing list manager
> appends plain text to what should be a base64 encoded segment.
> Bad MLM, no biscuit.  This confuses the MIME decoder. Bad MIME
> decoder, too!

This is ironic <wink>:  it turns out that the cause is that the MIME decoder
was *too* forgiving, in a twisted relevant sense:

> As a sample:
>
> """
> ...
> Content-Type: text/plain
> Content-Transfer-Encoding: base64
> ...
>
> DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu
> ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg
> YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu
> IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp
> 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u
> ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl
> cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv
> bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp
> bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh
> a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l
> eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k
> aS5jb20NCg0KDQoNCg0K
>
>
> --
> To UNSUBSCRIBE, email to debian-java-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact
> listmaster@lists.debian.org
> """

I tried like hell to provoke this problem with base64 msgs, and couldn't.
It turns out that the final "real base64" line was the key:

> aS5jb20NCg0KDQoNCg0K

Because this section didn't happen to need any '=' padding, the base64
decoder didn't know that it was over, and went on to take the entire
remainder of the text as if it were base64 too.  Until it sees a string of
'=' marks, it will accept darned near everything, and simply ignore
characters that don't make sense for base64.  In the end, it raises an error
because treating the remainder of the msg as pseudo-base64 too leads to an
improperly padded base64 string.
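
A small sketch of that failure mode, using the standard library's lenient decoder on a made-up body (the sample text and the `lenient_decode` helper are illustrative assumptions, not the tokenizer's code). The base64 line needs no '=' padding, so the footer's letters get swallowed as more data and the padding check fails at the end:

```python
import base64
import binascii

# Hypothetical message body: one unpadded base64 line, then a plain-text footer.
body = (
    "aGVsbG8gd29ybGQh\n"        # base64 for b"hello world!", no '=' needed
    "--\n"
    "To UNSUBSCRIBE, email\n"   # footer appended by the list manager
)

def lenient_decode(text):
    """Return the decoded bytes, or None if the lenient decoder gives up."""
    try:
        # validate=False (the default): characters outside the base64
        # alphabet are discarded, everything else is treated as data.
        return base64.b64decode(text)
    except binascii.Error:
        return None

print(lenient_decode("aGVsbG8gd29ybGQh\n"))  # the clean line decodes fine
print(lenient_decode(body))                  # footer letters break the padding
```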

I believe I've fixed this now, by falling back to a stricter(!) approach
when the builtin approach fails.
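
For illustration only, one way such a stricter fallback could look. This is a sketch under my own assumptions, not the code that was checked in: `strict_b64_split` and its regexp are hypothetical names. It accepts only leading lines that are well-formed base64, stops at the first line that isn't (or right after a padded line), and hands back whatever is left as plain text:

```python
import base64
import re

# A well-formed base64 line: whole quads, optionally ending in one padded quad.
_B64_LINE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

def strict_b64_split(text):
    """Decode the leading run of strict base64 lines.

    Returns (decoded_bytes, leftover_text).
    """
    lines = text.splitlines(keepends=True)
    encoded = []
    n = 0
    while n < len(lines):
        s = lines[n].strip()
        if not s or not _B64_LINE.fullmatch(s):
            break                  # first non-base64 line ends the section
        encoded.append(s)
        n += 1
        if s.endswith("="):
            break                  # padding marks the end of the data
    decoded = base64.b64decode("".join(encoded), validate=True)
    return decoded, "".join(lines[n:])

decoded, rest = strict_b64_split("aGVsbG8gd29ybGQh\n\n--\nTo UNSUBSCRIBE, email\n")
print(decoded)   # the base64 part
print(rest)      # the plain-text footer survives
```

Note the sketch can still be fooled by a footer line that happens to look like base64 directly after the data; stopping at the blank line is what saves it in the sample above.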

In cases where the base64 section is terminated by a string of '=', the
builtin approach doesn't fail, and in those cases we lose the plain text
part.  If it falls back to the stricter approach, we don't lose the plain
text part.  Perhaps I should lose the plain text part in this case too?

BTW, looks like your example was foreign-language MP3 spam.  It scores like
so for me:

0.99970963814
'*H*' 0.000577077329346
'*S*' 0.99999635361
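
As a quick sanity check on those three numbers: they are consistent with the final score being (S - H + 1) / 2, where I'm assuming '*H*' is the ham evidence and '*S*' the spam evidence. The `combine` helper below is just an illustrative name for that arithmetic:

```python
def combine(h, s):
    """Fold ham evidence h and spam evidence s into one score in [0, 1]."""
    return (s - h + 1.0) / 2.0

score = combine(0.000577077329346, 0.99999635361)
print(score)  # compare against the 0.99970963814 reported above
```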