[Spambayes] Heads up! Tokenizer changes

Tue May 20 22:42:55 EDT 2003

[Meyer, Tony]
> I think we really do need to have some sort of solution for these.  Is
> there someone that knows enough about how messages can be
> malformed/the errors that the email package throws that can put this
> together?

Nope, not even the author.  Email messages have a defined syntax, and the
number of ways to violate the rules is essentially unbounded.  Where Barry
(and other contributors) knew of a sensible way to proceed in the face of
errors, the email pkg already tries to do so under strict=False parsing.

Our needs are specific to what we do.  In the patch I talked about, the
insane MIME structure could very well be a showstopper for many
applications.  In our app, though, we don't really give a rat's ass about
the original MIME, we only want to suck out the words.  If what was intended
to be a plain-text part and an HTML part get smushed together, we really
don't care.  So it's appropriate for us to catch the exception and rework
the message a bit so that the email pkg can tolerate it.

Overall, that's more a matter of seeing what breaks than of prior analysis;
bad structure seems rare, even in spam.  I count about 16 places where
message_from_string() is called now.  One of them is mboxutils.get_message(),
which wraps the call with basic protections that used to live in
tokenizer.py.  That's probably the best version to build on.  It catches
MessageParseError, which is the
base class for all parsing complaints; the BoundaryError subclass caught by
the most recent patch is a special case it would be good to catch there too,
before falling back to a more drastic hack.
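
A rough sketch of that layering, not the actual mboxutils code: try the
parse, handle BoundaryError with a mild rework first, and only then fall
back to something drastic.  The helper names, the "relabel it text/plain"
rework, and the "drop the headers" fallback are all assumptions made for
illustration.  The parse argument is also hypothetical; it's there so that a
stricter parser can be swapped in, since recent versions of the email
package tend to record defects instead of raising.

```python
import email
import re
from email.errors import BoundaryError, MessageParseError
from email.message import Message

# Hypothetical helper pattern: a Content-Type header plus any folded
# continuation lines that follow it.
_CONTENT_TYPE = re.compile(r"^content-type:.*(?:\n[ \t].*)*",
                           re.IGNORECASE | re.MULTILINE)


def _split_headers(text):
    """Split raw text at the first blank line into (headers, body)."""
    text = text.replace("\r\n", "\n")
    head, sep, body = text.partition("\n\n")
    if not sep:
        return "", text  # no blank line at all: call everything body
    return head, body


def get_message(text, parse=email.message_from_string):
    """Return a Message for text, however badly it is structured."""
    try:
        return parse(text)
    except BoundaryError:
        # Mild rework (an assumption): relabel the message as text/plain
        # so the parts get smushed together but the headers survive.
        head, body = _split_headers(text)
        head = _CONTENT_TYPE.sub("Content-Type: text/plain", head)
        try:
            return parse(head + "\n\n" + body)
        except MessageParseError:
            pass  # fall through to the drastic hack
    except MessageParseError:
        pass
    # Drastic hack (also an assumption): the headers can't be trusted,
    # so drop them and wrap the remaining text in a bare Message.
    _, body = _split_headers(text)
    msg = Message()
    msg.set_payload(body)
    return msg
```

The BoundaryError clause has to come before the MessageParseError one,
because the subclass would otherwise never get its own handler.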

> The code should probably be added to message.py.  This would fix
> imapfilter and pop3proxy immediately; I gather that the Outlook plugin
> will also use message.py at some point in the future, as will
> hammiefilter.

Just so there's *some* choke point for asking the email pkg to perform this
vulnerable task.