[Spambayes] Heads up! Tokenizer changes
Tim Peters
tim.one at comcast.net
Tue May 20 22:42:55 EDT 2003
[Meyer, Tony]
> I think we really do need to have some sort of solution for these. Is
> there someone that knows enough about how messages can be
> malformed/the errors that the email package throws that can put this
> together?
Nope, not even the author. Email messages have a defined syntax, and the
number of ways to violate the rules is essentially unbounded. Where Barry
(and other contributors) knew of a sensible way to proceed in the face of
errors, the email pkg already tries to do so under strict=False parsing.
Our needs are specific to what we do. In the patch I talked about, the
insane MIME structure could very well be a showstopper for many
applications. In our app, though, we don't really give a rat's ass about
the original MIME, we only want to suck out the words. If what was intended
to be a plain-text part and an HTML part get smushed together, we really
don't care. So it's appropriate for us to catch the exception and rework
the message a bit so that the email pkg can tolerate it.
Overall, that's more a matter of seeing what breaks than of prior analysis;
bad structure seems rare, even in spam. I count about 16 places
message_from_string() is called now. It's used by mboxutils.get_message()
with basic protections that used to live in tokenizer.py. That's probably
the best version to build on. It catches MessageParseError, which is the
base class for all parsing complaints; the BoundaryError subclass caught by
the most recent patch is a special case it would be good to catch there too,
before falling back to a more drastic hack.
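A rough sketch of that layering, for illustration only. It is not the code in
mboxutils.get_message() or the patch: the injectable parse argument and both
fallbacks are assumptions made up for this example, and it is written against
the current stdlib names (email.errors). Newer versions of the email package
tend to record defects on the message rather than raise, so with the default
parser these except branches may never fire.

```python
import email
from email import errors
from email.message import Message
from email.parser import HeaderParser


def get_message(text, parse=email.message_from_string):
    """Return a Message for text, degrading gracefully on parse errors.

    `parse` is a hypothetical hook so the fallbacks can be exercised
    with a parser that raises.
    """
    try:
        return parse(text)
    except errors.BoundaryError:
        # Milder hack (an assumption): keep the parsed headers but leave
        # the body as one unsplit string.  Parts that were meant to be
        # separate get smushed together, which is fine if all we want
        # is the words.
        return HeaderParser().parsestr(text)
    except errors.MessageParseError:
        # Drastic fallback (also an assumption): don't trust the headers
        # either, and wrap the raw text in a bare Message.
        msg = Message()
        msg.set_payload(text)
        return msg
```

BoundaryError has to be listed first, since it is a subclass of
MessageParseError and would otherwise be swallowed by the general handler.
Routing every caller through one function like this is what gives a single
choke point.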
> The code should probably be added to message.py. This would fix
> imapfilter and pop3proxy immediately; I gather that the Outlook plugin
> will also use message.py at some point in the future, as will
> hammiefilter.
Just so there's *some* choke point for asking the email pkg to perform this
vulnerable task.