[Spambayes] Eliminating many malformed spams

Steve Holden sholden at holdenweb.com
Thu Jun 19 13:30:41 EDT 2003


Hi. Time for my latest observation: while spambayes is
great (especially when integrated with Outlook, for this Windows user),
a number of automated mailing systems seem to generate
malformed messages. I am beginning to suspect they do it
deliberately to foil attempts at despamming. The malformation I see
most often, and the one that concerns me most, is an HTML comment
appearing among a message's headers, which leads
to the following typical error trace:

Deleting and spam training message 'Zero Cost - 2 Plane Tickets and 3
Day Hotel Vacation!!' -  FAILED to create email.message from:
'Received: from SMTP32-FWD by ... \r\nContent-type:
text/html\r\n<!--/ad/2/CD7-->\r\nX-UIDL: 354808483\r\n ...
</body>\r\n</html>\r\n\r\n\n'
pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
  File "C:\Python22\\lib\site-packages\win32com\server\policy.py", line
275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python22\\lib\site-packages\win32com\server\policy.py", line
280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None,
None)
  File "C:\Python22\\lib\site-packages\win32com\server\policy.py", line
541, in _invokeex_
    return apply(func, args)
  File "C:\Steve\spambayescvs\spambayes\Outlook2000\addin.py", line 358,
in OnClick
    if train.train_message(msgstore_message, True, self.manager, rescore
= True):
  File "C:\Steve\spambayescvs\spambayes\Outlook2000\train.py", line 46,
in train_message
    stream = msg.GetEmailPackageObject()
  File "C:\Steve\spambayescvs\spambayes\Outlook2000\msgstore.py", line
641, in GetEmailPackageObject
    msg = email.message_from_string(text)
  File "C:\Python22\lib\email\__init__.py", line 52, in
message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "C:\Python22\Lib\email\Parser.py", line 75, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Python22\Lib\email\Parser.py", line 62, in parse
    self._parseheaders(root, fp)
  File "C:\Python22\Lib\email\Parser.py", line 128, in _parseheaders
    raise Errors.HeaderParseError(
email.Errors.HeaderParseError: Not a header, not a continuation:
``<!--/ad/2/CD7-->''

When email.Parser fails on a message it's difficult to state a
general rule for handling it, though it would be nice to be able to
nominate a folder that malformed mails should be sent to. For now,
though, I've found it helpful simply to ignore such
HTML comments. The change below does this in my Python 2.2.2
installation on Windows, and the email module still passes all its
tests. It would be better to have this behavior controlled by the
"strict" switch or similar, but I was being pragmatic here. I haven't
yet attempted the same change against CVS, as I'm not currently in
touch with the latest and greatest, and sadly don't have time for
beta testing.
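
An alternative that leaves the library untouched would be to filter the
text before it ever reaches email.message_from_string. What follows is
only a rough sketch of that idea, not what I am running: the helper name
strip_header_comments is made up for illustration, the sample message is
invented (including the host name), and it is written for a current
Python 3 rather than 2.2. It applies the same startswith/endswith test
as the patch, but only up to the first blank line, so comments in the
body are left alone.

```python
import email


def strip_header_comments(text):
    """Drop lines in the header block that consist solely of an HTML comment.

    Hypothetical helper: the header block is assumed to end at the first
    blank line, and everything after that is passed through unchanged.
    """
    kept = []
    in_headers = True
    for line in text.splitlines(True):  # True keeps the line endings
        bare = line.rstrip("\r\n")
        if in_headers:
            if bare == "":
                # Blank line: headers are over, the body starts here.
                in_headers = False
            elif bare.startswith("<!--") and bare.endswith("-->"):
                # Same test as the Parser.py patch; skip the stray comment.
                continue
        kept.append(line)
    return "".join(kept)


# Invented sample message, shaped like the one in the error trace.
raw = (
    "Received: from SMTP32-FWD by mail.example.invalid\r\n"
    "Content-type: text/html\r\n"
    "<!--/ad/2/CD7-->\r\n"
    "X-UIDL: 354808483\r\n"
    "\r\n"
    "<!--body comment-->\r\n"
    "</html>\r\n"
)

cleaned = strip_header_comments(raw)
msg = email.message_from_string(cleaned)
print(msg["X-UIDL"])
```

In msgstore.py this would presumably wrap the text handed to
email.message_from_string in GetEmailPackageObject, but I have not
tried that. The patch to Parser.py itself follows.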

*** 22Parser.py Thu Jun 19 12:24:19 2003
--- Parser.py   Thu Jun 19 12:05:26 2003
***************
*** 124,129 ****
--- 124,132 ----
                  elif lineno == 1 and line.startswith('--'):
                      # allow through duplicate boundary tags.
                      continue
+                 # SH: ignore inappropriately-positioned HTML comments
+                 elif line.startswith("<!--") and line.endswith("-->"):
+                     continue
                  else:
                      raise Errors.HeaderParseError(
                          "Not a header, not a continuation: ``%s''"%line)

Comments?

regards
--
Steve Holden                                 http://www.holdenweb.com/
Python Web Programming                http://pydish.holdenweb.com/pwp/





