Re: [Tutor] Quick question: Is it more efficient to...

dman dman@dman.ddts.net
Wed, 22 May 2002 11:38:39 -0500



On Tue, May 21, 2002 at 11:13:06AM +0000, alex gigh wrote:
| Hi;
|
| In my mail server, I'm doing some sort of spam detection where I have a
| kill file which is a list of regular expressions and I have to check that
| the mail doesn't contain any of these words...

This sounds reasonable.

| Is it more efficient to (A) check the lines one by one as I receive them or
| (B) save the whole mail in a file and then check the whole mail with the
| kill file...

Check the whole file at once.  It allows the regex engine to do its
own optimization and reduces the number of times you invoke the
regex engine.

You still have some more decisions to make:
    o   what will the regexes be matched against?
        o   the whole raw message?
        o   the whole message after MIME decoding?
        o   just certain MIME parts?
        o   all of the above?
        o   some of the above, as specified by the config file?

    o   do you want to allow line-by-line searching or just global
        searching?  If you want line-by-line you may need to traverse
        the file line-by-line (but check the re module first: its
        MULTILINE flag lets ^ and $ match at every line, which may
        let you do it in one big search)

    o   is the file one big regex or is it a collection of regexes?
        o   if it's a collection, what separates regexes?  newlines?
        o   if it is multiple regexes, can you join them together to
            make one big regex to pass to the regex engine?
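
The last two points can be sketched in Python.  This is only a rough
sketch under assumptions that are not in the original question: the
kill file holds one regex per line, blank lines and lines starting
with '#' are skipped, and the names compile_kill_patterns and is_spam
are made up for illustration.  Each pattern is wrapped in a
non-capturing group and joined with '|', so the regex engine is
invoked once per message.

```python
import re

def compile_kill_patterns(lines):
    """Join one-regex-per-line kill file entries into a single compiled regex.

    Assumed format (not from the original post): one pattern per line,
    blank lines and lines starting with '#' are skipped.
    """
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        re.compile(line)  # fail early, per pattern, on a bad regex
        patterns.append("(?:%s)" % line)
    if not patterns:
        return None
    # MULTILINE lets ^ and $ match at each line, so line-oriented
    # patterns can be checked in one search over the whole message.
    return re.compile("|".join(patterns), re.IGNORECASE | re.MULTILINE)

def is_spam(message_text, kill_regex):
    """Return True if any kill pattern matches anywhere in the message."""
    return kill_regex is not None and kill_regex.search(message_text) is not None

# Sample entries standing in for the contents of a kill file.
kill_regex = compile_kill_patterns([
    "# sample kill file",
    r"make money fast",
    r"^subject:.*\$\$\$",
])
```

Joining is not always safe: patterns that use numbered backreferences
or inline flags such as (?i) may stop working once they are combined,
so it is worth keeping a fallback that tries the patterns one at a
time.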

| I have been told that if I use "pickle" I can do some very efficient
| searching...

I don't understand that.  "pickle" is a way to convert an object to a
stream and then reverse that.  It isn't made to do text searching or
serialize a mail message according to the RFCs.  I wouldn't expect the
output of pickling your object to be helpful at all in scanning for
junk mail.
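
To make that concrete, here is a small sketch of what pickle does.
The dictionary is a made-up stand-in for a message object, not
anything from your server.

```python
import pickle

# A made-up stand-in for a parsed mail message.
message = {"subject": "Make Money Fast", "body": "This is not spam"}

data = pickle.dumps(message)   # object -> bytes
restored = pickle.loads(data)  # bytes -> a new, equal object

# Pickling round-trips the object and nothing more.  To scan the text
# you still need the original strings and the re module.
```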

| Which one would be better and why?

First make it work and make the code understandable.  Then go back and
profile it if it is too slow.
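
If it does turn out to be too slow, the standard library's cProfile
and pstats modules will show where the time goes.  A minimal sketch,
where scan() and the sample data are hypothetical placeholders for
your own checking code:

```python
import cProfile
import io
import pstats
import re

def scan(lines, pattern):
    """Hypothetical function to measure: count lines matching pattern."""
    return sum(1 for line in lines if pattern.search(line))

lines = ["This is not spam", "viagra 100% GARANTEE"] * 1000
pattern = re.compile("viagra", re.IGNORECASE)

profiler = cProfile.Profile()
profiler.enable()
hits = scan(lines, pattern)
profiler.disable()

# Write the ten most expensive calls, by cumulative time, into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
```

Printing report.getvalue() shows the table of calls.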

| Also... which number (i.e. 500) do I send back if:
|
| (1) The user is unknown or domain name isn't accepted
| (2) The mail isn't delivered because it contains an "unaccepted" word

From RFC 821, section 4.2.1:

    550 Requested action not taken: mailbox unavailable
            [E.g., mailbox not found, no access]


Send back a 550 along with an explanatory message.  For a real-world
example:

$ telnet dman.ddts.net smtp
Trying 65.107.69.216...
Connected to dman.dman.ddts.net.
Escape character is '^]'.
220 dman.ddts.net ESMTP Exim 4.04 (#10) Wed, 22 May 2002 11:35:42 -0500
ehlo nowhere
250-dman.ddts.net Hello elijah.iteams.org [65.107.69.197]
250-SIZE 52428800
250-8BITMIME
250-PIPELINING
250 HELP
mail from: <anyone@anywhere.com>
250 OK
rcpt to: <dman@dman.ddts.net>
250 Accepted
data
354 Enter message, ending with "." on a line by itself
From: <anyone@anywhere.com>
To: <anyone@anywhere.com>
Subject: $$$ Make Money Fast $$$ !!!

viagra 100% GARANTEE AMAZING FULL REFUND
This is not spam
.
550-Heuristics guessed that this message was spam:
550 hits=14.1 required=5.0 trigger=11.0


(I've got SpamAssassin hooked into Exim so that it scans messages at
SMTP time and rejects spammy-looking stuff, but blackholes really
spammy-looking stuff)
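
Note the shape of the rejection in the transcript: in a multi-line
SMTP reply every line but the last has a hyphen after the code, and
the last line has a space.  If your server builds its own replies,
something like this sketch would produce that format (smtp_reply is a
made-up helper name, not part of any library):

```python
def smtp_reply(code, lines):
    """Format a (possibly multi-line) SMTP reply.

    Every line but the last is "code-text"; the last is "code text".
    Each line ends with CRLF, as SMTP requires.
    """
    if not lines:
        raise ValueError("an SMTP reply needs at least one line of text")
    out = []
    for i, text in enumerate(lines):
        sep = " " if i == len(lines) - 1 else "-"
        out.append("%d%s%s\r\n" % (code, sep, text))
    return "".join(out)

reply = smtp_reply(550, [
    "Heuristics guessed that this message was spam:",
    "hits=14.1 required=5.0 trigger=11.0",
])
```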

HTH,
-D

-- 

Failure is not an option.  It is bundled with the software.

GnuPG key: http://dman.ddts.net/~dman/public_key.gpg

