[Spambayes] Introducing myself

Robert Woodhead trebor@animeigo.com
Mon Nov 11 18:10:19 2002


>>  If so, assuming the final calc isn't exponential, reducing the lookup
>>  time/resources can be a big win performance-wise.
>
>I don't believe so.  When using a Python dict as "the database", the time
>for scoring a msg is minor compared to the time taken by parsing and
>tokenization, and especially compared to the time just to get the msg *into*
>the system (whether that's file I/O, or socket I/O, or some email pkg's
>programming API, or whatever -- that part is the bottleneck when using a
>dict; when not using a dict, database access time may become a burden, and
>most databases in use here require string keys even if you're working with
>ints -- the database user has to convert the hash code to a string!  Other
>databases (like ZODB) could use ints directly as keys, but they're rare.).

Oh, I'd roll my own, probably using an in-memory hash table scheme. 
If you're hashing to a nice, randomly distributed 32-bit key, you'd 
effectively take the database out of the equation.
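
To make that concrete, here's a rough sketch of the sort of thing I
mean -- the names and the choice of crc32 as the 32-bit hash are mine,
not anything in the spambayes code:

    import zlib

    def token_key(token):
        # Any well-distributed 32-bit hash will do; crc32 is just a
        # convenient stand-in.
        return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF

    class HashedCounts:
        """In-memory table of per-token counts keyed by 32-bit ints,
        so no external database (or string keys) ever gets involved."""

        def __init__(self):
            self.counts = {}                  # int key -> [spam, ham]

        def record(self, token, is_spam):
            entry = self.counts.setdefault(token_key(token), [0, 0])
            entry[0 if is_spam else 1] += 1

        def lookup(self, token):
            return self.counts.get(token_key(token), (0, 0))

Scoring then never touches disk at all; the whole table lives in one
dict, which is the point.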

I think most of the reason I lean this way is that I'm thinking about 
actual implementations (as opposed to testing), and with Bayesian filtering, 
you want to do this as close to each individual user as possible 
(right in the mailreader, via a plugin).  It seems to me that you're 
at the point where testing the effects of data reduction techniques 
would be fruitful.  Once I get up and running on the code (just paid 
the tithe to O'Reilly) I'll test it out.

One thing that occurred to me: now that you have something that seems 
to work pretty well, have you considered backtracking on particular 
features to see how much they contribute; for example, going to a 
trivial state machine parser to spit out tokens?
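
By "trivial state machine parser" I mean something no smarter than
this (a sketch of my own, not the project's tokenizer):

    def trivial_tokens(text):
        # Two states: inside a token, or between tokens.  Accumulate
        # alphanumerics; anything else ends the current token.
        token = []
        for ch in text:
            if ch.isalnum():
                token.append(ch)
            elif token:
                yield "".join(token)
                token = []
        if token:
            yield "".join(token)

Comparing error rates with that against the full tokenizer would show
how much the fancier feature extraction is actually buying.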

>
>>  Note that since you have the text of the token before you hash it,
>>  you can keep that around for significant tokens and display it later.
>
>Good point!  I had overlooked that indeed.

Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!") 
have lots of tricks.  We don't so much write code as remember it and 
retype it.

>>  The cost of the hashing is the inevitable collisions, which
>>  blur the probabilities for colliding tokens.
>
>Another cost is obscuring the code.

Not really; it doesn't really matter what the format of a token 
coming out of the parser is, does it?  You might need an extra data 
structure to take care of the hashed-token/string-token 
correspondences, but you only need to touch that at the end of the 
parser and in the diagnostic output.
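
Something like this is all I have in mind -- a side table the parser
fills in as it emits keys, and that only the diagnostic code ever
reads (the names here are hypothetical):

    class TokenNames:
        """Maps 32-bit keys back to an example of the original text,
        purely so clue listings stay human-readable."""

        def __init__(self):
            self.names = {}                   # int key -> token text

        def remember(self, key, token):
            # Touched once, at the tail end of the parser.
            self.names.setdefault(key, token)

        def describe(self, key):
            # Touched only when formatting diagnostic output.
            return self.names.get(key, "<hash %08x>" % key)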

>They can't really defeat this scheme that way.  At best they can hope to
>push msgs into Unsure territory.

That is good enough, because it means the human has to look at it. 
Which is what spammers want to have happen.

>   What constitutes "very hammy" is a
>function of each user's database here, and no generic blob of text is going
>to score high for hamminess everywhere.

True; then it becomes a game of finding generic messages that are 
likely to evaluate as hammy enough to the average recognizer.  And 
the meta-response is to send out multiple emails with differently 
tuned slices of ham.

I hereby, btw, coin the term "Dagwood" (or perhaps it should be 
Wooddag?) to mean an email containing artfully sliced amounts of ham, 
spam, and html condiments.  ;^)

>>  So one possible approach would be to gradually degrade the
>>  significance of a token the further along in the email it is (both
>>  during training and recognition).
>
>I think there is reason to believe that spammers have to get your attention
>early.  OTOH, many pieces of incriminating evidence also live at the end of
>spams ("this is not spam!" blurbs, the explanation that you got this because
>you're on an opt-in list run by one of their "partners", references to
>various state and federal bills, the "unsubscribe me" URL slash address
>harvester, etc).

Might have to be a U-shaped function then.  Or it may turn out that 
ignoring the stuff at the end doesn't cost much but reduces false 
positives on new (legit) mailing lists.  I'm just throwing out ideas 
for possible tests.
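
As a strawman for such a test, the weight could be a simple U over the
token's relative position -- the exact shape and the 0.25 floor below
are arbitrary guesses on my part, not anything measured:

    def positional_weight(index, total):
        # 1.0 at the start and end of the message, sagging to 0.25 in
        # the middle; multiply a token's contribution by this during
        # both training and scoring.
        if total <= 1:
            return 1.0
        x = index / float(total - 1)          # 0.0 .. 1.0 through the msg
        return 0.25 + 0.75 * (2.0 * x - 1.0) ** 2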

>Yup.  Guido suggested that at the start, but that level of HTML analysis
>gets a lot more expensive too.  We'll see.

Well, what you'd need is a hacked HTML renderer that emitted tuples 
like (token, size, color, background) and ignored words that were 
too small or hard to read.
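
A crude approximation of that hacked renderer, just to show the shape
of it -- this leans on Python's stock HTML parser, only tracks <font>
tags and the body bgcolor, and real CSS would need far more:

    from html.parser import HTMLParser

    class VisibleTextParser(HTMLParser):
        """Emits (token, size, color, background) tuples, skipping
        text a reader couldn't plausibly see."""

        def __init__(self):
            super().__init__()
            self.bgcolor = "#ffffff"
            self.stack = [(3, "#000000")]     # default font size/color
            self.tuples = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "body" and attrs.get("bgcolor"):
                self.bgcolor = attrs["bgcolor"].lower()
            elif tag == "font":
                size, color = self.stack[-1]
                if attrs.get("size"):
                    try:
                        size = int(attrs["size"])
                    except ValueError:
                        pass
                color = (attrs.get("color") or color).lower()
                self.stack.append((size, color))

        def handle_endtag(self, tag):
            if tag == "font" and len(self.stack) > 1:
                self.stack.pop()

        def handle_data(self, data):
            size, color = self.stack[-1]
            if size <= 1 or color == self.bgcolor:
                return                        # too small or invisible
            for word in data.split():
                self.tuples.append((word, size, color, self.bgcolor))

Feed it the HTML part of a message and tokenize only what lands in
.tuples; anything hidden by tiny fonts or background-colored text
simply disappears.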

>
>BTW, on large tests this system scores about 80 msgs/second on my box,
>including everything (system time, training, I/O, parsing, tokenizing,
>scoring, reporting, recording, and analyzing results -- this is # of msgs
>divided by elapsed wall-clock time).  We could afford to get slower, if
>necessary.

And the machines will get faster.  Eventually.

>>  Beware the One True Path.  There is strength in diversity.
>
>Let a thousand classifiers bloom.  If someone here wants to volunteer the
>effort to try a different approach, that's always been welcome.  But the
>results have been so good sticking to one basic approach that I don't see
>that happening.  We ended up doing one thing exceedingly well, and that's a
>contribution to diversity too, of a kind you may be undervaluing <wink>.

I was somewhat teasing you.

>
>>  Or, as the noted philosopher D. Vader put it, "Don't be too proud of
>>  this technological terror you have created."  As you will recall,
>>  those rebel scum managed to craft a nasty false positive.
>
>I don't view an FP as being as costly as needing to build a new Death Star.
>For goodness sake, this is email we're talking about -- anyone trusting a
>truly critical msg to email is dreaming to begin with.

Unfortunately, in the real world, this happens all too often.  Keep 
in mind that the readers of this list are not the typical users of 
the resulting software techniques.

>Well, it's got no semantic knowledge at all.  It doesn't even know which
>language a msg is written in, let alone what it means, and has no concept of
>"word" beyond "stuff that appears between whitespace".  It's very much
>focused on purely local lexical structure.

OK, I was being fuzzy in my use of semantics and syntactics.  Mea culpa.

>>  So train it only on what a human would see reading the message.
>
>We get a lot of value out of mining a handful of header lines.  We also get
>a lot of value out of tokenizing embedded "invisible" URLs.  The theme here
>is that we tokenize "what works", and that's driven by measured error rates;
>philosophy doesn't enter into that part.

Well, I'm thinking of the metagame.  What are the spammer responses 
to a truly effective Bayesian filter?  Obviously, remove those 
features that are typical of spam.  What features cannot be removed 
without making the spam useless as a commercial message?  The actual 
words visible to the reader.

This is what led me to decide, in my testing, to use a simple parser 
that extracted alphanumerics with a few permitted interior 
punctuation characters (like . and '), and which handled tokens with 
interior comments properly.

An interesting test would be to train the system, then run a test 
with a parser that only outputs the simple tokens (simulating a 
spammer response) and see how well it does.
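
For the record, the stripped-down parser I'm describing is roughly
this (the regexes are mine, purely illustrative):

    import re

    COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
    WORD = re.compile(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*")

    def simple_word_tokens(text):
        # Splice out HTML comments first, so a token broken up as
        # "via<!-- x -->gra" comes back together, then pull out
        # alphanumeric runs with interior . or ' allowed.
        return WORD.findall(COMMENT.sub("", text))

Training with the full tokenizer and then scoring with only these
tokens would simulate the spammer who has stripped out every feature
except the words a reader actually sees.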

>I have no real idea, but fear that presuming "yes" is presuming a lot of
>intelligence that systems parsing this header won't actually have.  The
>fancier the rating scheme the fancier they have to be too.  In the end, the
>user has to decide what to do about everything that's not called ham, no
>matter how many or few the non-ham categories.  As a user myself, I've got
>no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats
>me".  That already gives two categories I have to check, and that's enough.
>I do find it useful that my client can sort on the score metadata, and there
>are proposals here too to add fancier header lines beyond the basic
>spam/ham/unsure one.

Fair enough.  Optional fancier header lines would do the job as well.

>>  Murphy's Law guarantees that it will happen.  In fact, it typically
>>  happens (in my painful personal experience) soon after you make
>>  comments like the above.
>
>You realize you're overselling badly here, right <wink>?

If anything, the opposite. <smirk>!

>This is akin to my "entire Nigerian scam quote" FP, and it's all but certain
>that the spam content would overwhelm the brief "from the boss" clues.
>OTOH, if my boss didn't wait for my reply and went ahead and invested
>anyway, the subsequent financial disgrace would open the door for me to take
>his job.  After all, he relied on me for advice, so who more logical to
>succeed him?

Unfortunately, he invested your pension money.  Ooops.  ;^)

R

-- 

Woodhead's Law: "The further you are from your server,  the more likely
it is to crash."


