[spambayes-dev] A URL experiment

Skip Montanaro skip at pobox.com
Wed Dec 31 00:02:11 EST 2003


    Tim> Note that this part of the patch can't be helping much:

    Tim> +             num_pcs = url.count("%")
    Tim> +             if num_pcs:
    Tim> +                 pushclue("url:%d %%s" % num_pcs)

    Tim> That is, raw counts are almost never useful -- if I have a URL in a
    Tim> spam that embeds 40 escapes, that does nothing to indict a URL with
    Tim> 39 (or 41) escapes.  Pumping out log2(a_count) usually does more
    Tim> good.  

I realized that before trying it, but not having any raw data upon which
to base a decision, I left it as-is.  If I enable it I'll look at some
results to see what tokens are actually generated and how they correlate
with ham and spam.  One other possibility would be a sort of "Watership
Down" approach: "1, 2, 3, many" (or something similar - rabbits can't
count very high).  The problem with log2(count) in this situation is that
there seems to be a practical limit to how many % signs a URL might have
(maybe 50?), so something that creates buckets using division
(count // 5 ???) might do a decent job of lumping things together.
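
Something like this, maybe (completely untested; pushclue here just
stands in for however the tokenizer actually emits clues, and the
bucket boundaries are pulled out of thin air):

    import math

    def percent_count_clue(url, pushclue):
        num_pcs = url.count("%")
        if not num_pcs:
            return
        # "Watership Down" buckets: 1, 2, 3, many
        if num_pcs <= 3:
            bucket = str(num_pcs)
        else:
            bucket = "many"
        pushclue("url:%s %%s" % bucket)
        # Alternatives to compare against:
        #   division buckets: pushclue("url:%d %%s" % (num_pcs // 5))
        #   log2 buckets:     pushclue("url:%d %%s" % int(math.log(num_pcs, 2)))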

I'm off work the next couple of days and have some house guests in from out
of town, so I probably won't look at this much.  I will try to at least
build a database from my current training set using this feature and see how
things shake out.  (Maybe tomorrow morning before everyone's up and about.)

    Tim> I *expect* the approach in my patch would work better, though
    Tim> (generating lots of correlated tokens -- there are good reasons to
    Tim> escape some punctuation characters in URLs, but the only good
    Tim> reason to escape a letter or digit is to obfuscate; let the
    Tim> classifier see these things, and it will learn that on its own, as
    Tim> appropriate, for each escape code; then a URL escaping several
    Tim> letters or digits will get penalized more the more heavily it
    Tim> employs this kind of obfuscation).

My problem with that approach is that the stuff spammers escape can be
essentially random, as in the bogus URL you received.  I think you might
get scads of hapaxes (or at least low-count escapes).  Stuff with high
counts will be legitimate (%20 and so forth).  Conclusions obviously
await some eyeballing of databases.
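
For reference, here's roughly what I take the per-escape scheme to be
(again untested, and the token spelling is just my guess, not
necessarily what Tim's patch actually uses):

    import re

    def escape_clues(url, pushclue):
        # One clue per escape code found, so the classifier can learn
        # per-code spamminess on its own.  %20 and escaped punctuation
        # should come out neutral-to-hammy; escaped letters and digits
        # (e.g. %41 for 'A') should come out spammy, and a URL using
        # many of them accumulates that many spammy clues.
        for esc in re.findall(r"%[0-9a-fA-F]{2}", url):
            pushclue("url escape:%s" % esc.lower())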

    >> (*) Operational question: Given that my training data is somewhat
    >> small at the moment (roughly 1000-1500 each of ham and spam), would I
    >> be better off testing with fewer larger sets (e.g., 5 sets w/ 250 msgs
    >> each) or with more smaller sets (e.g., 10 sets w/ 125 msgs each)?

    Tim> If you ask me <wink>, cross-validation should *always* be done with
    Tim> a minimum of 10 sets, regardless of how much data you have.  There
    Tim> are many reasons for this, from statistical reliability of the
    Tim> grand averages at the end (they're subject to central-limit theorem
    Tim> constraints, and the more sets the more reliable they are, growing
    Tim> with the square root of the # of sets); 

Thanks, I will rebalance my training database into 10 sets and see how
that goes.
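
The rebalancing itself is simple enough - shuffle everything and deal
the messages round-robin into ten directories.  A quick sketch (the
Set1..Set10 layout is just illustrative):

    import os, random, shutil

    def rebalance(srcdirs, dstbase, nsets=10):
        msgs = [os.path.join(d, f)
                for d in srcdirs for f in os.listdir(d)]
        random.shuffle(msgs)
        for i, msg in enumerate(msgs):
            dst = os.path.join(dstbase, "Set%d" % (i % nsets + 1))
            if not os.path.isdir(dst):
                os.makedirs(dst)
            shutil.copy(msg, dst)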

    Tim> Note, though, that cross-validation is modeling the performance of
    Tim> a train-on-everything strategy, and in random time order to boot.

The random time order isn't so important to me at the moment, because all
the messages I'm using are recent (received within the past month or so).
The "train on everything" aspect is more interesting.  I find the
cross-validation tests never perform as well as in real life. ;-)

    Tim> If that's not how you train, the results may be irrelevant to what
    Tim> you'll see in real life.  It should be good enough to weed out
    Tim> really bad ideas-- and highlight really good ones --regardless,
    Tim> though.

There's the rub.  What might be really good ideas at this point will
probably only result in very small changes in performance because the
baseline system is currently so good.

Skip



