[Spambayes] dumb question - why not simply the subject?

David LeBlanc whisper@oz.net
Sat, 28 Sep 2002 20:22:42 -0700


I've gotten spams with subjects like "About your recent order" - not easy to
discern that's spam.

Also, the sender's email address isn't reliable either. Those people change
their fake sender addresses faster than a politician changes his campaign
platform during election season ;)

David LeBlanc
Seattle, WA USA

> -----Original Message-----
> From: spambayes-bounces+whisper=oz.net@python.org
> [mailto:spambayes-bounces+whisper=oz.net@python.org]On Behalf Of Tim
> Peters
> Sent: Saturday, September 28, 2002 20:13
> To: skip@pobox.com
> Cc: spambayes@python.org
> Subject: RE: [Spambayes] dumb question - why not simply the subject?
>
>
> [Skip Montanaro]
> > It was (I thought obviously) more of a rhetorical question than
> > anything.  I have no problem identifying spam vs non-spam in my own
> > corpus.
>
> I believe you overestimate your own ability to do this based on subject line
> alone, due to selective memory of outrageously spammish subject lines.
> That's not a jab at you, it's how humans work.  For example, *just* grep for
> the Subject lines in the compilation of false negatives you put on the web.
> If you can nail 100% of those not knowing that you already believe they're
> spam, I won't believe you <wink>.  What are you going to do with a subject
> line that says just "HELP", or "hi"?  I get plenty of those in both my
> real-life ham and spam.
>
> You'll also find that you left at least one
>
> Subject: New subscription request to list CEDU from edshipp@ameritech.net
>
> message in your false negatives, which pretty much kills the notion that
> you're infallible <wink>.
>
> > I don't doubt that the current timcv would do better than
> > your average human, especially if you require response times in the
> > millisecond range,
>
> I was kidding about that part.  I'm certain that the error rates it displays
> on my corpus are better than I could do by hand, though.
>
> > but I suspect with a human and timcv restricted
> > to examining just the subject the human would win.
>
> That's a different scenario entirely.  To mine subject lines, and only
> subject lines, likely requires vast semantic knowledge to do really well.
> Then you're in the realm of AI, and that's got a track record that speaks
> for itself <wink>.
>
> As an experiment, I changed tokenize() to ignore the body completely, and
> changed tokenize_headers() to do only this:
>
>         x = msg.get('subject', '')
>         if x == '':
>             yield 'No subject'
>             return
>         lastt = ''
>         for w in subject_word_re.findall(x):
>             for t in tokenize_word(w):
>                 yield 'subject:' + t
>                 yield 'subject:' + t.lower()
>                 if lastt:
>                     yield 'subject:' + lastt + '<->' + t
>                 lastt = t
>         for w in x.split():
>             yield 'subject:' + w
>             yield 'subject:' + w.lower()
>         for w in junk_re.findall(x):
>             yield 'subject:' + w
>         return
>
> where
>
>     junk_re = re.compile(r"\W+")
>
> IOW, it goes hog wild on the Subject line alone, doing unigrams and bigrams,
> case-folding and not case-folding, splitting on whitespace and searching for
> alphanumeric runs, and even tokenizing runs of pure punctuation.
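
A minimal, self-contained sketch of the subject-only tokenization described
above. The post does not show subject_word_re or tokenize_word, so the
versions below are assumed stand-ins, and tokenize_subject is an illustrative
name; only junk_re is taken from the post as written.

```python
import re

# junk_re is as given in the post: runs of non-word characters.
junk_re = re.compile(r"\W+")

# ASSUMPTION: the real subject_word_re is not shown in the post.
# This stand-in matches runs of word characters plus a few symbols.
subject_word_re = re.compile(r"[\w$.%]+")


def tokenize_word(word):
    """ASSUMPTION: hypothetical simplification that yields the word as-is."""
    yield word


def tokenize_subject(subject):
    """Yield unigram, bigram, case-folded and punctuation tokens."""
    if subject == '':
        yield 'No subject'
        return
    last = ''
    # Pass 1: word-like runs, as-is and lowercased, plus adjacent bigrams.
    for w in subject_word_re.findall(subject):
        for t in tokenize_word(w):
            yield 'subject:' + t
            yield 'subject:' + t.lower()
            if last:
                yield 'subject:' + last + '<->' + t
            last = t
    # Pass 2: whitespace-separated pieces, as-is and lowercased.
    for w in subject.split():
        yield 'subject:' + w
        yield 'subject:' + w.lower()
    # Pass 3: runs of pure punctuation and whitespace.
    for w in junk_re.findall(subject):
        yield 'subject:' + w


tokens = list(tokenize_subject("Surprise ?"))
bigram_tokens = list(tokenize_subject("Free post"))
empty_tokens = list(tokenize_subject(""))
```

With these stand-ins, the punctuation pass turns "Surprise ?" into the token
'subject: ?' (leading space included), which is the kind of clue the
"Surprise ?" note further down in the post refers to.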
>
> Of course this doesn't do as well as our regular tokenization, but it does
> suggest there's a world of clues in subject lines that we could be getting
> more value from.  Cool:  the first false positive it turned up is actually
> spam that nothing else has caught!
>
> *************************************************************************
> Data/Ham/Set1/156846.txt
> prob = 0.637311663424
> prob('subject:post') = 0.061641
> prob('subject:Link') = 0.0652174
> prob('subject:.') = 0.398331
> prob('subject:inside!') = 0.844828
> prob('subject:fresh') = 0.909196
> prob('subject:Fresh') = 0.909218
> prob('subject:Inside') = 0.929104
> prob('subject:! ') = 0.950089
> prob('subject:free') = 0.983887
> prob('subject:Free') = 0.984495
>
> Path: news.baymountain.net!uunet!ash.uu.net!priapus.visi.com!news-out.visi.com!hermes.visi.com!newsfeeds-atl2!news.webusenet.com!pc01.webusenet.com!e3500-atl1.usenetserver.com.POSTED!ijehs.ac.th!YWAVNU
> From: Wayne <YWAVNU@ijehs.ac.th>
> Message-ID: <4FD1F099.2544C6BA@BASIC.BS.WEBUSENET.COM>
> Newsgroups: comp.lang.prograph,comp.lang.prolog,comp.lang.python
> Subject: Free P.0.R.N.0! Link Inside! Fresh post!        5vIiiTQU RtP5cQY0
> Lines: 96
> X-Complaints-To: abuse@usenetserver.com
> X-Abuse-Info: Please be sure to forward a copy of ALL headers
> X-Abuse-Info: Otherwise we will be unable to process your complaint properly.
> NNTP-Posting-Date: Mon, 05 Aug 2002 23:34:02 EDT
> Organization: WEBUSENET.com
> Date: Mon, 6 Aug 2002 03:30:28 GMT
> Xref: news.baymountain.net comp.lang.prograph:12849 comp.lang.prolog:26979
>         comp.lang.python:176063
> To: python-list@python.org
> Sender: python-list-admin@python.org
> Errors-To: python-list-admin@python.org
> X-BeenThere: python-list@python.org
> X-Mailman-Version: 2.0.13 (101270)
> Precedence: bulk
> List-Help: <mailto:python-list-request@python.org?subject=help>
> List-Post: <mailto:python-list@python.org>
> List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
>         <mailto:python-list-request@python.org?subject=subscribe>
> List-Id: General discussion list for the Python programming language
>         <python-list.python.org>
> List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
>         <mailto:python-list-request@python.org?subject=unsubscribe>
> List-Archive: <http://mail.python.org/pipermail/python-list/>
>
> http://www.virtualwebmedia.com
>
> F-R-E-E p o r  n! NO S+P+A+M!
>
> obusj
>
> D2sMJ0EOdQMT
>
> Try arriving the shore's upper coffee and Eve will explain you!
>
> yzehpu@virtualwebmedia.com
>
> nomjyd@virtualwebmedia.com
>
> ogeddusu@virtualwebmedia.com
>
> uN9xyVlj
>
> Are you shallow, I mean, pouring in front of humble bushs?
>
> Sometimes, tags expect near upper houses, unless they're blank.
>
> Hardly any empty ointments look Fred, and they weekly talk Brahimi too.
> Hey, it solves a disk too shallow beside her wet office.  I am
> surprisingly strong, so I learn you.  It will grasp locally if
> Jimmy's potter isn't polite.  Otherwise the kettle in Cristof's
> card might recollect some old desks.  Will you scold among the
> lane, if Julieta furiously laughs the pen?
>
> He'll be liking within urban Norm until his onion climbs lazily.  As
> globally as Ramzi attacks, you can hate the puddle much more
> gently.  Daoud!  You'll pour stickers.  Sometimes, I'll nibble the
> yogi.  I was dining to sow you some of my pathetic films.
>
> Until Ayub cleans the doses stupidly, Jimmy won't dye any proud
> stations.  If the strange pickles can wander undoubtably, the
> worthwhile poultice may promise more showers.  Some weak smart
> dusts tamely fear as the filthy sauces pull.  They are irrigating
> within the spring now, won't seek powders later.  Try not to
> excuse a envelope!  Many angry tapes inside the dull monolith were
> behaving between the younger bedroom.
>
> Hardly any rural bizarre jars will totally attempt the bowls.
> When will we jump after Susan joins the fresh shore's spoon?  She'd rather
> taste finitely than receive with Ziad's new lemon.
>
> Other abysmal sticky clouds will tease quickly with elbows.  Better
> measure porters now or George will deeply love them on you.
>
> We cook them, then we dully waste Saad and Charlene's lazy book.
>
> For Usha the cat's sad, alongside me it's open, whereas in back
> of you it's
> departing short.  The coffees, buckets, and twigs are all raw and
> sharp.  Do not improve nearly while you're recommending before a
> cold teacher.  Try covering the navel's tired draper and Oscar will
> move you!  Saeed, have a dry shoe.  You won't open it.
>
> Lately Beryl will comb the lentil, and if Jbilou incredibly burns it too,
> the
> hat will converse behind the dirty night.  If you will live Rachel's
> structure within figs, it will biweekly call the floor.  Gawd,
> Wally never judges until Zakariya creeps the think ache usably.
> What Penny's kind can walks, Harvey moulds beside poor, pretty
> halls.
>
> Are you dark, I mean, kicking beside weird hens?  I was dreaming
> pumpkins to deep Amber, who's playing through the dryer's stable.
> Mahammed explains, then Hamza loudly fills a ugly cap near Karim's
> light.  You won't answer me helping before your wide river.  Let's
> lift outside the handsome rivers, but don't reject the cosmetic
> pitchers.  Lots of unique painters are hot and other brave jugs are
> lean, but will Yosri care that?  Her dog was humble, blunt, and
> changes under the cellar.  We smell once, shout grudgingly, then
> irritate inside the gardner behind the desert.  She will annually
> order outer and arrives our durable, cheap bushs with a ceiling.
>
> A lot of frogs rigidly believe the difficult doorway.  She wants to
> mould thin smogs above Shah's fog.  Who talks weakly, when Gregory
> joins the full pin without the kiosk?  Every rich farmer or plain, and
> she'll
> inadvertently smell everybody.  It's very clean today, I'll waste
> unbelievably or Allahdad will attack the units.  Where did Agha
> laugh within all the barbers?  We can't fear buttons unless Timothy will
> virtually expect afterwards.  No sauces will be active sweet
> eggs.  David, on games young and healthy, teases within it, behaving
> smartly.  When did Ali kill the printer around the clever weaver?
> I like quiet plates, do you learn them?  It might neatly dream
> against Stephanie when the heavy boats open to the lost camp.
> He will kick stale wrinkles beside the good glad ladder, whilst
> Guido stupidly creeps them too.  Just receiving without a goldsmith
> in back of the highway is too inner for Elisabeth to converse it.  The
> distant walnut rarely grasps Byron, it sows Lisette instead.
> She may excuse the easy cup and reject it within its square.  While
> oranges familiarly judge codes, the cases often burn at the lower
> tailors.
> *************************************************************************
>
> The next false positive was
>
>     Subject: Viewing JPG's
>
> and certainly was not a spam.  Ditto
>
>     Subject: If that interests you...
>
> following, and then
>
>     Subject: Oil on troubled waters ...
>
> and then
>
>     Subject: Find where you left off...
>
> The first false negatives were
>
>     Subject: howdy ukqip
>     Subject: 阿姆瑞特(亚洲)网络有限公司
>     Subject: Re: hi!
>     Subject: a link with minimal effort.
>         [must be about Solaris shared libraries <wink>]
>     Subject: Re: info
>     Subject: Interview Charles Payne
>     Subject: re: order
>     Subject: Possible Cause of Brain Cancer
>     Subject: re
>     Subject: cool
>     Subject: It is beneficial to your library & its patrons to have
>              the book (Please suggest)
>     Subject: Re: ...something to think about?
>     Subject: Python Video Promotion!
>         [LOL!]
>     Subject: RE: hey!
>     Subject: you too?
>     Subject: http://untroubled.org/relay-ctrl/
>     Subject: Surprise ?
>         [variations of "surprise" have high spamprob, but this was
>          saved by prob('subject: ?') = 0.0291312
>          for whatever reason, newbies on c.l.py often leave a space
>          in front of their question marks, and the tokenize-runs-
>          of-punctuation gimmick found that]
>     Subject: Help Needed
>     Subject: Waste Review
>     Subject: Re: Email Confirmation
>     Subject: Re: about this email...
>     Subject: Re: new information
>     Subject: Networking Position
>     Subject: RE:
>     Subject: What do you get when you cross a sorority girl with an ape?
>     Subject: re: entry conf.
>
> Enough already -- it did have one triumph, and plenty of those false
> negatives would have been false negatives for me "by hand" too.  Overall:
>
> -> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
>    [ditto 19 times]
>
> false positive percentages
>     0.000  0.300  lost  +(was 0)
>     0.000  0.200  lost  +(was 0)
>     0.000  0.350  lost  +(was 0)
>     0.000  0.400  lost  +(was 0)
>     0.050  0.500  lost  +900.00%
>     0.000  0.400  lost  +(was 0)
>     0.000  0.550  lost  +(was 0)
>     0.000  0.400  lost  +(was 0)
>     0.000  0.300  lost  +(was 0)
>     0.050  0.450  lost  +800.00%
>
> won   0 times
> tied  0 times
> lost 10 times
>
> total unique fp went from 2 to 77 lost  +3750.00%
> mean fp % went from 0.01 to 0.385 lost  +3750.00%
>
> false negative percentages
>     0.071  4.000  lost  +5533.80%
>     0.071  2.429  lost  +3321.13%
>     0.000  2.643  lost  +(was 0)
>     0.071  3.571  lost  +4929.58%
>     0.143  2.786  lost  +1848.25%
>     0.214  3.357  lost  +1468.69%
>     0.143  3.071  lost  +2047.55%
>     0.143  3.071  lost  +2047.55%
>     0.214  2.929  lost  +1268.69%
>     0.000  2.429  lost  +(was 0)
>
> won   0 times
> tied  0 times
> lost 10 times
>
> total unique fn went from 15 to 424 lost  +2726.67%
> mean fn % went from 0.107142857143 to 3.02857142857 lost  +2726.67%
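
The "lost +N%" figures above read as plain relative changes from the first
column to the second. A short check under that assumption (pct_change is an
illustrative helper, not the comparison script's own code):

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Totals quoted in the post.
fp_total_change = pct_change(2, 77)     # total unique fp: 2 -> 77
fn_total_change = pct_change(15, 424)   # total unique fn: 15 -> 424

# One per-run row, using the rounded percentages as printed (0.071 -> 4.000).
fn_row_change = pct_change(0.071, 4.000)

print("%.2f %.2f %.2f" % (fp_total_change, fn_total_change, fn_row_change))
```

The per-run rows seem to be computed from the rounded percentages as
printed, which would explain why 0.071 to 4.000 shows as +5533.80% rather
than the +5500% that the exact ratio 1/1400 would give.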
>
> ham mean                     ham sdev
>   28.00   14.11  -49.61%        5.80    9.86  +70.00%
>   27.93   13.79  -50.63%        5.62    9.60  +70.82%
>   27.91   14.02  -49.77%        5.76   10.08  +75.00%
>   28.02   13.79  -50.79%        5.67   10.28  +81.31%
>   27.82   14.02  -49.60%        5.85   10.38  +77.44%
>   27.88   13.84  -50.36%        5.53   10.68  +93.13%
>   28.05   14.16  -49.52%        5.69   10.35  +81.90%
>   28.00   14.07  -49.75%        5.54   10.33  +86.46%
>   28.14   14.27  -49.29%        5.61   10.17  +81.28%
>   28.16   13.95  -50.46%        5.93   10.61  +78.92%
>
> ham mean and sdev for all runs
>   27.99   14.00  -49.98%        5.70   10.24  +79.65%
>
> spam mean                    spam sdev
>   85.00   84.96   -0.05%        6.92   12.31  +77.89%
>   84.80   85.33   +0.63%        6.66   11.52  +72.97%
>   84.48   85.83   +1.60%        6.57   11.41  +73.67%
>   85.01   85.25   +0.28%        6.65   12.08  +81.65%
>   85.01   85.83   +0.96%        6.49   11.06  +70.42%
>   84.89   85.60   +0.84%        6.82   11.71  +71.70%
>   84.61   85.21   +0.71%        6.68   11.83  +77.10%
>   85.00   85.85   +1.00%        6.52   11.46  +75.77%
>   85.02   85.61   +0.69%        6.78   11.09  +63.57%
>   84.96   85.80   +0.99%        6.47   11.78  +82.07%
>
> spam mean and sdev for all runs
>   84.88   85.53   +0.77%        6.66   11.64  +74.77%
>
> ham/spam mean difference: 56.89 71.53 +14.64
>
> Histogram analysis suggested:
>
> -> best cutoff for all runs: 0.575
> ->     with weighted total 10*65 fp + 486 fn = 1136
> ->     fp rate 0.325%  fn rate 3.47%
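
The cutoff summary can be reproduced with a little arithmetic, assuming the
totals are 10 runs of 2000 hams and 1400 spams each (as in the <stat> line)
and that a false positive is weighted 10 times a false negative. The
variable names are illustrative:

```python
RUNS = 10
HAMS_PER_RUN = 2000
SPAMS_PER_RUN = 1400
FP_WEIGHT = 10  # a false positive counts as 10 false negatives

fp_count = 65    # false positives at cutoff 0.575, from the post
fn_count = 486   # false negatives at cutoff 0.575, from the post

weighted_total = FP_WEIGHT * fp_count + fn_count
fp_rate = 100.0 * fp_count / (RUNS * HAMS_PER_RUN)
fn_rate = 100.0 * fn_count / (RUNS * SPAMS_PER_RUN)

print("weighted total %d, fp rate %.3f%%, fn rate %.2f%%"
      % (weighted_total, fp_rate, fn_rate))
```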
>
> If someone wants to run with this, the list of best discriminators may be
> helpful (although under Gary's scheme this is more like a list of
> most-frequent discriminators):
>
>     best discriminators:
>         'subject:are' 493 0.846732
>         'subject:mortgage' 500 0.999499
>         'subject:can' 507 0.606534
>         'subject:now' 508 0.927579
>         'subject:PEP' 512 0.000488652
>         'subject:(was' 515 0.000488652
>         'subject:newbie' 522 0.000480307
>         'subject: $' 523 0.992175
>         'subject:no' 528 0.751766
>         'subject:You' 532 0.981916
>         'subject:] ' 535 0.196305
>         'subject:,' 541 0.894722
>         'subject:pep' 546 0.000457829
>         'subject:help' 556 0.288172
>         'subject:problem' 560 0.0836141
>         'subject: & ' 570 0.75976
>         'subject:! ' 597 0.947754
>         'subject:FREE' 598 0.999584
>         'subject:this' 606 0.695102
>         'subject:&' 637 0.731907
>         'subject:How' 671 0.252991
>         'subject:get' 684 0.876916
>         'subject:new' 770 0.672979
>         'subject:question' 810 0.00423811
>         'subject:Your' 883 0.991676
>         'subject:from' 899 0.399686
>         'subject:how' 962 0.236597
>         'subject:was' 969 0.0148958
>         'subject:.' 1036 0.40169
>         'subject: - ' 1126 0.615247
>         'subject:/' 1180 0.357549
>         'subject:free' 1204 0.982183
>         'subject:is' 1216 0.349481
>         'subject:RE:' 1326 0.133728
>         'subject:)' 1364 0.258211
>         'subject: (' 1480 0.139366
>         'subject:with' 1496 0.331923
>         'subject:you' 1737 0.941033
>         'subject:, ' 1955 0.640872
>         'subject:the' 1984 0.635883
>         'subject:!' 2072 0.87807
>         'subject:your' 2260 0.984736
>         'subject:in' 2362 0.310965
>         'subject:and' 2414 0.390133
>         'subject:Python' 4321 0.00115551
>         'subject:?' 4796 0.19175
>         'subject:python' 5482 0.000912976
>         'subject:Re:' 15254 0.0194725
>         'subject:re:' 16662 0.0399341
>         'subject:: ' 17253 0.0965151
>
> Note that variations of "Re:" really gave some spam a boost.