[Spambayes] dumb question - why not simply the subject?
David LeBlanc
whisper@oz.net
Sat, 28 Sep 2002 20:22:42 -0700
I've gotten spams with subjects like "About your recent order" - not easy to
discern that's spam.
Also, sender's email address isn't reliable either. Those people change
their fake sender's address faster than a politician during election season
changes his campaign platform ;)
David LeBlanc
Seattle, WA USA
> -----Original Message-----
> From: spambayes-bounces+whisper=oz.net@python.org
> [mailto:spambayes-bounces+whisper=oz.net@python.org]On Behalf Of Tim
> Peters
> Sent: Saturday, September 28, 2002 20:13
> To: skip@pobox.com
> Cc: spambayes@python.org
> Subject: RE: [Spambayes] dumb question - why not simply the subject?
>
>
> [Skip Montanaro]
> > It was (I thought obviously) more of a rhetorical question than
> > anything. I have no problem identifying spam vs non-spam in my own
> > corpus.
>
> I believe you overestimate your own ability to do this based on subject
> line alone, due to selective memory of outrageously spammish subject lines.
> That's not a jab at you, it's how humans work. For example, *just* grep for
> the Subject lines in the compilation of false negatives you put on the web.
> If you can nail 100% of those not knowing that you already believe they're
> spam, I won't believe you <wink>. What are you going to do with a subject
> line that says just "HELP", or "hi"? I get plenty of those in both my
> real-life ham and spam.
>
> You'll also find that you left at least one
>
> Subject: New subscription request to list CEDU from edshipp@ameritech.net
>
> message in your false negatives, which pretty much kills the notion that
> you're infallible <wink>.
>
> > I don't doubt that the current timcv would do better than
> > your average human, especially if you require response times in the
> > millisecond range,
>
> I was kidding about that part. I'm certain that the error rates it displays
> on my corpus are better than I could do by hand, though.
>
> > but I suspect with a human and timcv restricted
> > to examining just the subject the human would win.
>
> That's a different scenario entirely. To mine subject lines, and only
> subject lines, likely requires vast semantic knowledge to do really well.
> Then you're in the realm of AI, and that's got a track record that speaks
> for itself <wink>.
>
> As an experiment, I changed tokenize() to ignore the body completely, and
> changed tokenize_headers() to do only this:
>
> x = msg.get('subject', '')
> if x == '':
>     yield 'No subject'
>     return
> lastt = ''
> for w in subject_word_re.findall(x):
>     for t in tokenize_word(w):
>         yield 'subject:' + t
>         yield 'subject:' + t.lower()
>         if lastt:
>             yield 'subject:' + lastt + '<->' + t
>         lastt = t
> for w in x.split():
>     yield 'subject:' + w
>     yield 'subject:' + w.lower()
> for w in junk_re.findall(x):
>     yield 'subject:' + w
> return
>
> where
>
> junk_re = re.compile(r"\W+")
>
> IOW, it goes hog wild on the Subject line alone, doing unigrams and bigrams,
> case-folding and not case-folding, splitting on whitespace and searching for
> alphanumeric runs, and even tokenizing runs of pure punctuation.
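
For anyone who wants to try the experiment quoted above, here is a minimal runnable sketch of a subject-only tokenizer. The message does not show subject_word_re or tokenize_word, so the versions below are simplified stand-ins (assumptions), and tokenize_subject is a hypothetical name; the real Spambayes definitions may well differ.

```python
import re
from email.message import Message

# ASSUMPTION: simplified stand-in; the real subject_word_re is not shown above.
subject_word_re = re.compile(r"\w+")
# From the quoted message: runs of non-word characters (punctuation, spaces).
junk_re = re.compile(r"\W+")


def tokenize_word(word):
    """ASSUMPTION: hypothetical stand-in that yields the word unchanged."""
    yield word


def tokenize_subject(msg):
    """Yield tokens from the Subject header only (hypothetical wrapper name)."""
    x = msg.get('subject', '')
    if x == '':
        yield 'No subject'
        return
    lastt = ''
    # Unigrams (as-is and lowercased) plus bigrams of adjacent word tokens.
    for w in subject_word_re.findall(x):
        for t in tokenize_word(w):
            yield 'subject:' + t
            yield 'subject:' + t.lower()
            if lastt:
                yield 'subject:' + lastt + '<->' + t
            lastt = t
    # Whitespace-separated chunks, as-is and lowercased.
    for w in x.split():
        yield 'subject:' + w
        yield 'subject:' + w.lower()
    # Runs of punctuation and whitespace.
    for w in junk_re.findall(x):
        yield 'subject:' + w


msg = Message()
msg['Subject'] = 'Surprise ?'
tokens = list(tokenize_subject(msg))
print(tokens)
```

With the subject "Surprise ?", the punctuation-run pass produces the token 'subject: ?' (space, then question mark), which is the kind of clue the junk_re step is there to pick up.
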
>
> Of course this doesn't do as well as our regular tokenization, but it does
> suggest there's a world of clues in subject lines that we could be getting
> more value from. Cool: the first false positive it turned up is actually
> spam that nothing else has caught!
>
> *************************************************************************
> Data/Ham/Set1/156846.txt
> prob = 0.637311663424
> prob('subject:post') = 0.061641
> prob('subject:Link') = 0.0652174
> prob('subject:.') = 0.398331
> prob('subject:inside!') = 0.844828
> prob('subject:fresh') = 0.909196
> prob('subject:Fresh') = 0.909218
> prob('subject:Inside') = 0.929104
> prob('subject:! ') = 0.950089
> prob('subject:free') = 0.983887
> prob('subject:Free') = 0.984495
>
> Path:
> news.baymountain.net!uunet!ash.uu.net!priapus.visi.com!news-out.visi.com!
> hermes.visi.com!newsfeeds-atl2!news.webusenet.com!pc01.webusenet.com!
> e3500-atl1.usenetserver.com.POSTED!ijehs.ac.th!YWAVNU
> From: Wayne <YWAVNU@ijehs.ac.th>
> Message-ID: <4FD1F099.2544C6BA@BASIC.BS.WEBUSENET.COM>
> Newsgroups: comp.lang.prograph,comp.lang.prolog,comp.lang.python
> Subject: Free P.0.R.N.0! Link Inside! Fresh post! 5vIiiTQU RtP5cQY0
> Lines: 96
> X-Complaints-To: abuse@usenetserver.com
> X-Abuse-Info: Please be sure to forward a copy of ALL headers
> X-Abuse-Info: Otherwise we will be unable to process your complaint
> properly.
> NNTP-Posting-Date: Mon, 05 Aug 2002 23:34:02 EDT
> Organization: WEBUSENET.com
> Date: Mon, 6 Aug 2002 03:30:28 GMT
> Xref: news.baymountain.net comp.lang.prograph:12849 comp.lang.prolog:26979
> comp.lang.python:176063
> To: python-list@python.org
> Sender: python-list-admin@python.org
> Errors-To: python-list-admin@python.org
> X-BeenThere: python-list@python.org
> X-Mailman-Version: 2.0.13 (101270)
> Precedence: bulk
> List-Help: <mailto:python-list-request@python.org?subject=help>
> List-Post: <mailto:python-list@python.org>
> List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
> <mailto:python-list-request@python.org?subject=subscribe>
> List-Id: General discussion list for the Python programming language
> <python-list.python.org>
> List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
> <mailto:python-list-request@python.org?subject=unsubscribe>
> List-Archive: <http://mail.python.org/pipermail/python-list/>
>
> http://www.virtualwebmedia.com
>
> F-R-E-E p o r n! NO S+P+A+M!
>
> obusj
>
> D2sMJ0EOdQMT
>
> Try arriving the shore's upper coffee and Eve will explain you!
>
> yzehpu@virtualwebmedia.com
>
> nomjyd@virtualwebmedia.com
>
> ogeddusu@virtualwebmedia.com
>
> uN9xyVlj
>
> Are you shallow, I mean, pouring in front of humble bushs?
>
> Sometimes, tags expect near upper houses, unless they're blank.
>
> Hardly any empty ointments look Fred, and they weekly talk Brahimi too.
> Hey, it solves a disk too shallow beside her wet office. I am
> surprisingly strong, so I learn you. It will grasp locally if
> Jimmy's potter isn't polite. Otherwise the kettle in Cristof's
> card might recollect some old desks. Will you scold among the
> lane, if Julieta furiously laughs the pen?
>
> He'll be liking within urban Norm until his onion climbs lazily. As
> globally as Ramzi attacks, you can hate the puddle much more
> gently. Daoud! You'll pour stickers. Sometimes, I'll nibble the
> yogi. I was dining to sow you some of my pathetic films.
>
> Until Ayub cleans the doses stupidly, Jimmy won't dye any proud
> stations. If the strange pickles can wander undoubtably, the
> worthwhile poultice may promise more showers. Some weak smart
> dusts tamely fear as the filthy sauces pull. They are irrigating
> within the spring now, won't seek powders later. Try not to
> excuse a envelope! Many angry tapes inside the dull monolith were
> behaving between the younger bedroom.
>
> Hardly any rural bizarre jars will totally attempt the bowls.
> When will we jump after Susan joins the fresh shore's spoon? She'd rather
> taste finitely than receive with Ziad's new lemon.
>
> Other abysmal sticky clouds will tease quickly with elbows. Better
> measure porters now or George will deeply love them on you.
>
> We cook them, then we dully waste Saad and Charlene's lazy book.
>
> For Usha the cat's sad, alongside me it's open, whereas in back
> of you it's
> departing short. The coffees, buckets, and twigs are all raw and
> sharp. Do not improve nearly while you're recommending before a
> cold teacher. Try covering the navel's tired draper and Oscar will
> move you! Saeed, have a dry shoe. You won't open it.
>
> Lately Beryl will comb the lentil, and if Jbilou incredibly burns it too,
> the
> hat will converse behind the dirty night. If you will live Rachel's
> structure within figs, it will biweekly call the floor. Gawd,
> Wally never judges until Zakariya creeps the think ache usably.
> What Penny's kind can walks, Harvey moulds beside poor, pretty
> halls.
>
> Are you dark, I mean, kicking beside weird hens? I was dreaming
> pumpkins to deep Amber, who's playing through the dryer's stable.
> Mahammed explains, then Hamza loudly fills a ugly cap near Karim's
> light. You won't answer me helping before your wide river. Let's
> lift outside the handsome rivers, but don't reject the cosmetic
> pitchers. Lots of unique painters are hot and other brave jugs are
> lean, but will Yosri care that? Her dog was humble, blunt, and
> changes under the cellar. We smell once, shout grudgingly, then
> irritate inside the gardner behind the desert. She will annually
> order outer and arrives our durable, cheap bushs with a ceiling.
>
> A lot of frogs rigidly believe the difficult doorway. She wants to
> mould thin smogs above Shah's fog. Who talks weakly, when Gregory
> joins the full pin without the kiosk? Every rich farmer or plain, and
> she'll
> inadvertently smell everybody. It's very clean today, I'll waste
> unbelievably or Allahdad will attack the units. Where did Agha
> laugh within all the barbers? We can't fear buttons unless Timothy will
> virtually expect afterwards. No sauces will be active sweet
> eggs. David, on games young and healthy, teases within it, behaving
> smartly. When did Ali kill the printer around the clever weaver?
> I like quiet plates, do you learn them? It might neatly dream
> against Stephanie when the heavy boats open to the lost camp.
> He will kick stale wrinkles beside the good glad ladder, whilst
> Guido stupidly creeps them too. Just receiving without a goldsmith
> in back of the highway is too inner for Elisabeth to converse it. The
> distant walnut rarely grasps Byron, it sows Lisette instead.
> She may excuse the easy cup and reject it within its square. While
> oranges familiarly judge codes, the cases often burn at the lower
> tailors.
> *************************************************************************
>
> The next false positive was
>
> Subject: Viewing JPG's
>
> and certainly was not a spam. Ditto
>
> Subject: If that interests you...
>
> following, and then
>
> Subject: Oil on troubled waters ...
>
> and then
>
> Subject: Find where you left off...
>
> The first false negatives were
>
> Subject: howdy ukqip
> Subject: 阿姆瑞特(亚洲)网络有限公司
> Subject: Re: hi!
> Subject: a link with minimal effort.
> [must be about Solaris shared libraries <wink>]
> Subject: Re: info
> Subject: Interview Charles Payne
> Subject: re: order
> Subject: Possible Cause of Brain Cancer
> Subject: re
> Subject: cool
> Subject: It is beneficial to your library & its patrons to have
> the book (Please suggest)
> Subject: Re: ...something to think about?
> Subject: Python Video Promotion!
> [LOL!]
> Subject: RE: hey!
> Subject: you too?
> Subject: http://untroubled.org/relay-ctrl/
> Subject: Surprise ?
> [variations of "surprise" have high spamprob, but this was
> saved by prob('subject: ?') = 0.0291312
> for whatever reason, newbies on c.l.py often leave a space
> in front of their question marks, and the tokenize-runs-
> of-punctuation gimmick found that]
> Subject: Help Needed
> Subject: Waste Review
> Subject: Re: Email Confirmation
> Subject: Re: about this email...
> Subject: Re: new information
> Subject: Networking Position
> Subject: RE:
> Subject: What do you get when you cross a sorority girl with an ape?
> Subject: re: entry conf.
>
> Enough already -- it did have one triumph, and plenty of those false
> negatives would have been false negatives for me "by hand" too. Overall:
>
> -> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
> [ditto 19 times]
>
> false positive percentages
> 0.000 0.300 lost +(was 0)
> 0.000 0.200 lost +(was 0)
> 0.000 0.350 lost +(was 0)
> 0.000 0.400 lost +(was 0)
> 0.050 0.500 lost +900.00%
> 0.000 0.400 lost +(was 0)
> 0.000 0.550 lost +(was 0)
> 0.000 0.400 lost +(was 0)
> 0.000 0.300 lost +(was 0)
> 0.050 0.450 lost +800.00%
>
> won 0 times
> tied 0 times
> lost 10 times
>
> total unique fp went from 2 to 77 lost +3750.00%
> mean fp % went from 0.01 to 0.385 lost +3750.00%
>
> false negative percentages
> 0.071 4.000 lost +5533.80%
> 0.071 2.429 lost +3321.13%
> 0.000 2.643 lost +(was 0)
> 0.071 3.571 lost +4929.58%
> 0.143 2.786 lost +1848.25%
> 0.214 3.357 lost +1468.69%
> 0.143 3.071 lost +2047.55%
> 0.143 3.071 lost +2047.55%
> 0.214 2.929 lost +1268.69%
> 0.000 2.429 lost +(was 0)
>
> won 0 times
> tied 0 times
> lost 10 times
>
> total unique fn went from 15 to 424 lost +2726.67%
> mean fn % went from 0.107142857143 to 3.02857142857 lost +2726.67%
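
The "lost +N%" figures in these tables look like the relative change computed from the displayed before/after values, with a special case when the baseline is zero. A small sketch of that arithmetic, assuming smaller is better (as it is for error rates); describe_change is a hypothetical helper, not the project's comparison script.

```python
def describe_change(before, after):
    """Label the move from `before` to `after`, where smaller is better."""
    if after == before:
        return "tied"
    tag = "lost" if after > before else "won"
    if before == 0:
        # A relative change from a zero baseline is undefined.
        return "%s +(was 0)" % tag
    pct = (after - before) / before * 100.0
    return "%s %+.2f%%" % (tag, pct)


print(describe_change(0.05, 0.5))    # fp row: 0.050 -> 0.500
print(describe_change(0.0, 0.3))     # fp row: 0.000 -> 0.300
print(describe_change(0.071, 4.0))   # fn row: 0.071 -> 4.000
print(describe_change(2, 77))        # total unique fp
```
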
>
> ham mean ham sdev
> 28.00 14.11 -49.61% 5.80 9.86 +70.00%
> 27.93 13.79 -50.63% 5.62 9.60 +70.82%
> 27.91 14.02 -49.77% 5.76 10.08 +75.00%
> 28.02 13.79 -50.79% 5.67 10.28 +81.31%
> 27.82 14.02 -49.60% 5.85 10.38 +77.44%
> 27.88 13.84 -50.36% 5.53 10.68 +93.13%
> 28.05 14.16 -49.52% 5.69 10.35 +81.90%
> 28.00 14.07 -49.75% 5.54 10.33 +86.46%
> 28.14 14.27 -49.29% 5.61 10.17 +81.28%
> 28.16 13.95 -50.46% 5.93 10.61 +78.92%
>
> ham mean and sdev for all runs
> 27.99 14.00 -49.98% 5.70 10.24 +79.65%
>
> spam mean spam sdev
> 85.00 84.96 -0.05% 6.92 12.31 +77.89%
> 84.80 85.33 +0.63% 6.66 11.52 +72.97%
> 84.48 85.83 +1.60% 6.57 11.41 +73.67%
> 85.01 85.25 +0.28% 6.65 12.08 +81.65%
> 85.01 85.83 +0.96% 6.49 11.06 +70.42%
> 84.89 85.60 +0.84% 6.82 11.71 +71.70%
> 84.61 85.21 +0.71% 6.68 11.83 +77.10%
> 85.00 85.85 +1.00% 6.52 11.46 +75.77%
> 85.02 85.61 +0.69% 6.78 11.09 +63.57%
> 84.96 85.80 +0.99% 6.47 11.78 +82.07%
>
> spam mean and sdev for all runs
> 84.88 85.53 +0.77% 6.66 11.64 +74.77%
>
> ham/spam mean difference: 56.89 71.53 +14.64
>
> Histogram analysis suggested:
>
> -> best cutoff for all runs: 0.575
> -> with weighted total 10*65 fp + 486 fn = 1136
> -> fp rate 0.325% fn rate 3.47%
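
The "weighted total" line counts each false positive ten times as heavily as a false negative. Below is a sketch of that arithmetic, plus a brute-force cutoff scan over toy scores. The function names, the toy data, and the "score above the cutoff means spam" rule are assumptions for illustration, not the histogram code itself. The totals of 20000 hams and 14000 spams assume the 10 runs of 2000 hams and 1400 spams reported above.

```python
def weighted_total(fp, fn, fp_weight=10):
    """Cost of a cutoff: each false positive counts fp_weight times."""
    return fp_weight * fp + fn


def best_cutoff(ham_scores, spam_scores, cutoffs, fp_weight=10):
    """Return (cost, cutoff, fp, fn) for the cheapest candidate cutoff.

    ASSUMPTION: a message is called spam when its score exceeds the cutoff.
    """
    best = None
    for c in cutoffs:
        fp = sum(1 for s in ham_scores if s > c)    # hams called spam
        fn = sum(1 for s in spam_scores if s <= c)  # spams called ham
        cost = weighted_total(fp, fn, fp_weight)
        if best is None or cost < best[0]:
            best = (cost, c, fp, fn)
    return best


# Reproduce the arithmetic from the summary above.
n_ham, n_spam = 10 * 2000, 10 * 1400
fp, fn = 65, 486
cost = weighted_total(fp, fn)
fp_rate = 100.0 * fp / n_ham
fn_rate = 100.0 * fn / n_spam
print(cost, fp_rate, round(fn_rate, 2))

# Toy scores (made up) just to exercise the scan.
toy = best_cutoff([0.1, 0.2, 0.6], [0.5, 0.9, 0.95], [0.4, 0.575, 0.7])
print(toy)
```
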
>
> If someone wants to run with this, the list of best discriminators may be
> helpful (although under Gary's scheme this is more like a list of
> most-frequent discriminators):
>
> best discriminators:
> 'subject:are' 493 0.846732
> 'subject:mortgage' 500 0.999499
> 'subject:can' 507 0.606534
> 'subject:now' 508 0.927579
> 'subject:PEP' 512 0.000488652
> 'subject:(was' 515 0.000488652
> 'subject:newbie' 522 0.000480307
> 'subject: $' 523 0.992175
> 'subject:no' 528 0.751766
> 'subject:You' 532 0.981916
> 'subject:] ' 535 0.196305
> 'subject:,' 541 0.894722
> 'subject:pep' 546 0.000457829
> 'subject:help' 556 0.288172
> 'subject:problem' 560 0.0836141
> 'subject: & ' 570 0.75976
> 'subject:! ' 597 0.947754
> 'subject:FREE' 598 0.999584
> 'subject:this' 606 0.695102
> 'subject:&' 637 0.731907
> 'subject:How' 671 0.252991
> 'subject:get' 684 0.876916
> 'subject:new' 770 0.672979
> 'subject:question' 810 0.00423811
> 'subject:Your' 883 0.991676
> 'subject:from' 899 0.399686
> 'subject:how' 962 0.236597
> 'subject:was' 969 0.0148958
> 'subject:.' 1036 0.40169
> 'subject: - ' 1126 0.615247
> 'subject:/' 1180 0.357549
> 'subject:free' 1204 0.982183
> 'subject:is' 1216 0.349481
> 'subject:RE:' 1326 0.133728
> 'subject:)' 1364 0.258211
> 'subject: (' 1480 0.139366
> 'subject:with' 1496 0.331923
> 'subject:you' 1737 0.941033
> 'subject:, ' 1955 0.640872
> 'subject:the' 1984 0.635883
> 'subject:!' 2072 0.87807
> 'subject:your' 2260 0.984736
> 'subject:in' 2362 0.310965
> 'subject:and' 2414 0.390133
> 'subject:Python' 4321 0.00115551
> 'subject:?' 4796 0.19175
> 'subject:python' 5482 0.000912976
> 'subject:Re:' 15254 0.0194725
> 'subject:re:' 16662 0.0399341
> 'subject:: ' 17253 0.0965151
>
> Note that variations of "Re:" really gave some spam a boost.