Ham or Spam? (was RE: [Spambayes] RE: Central Limit Theorem??!! :))

Sat, 28 Sep 2002 02:27:33 -0400

[Tim]
> FYI, another preliminary observation is that the logarithmic
> central-limit scheme seems (in my data) to be very unsure (high
> sdevs away from both means) about:
>
> + Msgs in German.
> + Msgs in French.
> + Msgs in Spanish.

On second look, this *may* be limited to short msgs in "funny languages".

[Greg Ward]
> Good, I think!  Two of my persistent FPs are:
>
>   * a message from CIRA (the Canadian Internet Registration Authority),
>     which is half in French
>   * a one-line "subscribe" sent to <mumble>-request@python.org,
>     with a loooong disclaimer in Spanish
>
> Both of these are killed by perfectly ordinary, vanilla French and
> Spanish words, because I get very little real email in French and none
> in Spanish.  (Oddly enough, I get very little spam in French too.  In
> fact, most of the spam that gets past SpamAssassin to me is in German,
> which highlights another of SA's weaknesses and a *potential* strength
> of the statistical approach.  I used to get a lot of spam in Portuguese,
> but I added some SA tests that score messages from Brazil and Portugal a
> bit higher.  Seems to have put a stop to that.)

Virtually all words from inferior languages <wink> have high spamprob in my
corpus too.  For a long time, an announcement of a new Spanish Python group
was an f-p for me; I'm not sure what eventually redeemed it.

> I bet Italian would suffer the same fate except that I have a week's
> worth of traffic from zope-it@zope.org in my ham collection.
>
> Incidentally, here are the probs for the CIRA message -- interestingly,
> this is one of the few messages in the gward-ham corpus that was *not*
> sent to gward@python.net, so it still has many traces of my ISP in the
> Received headers.  None of my spam went through my ISP, so all those
> clues are 0.01 clues -- but they still didn't help.

You have to stop using the Graham scheme now (because the code no longer
exists <wink>).  Random cancellation of many 0.01 and 0.99 clues was unique
to it, and was an artifact of its arbitrary limitations.  This kind of thing
doesn't happen under the new scheme.  That doesn't mean it will stop being
an f-p, but it *will* stop getting a score of 1.0.  Let me know how it does
now.
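
For anyone who wants to see the cancellation in numbers, here is a minimal
sketch.  It assumes Graham's published combining rule,
prod(p) / (prod(p) + prod(1-p)), applied to every clue in the CIRA listing
quoted below; graham_combine is a made-up name for illustration, and the
Graham-era Spambayes code may have selected or cancelled clues differently.

```python
def graham_combine(probs):
    """Combine per-clue spam probabilities with Graham's product rule."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p          # evidence for spam
        ham *= 1.0 - p     # evidence for ham
    return spam / (spam + ham)


# Clue counts read off the CIRA listing: 18 clues at 0.01,
# one at 0.9875, and 33 at 0.99.
cira_clues = [0.01] * 18 + [0.9875] + [0.99] * 33
score = graham_combine(cira_clues)

# Equal numbers of 0.01 and 0.99 clues cancel each other out,
# leaving a score of (essentially) 0.5.
balanced = graham_combine([0.01] * 18 + [0.99] * 18)

print(score, balanced)
```

Under this rule each 0.01 clue is wiped out by one 0.99 clue, so the 18 ham
clues from the Received headers buy nothing against 33 French words at 0.99,
and the surplus alone drives the score to the top of the scale.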

> Data/Ham/Set9/1014826921.10087.cthulhu.gerg.ca:2,S
> prob = 1.0
> prob('membres') = 0.01
> prob('to:videotron.ca') = 0.01
> prob('received:greg') = 0.01
> prob('cher') = 0.01
> prob('positions.') = 0.01
> prob('received:videotron.ca') = 0.01
> prob('header:Return-path:1') = 0.01
> prob('received:bellnexxia.net') = 0.01
> prob('nominations') = 0.01
> prob('re:') = 0.01
> prob('received:24.201') = 0.01
> prob('received:pop.videotron.ca') = 0.01
> prob('received:24.201.245') = 0.01
> prob('avis') = 0.01
> prob('received:24.201.245.36') = 0.01
> prob('received:sc1.videotron.ca') = 0.01
> prob('return-path:members') = 0.01
> prob('received:vl-ms-mr002.sc1.videotron.ca') = 0.01
> prob('des') = 0.9875
> prob('sera') = 0.99
> prob('comit\xe9') = 0.99
> prob('qui') = 0.99
> prob('enabling') = 0.99
> prob('voter') = 0.99
> prob('mai') = 0.99
> prob('besoin') = 0.99
> prob('mises') = 0.99
> prob('apr\xe8s') = 0.99
> prob('ligne') = 0.99
> prob('mars') = 0.99
> prob('candidat') = 0.99
> prob('election') = 0.99
> prob('vous') = 0.99
> prob('aussi') = 0.99
> prob('une') = 0.99
> prob('trois') = 0.99
> prob('entre') = 0.99
> prob('ces') = 0.99
> prob('avril') = 0.99
> prob('leur') = 0.99
> prob('directors.') = 0.99
> prob('objet') = 0.99
> prob('faire') = 0.99
> prob('nominated') = 0.99
> prob('personnes') = 0.99
> prob('veuillez') = 0.99
> prob('elected') = 0.99
> prob('toute') = 0.99
> prob('received:sympatico.ca') = 0.99
> prob('possession,') = 0.99
> prob('poser') = 0.99
> prob('peut') = 0.99
>
> Would be interesting to score just the English half of that message.

Expensive, too <0.5 wink>.

> Oh, here are the probs for the "subscribe" request with Spanish
> disclaimer:

This is more Grahamian "cancellation disease".

> Data/Ham/Set10/17rK7U-0007e7-00:2,S
> prob = 1.0
> prob('subject:subscribe') = 0.01
> prob('confidencial') = 0.01
> prob('autorizado') = 0.01
> prob('contenida') = 0.01
> prob('fon') = 0.01
> prob('mensaje.') = 0.01
> prob('message-id:skip:A 40') = 0.01
> prob('est\xe1') = 0.99
> prob('cual') = 0.99
> prob('mensajes') = 0.99
> prob('pueda') = 0.99
> prob('hacer') = 0.99
> prob('diego') = 0.99
> prob('llegar') = 0.99
> prob('informaci\xf3n') = 0.99
> prob('alguna') = 0.99
> prob('exclusivo') = 0.99
> prob('empresa') = 0.99
> prob('motivo') = 0.99
> prob('telef\xf3nica') = 0.99
> prob('gesti\xf3n') = 0.99
> prob('haber') = 0.99
> prob('received:167]') = 0.99
> prob('propias') = 0.99
> prob('ud.') = 0.99
> prob('todas') = 0.99
> prob('mismo,') = 0.99
> prob('son') = 0.99
> prob('personales') = 0.99
> prob('pueden') = 0.99
> prob('responsable') = 0.99
>
> This message has got to win some sort of prize for the wheat-to-chaff
> ratio: it requires 4.6 kB of signature, MIME junk, and disclaimer to say
> "subscribe".  Here are the headers:
>
> """
> Return-Path: <DQuevedo@uniFON.com.ar>
> Envelope-To: python-list-request@python.org
> Received: from [200.16.211.167] (helo=estcp.tcp.com.ar)
>         by mail.python.org with esmtp (Exim 4.05)
>         id 17rK7U-0007e7-00
>         for python-list-request@python.org; Tue, 17 Sep 2002 11:18:32 -0400
> Received: by noticias.unifon.com.ar with Internet Mail Service (5.5.2653.19)
>         id <S6X3XHKB>; Tue, 17 Sep 2002 12:15:11 -0300
> Message-ID: <A128D751272CD411BC9200508BC2194D019B6215@escpl.tcp.com.ar>
> From: "Quevedo, Diego" <DQuevedo@uniFON.com.ar>
> To: "'python-list-request@python.org'" <python-list-request@python.org>
> Subject: subscribe
> Date: Tue, 17 Sep 2002 12:15:33 -0300
> Return-Receipt-To: "Quevedo, Diego" <DQuevedo@uniFON.com.ar>
> MIME-Version: 1.0
> X-Mailer: Internet Mail Service (5.5.2653.19)
> Content-Type: multipart/alternative;
>         boundary="----_=_NextPart_001_01C25E5D.1189FAE0"
> """
>
> and here is the text/plain part:
>
> """
> subscribe
>
>
> Diego Quevedo
> Gestión de Red
> Tel: (54-11) 4324-9103
> Cel: 15-5132-0135
> dquevedo@unifon.com.ar
> unifon

I'm going to spare everyone a repetition of the rest <wink>.  "subscribe" /
"unsubscribe" msgs are a real bitch, and I expect that under the current
scheme you'll find that both words have high spamprob (well, they did under
the Graham scheme too, but buried under so many bogus probabilities of 0.01
and 0.99 that sometimes you never got to looking at them).  BTW, note that
"subscribe" in the Subject line had a spamprob of 0.01 there:  context
tagging is an extension of the Graham scheme, and was a big win for us.
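
A toy sketch of what context tagging amounts to, assuming nothing more than
whitespace splitting and lowercasing.  The names tag_tokens and tokenize are
hypothetical, and the real tokenizer does a great deal more than this.

```python
def tag_tokens(header_name, text):
    """Prefix each lowercased word with the header it came from."""
    prefix = header_name.lower() + ":"
    return [prefix + word for word in text.lower().split()]


def tokenize(subject, body):
    # Body words stay bare; Subject words carry a "subject:" tag,
    # so the same word in the two places is counted as two different clues.
    return tag_tokens("Subject", subject) + body.lower().split()


tokens = tokenize("subscribe", "subscribe\n\nDiego Quevedo")
print(tokens)
```

Because 'subject:subscribe' and the bare body word 'subscribe' are separate
tokens, each gets its own spamprob, which is how the Subject-line clue could
sit at 0.01 regardless of what the body word scores.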