[spambayes-dev] empty urls in bigram?

Skip Montanaro skip at pobox.com
Wed Dec 17 18:07:21 EST 2003


   I just noticed this bigram in my clues: 'bi:url: url:'.  If 'url:' would
   only be presented once as a clue, does it make sense to form a bigram
   with two instances of it?

More examples:

    >>> import re
    >>> [k for k in db if re.match(r"bi:([^ ]+) \1$", k) is not None]
    ['bi:very, very,',
     'bi:charset:utf-8 charset:utf-8',
     'bi:megamek megamek',
     'bi:time time',
     'bi:[input] [input]',
     'bi:billboard billboard',
     'bi:the the',
     'bi:state state',
     'bi:prince prince',
     'bi:subject:$ subject:$',
     'bi:phpmyadmin phpmyadmin',
     'bi:amsn amsn',
     'bi:fund fund',
     'bi:against, against,',
     'bi:camera camera',
     'bi:received:mailnull@localhost) received:mailnull@localhost)',
     'bi:pago pago',
     'bi:chicago chicago',
     'bi:charset:iso-8859-1 charset:iso-8859-1',
     'bi:pdfcreator pdfcreator',
     'bi:gour gour',
     'bi:subject:. subject:.',
     'bi:received:30950 received:30950',
     'bi:subject:- subject:-',
     "bi:subject:' subject:'",
     'bi:fma fma',
     'bi:subject:.. subject:..',
     'bi:miktex miktex',
     'bi:this this',
     'bi:help help',
     'bi:url:2 url:2',
     'bi:fluid fluid',
     'bi:sell, sell,',
     'bi:$50.00 $50.00',
     'bi:forum forum',
     'bi:scummvm scummvm',
     'bi:url:com url:com',
     'bi:received:2612 received:2612',
     'bi:download download',
     'bi:hanukah hanukah',
     'bi:becomes becomes',
     'bi:men men',
     'bi:url:ami url:ami',
     'bi:subject:2003 subject:2003',
     'bi:*** ***',
     'bi:encore encore',
     'bi:virus:src="cid: virus:src="cid:',
     'bi:subject:You subject:You',
     'bi:filezilla filezilla',
     'bi:received:3948 received:3948',
     'bi:charset:windows-874 charset:windows-874',
     'bi:content-type:text/plain content-type:text/plain',
     'bi:subject:, subject:,',
     'bi:url:contactus url:contactus',
     'bi:charset:windows-1252 charset:windows-1252',
     'bi:have have',
     'bi:url:catalog url:catalog',
     'bi:or: or:',
     'bi:aid aid',
     'bi:url:sendmail url:sendmail',
     'bi:url:%s url:%s',
     'bi:url:tracking url:tracking',
     'bi:described described',
     'bi:you you',
     'bi:music music',
     'bi:springs springs',
     'bi:any any',
     'bi:charset:us-ascii charset:us-ascii',
     'bi:url:email-reports url:email-reports',
     'bi:url:cgi url:cgi',
     'bi:url:newsletter_2003_oct url:newsletter_2003_oct',
     'bi:indianapolis indianapolis',
     'bi:dev-c++ dev-c++',
     'bi:subject:* subject:*',
     'bi:url:forums url:forums',
     'bi:relix relix',
     'bi:mau mau',
     'bi:subject:: subject::',
     'bi:$$$ $$$',
     'bi:url:signup url:signup',
     'bi:#include #include',
     'bi:%s, %s,',
     'bi:speech speech',
     'bi:content-type:image/gif content-type:image/gif',
     'bi:url:news url:news',
     'bi:record, record,',
     'bi:url:3 url:3',
     'bi:subject:/ subject:/',
     'bi:gaim gaim',
     'bi:bang bang',
     'bi:>> >>',
     'bi:charset:windows-1256 charset:windows-1256',
     'bi:liberopops liberopops',
     'bi:url: url:',
     'bi:subject:spambayes subject:spambayes',
     'bi:url:complaint url:complaint',
     'bi:received:jln@localhost) received:jln@localhost)',
     'bi:free free',
     'bi:coast coast',
     'bi:received:16781 received:16781',
     'bi:following following',
     'bi:url:xdr2 url:xdr2',
     'bi:card card',
     'bi:a1> a1>',
     'bi:unsubscribe unsubscribe',
     'bi:toshiba toshiba',
     'bi:jingle jingle',
     'bi:charset:iso-2022-jp charset:iso-2022-jp',
     'bi:subject:% subject:%',
     'bi:your your']

I suppose some of them might make sense, but most are probably artifacts.
Maybe bigrams should only be generated if the current and previous tokens
differ.
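
Something like the following is what I have in mind. This is only a rough
sketch, not the actual SpamBayes code: the function name with_bigrams and the
skip_repeats flag are made up for illustration, and I'm assuming bigrams are
formed by pairing each token with the one before it as 'bi:prev cur'.

```python
def with_bigrams(tokens, skip_repeats=True):
    """Yield every token, plus a 'bi:prev cur' token for each adjacent pair.

    Hypothetical helper for illustration only.  When skip_repeats is true,
    no bigram is produced for a token that equals its predecessor.
    """
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None and not (skip_repeats and tok == prev):
            yield "bi:%s %s" % (prev, tok)
        prev = tok


stream = ["url:", "url:", "free", "free", "money"]

# Unfiltered pairing produces the doubled-token bigrams listed above.
naive = list(with_bigrams(stream, skip_repeats=False))

# With the check in place, 'bi:url: url:' and 'bi:free free' are dropped.
filtered = list(with_bigrams(stream))

print(naive)
print(filtered)
```

Note that this throws away every doubled-token bigram, including the few that
might be useful clues (say, 'bi:$$$ $$$'), so it would need testing before
anyone relied on it.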

Skip


