[Spambayes] URL parsing improvement ideas
Harri Pesonen
fuerte at sci.fi
Fri Aug 29 10:00:17 EDT 2003
spambayes-request at python.org wrote:
>From: "Meyer, Tony" <T.A.Meyer at massey.ac.nz>
>Subject: RE: [Spambayes] URL parsing improvement ideas
>
>
>
>>> I wonder why 1) was a loss.
>>
>>
>
>I suspect it is because the comment is right, and the presence of %
>escapes is a clue.
>
Yeah, but the idea behind these % escapes must be that the Bayes database
has not seen that combination before. If you have an 8-letter domain,
for example, there are a lot of possible combinations once you %-encode
one or more of the letters.
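
To put a rough number on that: if each character of a host label can be
written either literally or as a %XX escape, an n-character label has 2**n
spellings. A minimal sketch, where the helper name escaped_variants is made
up for illustration and upper/lower-case hex digits (which would add even
more variants) are ignored:

```python
from itertools import product

def escaped_variants(label):
    """Yield every spelling of label in which each character is either
    left as-is or written as a %XX escape (hypothetical helper)."""
    # For each character, the two choices: literal or percent-escaped.
    choices = [(ch, "%%%02X" % ord(ch)) for ch in label]
    for combo in product(*choices):
        yield "".join(combo)

variants = list(escaped_variants("buythis1"))  # an 8-character label
print(len(variants))  # 256, i.e. 2**8
```

Only one of those 256 spellings is the plain one, so a tokenizer that does
not unescape first would see the other 255 as previously unseen tokens.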
>>> Perhaps it should add a special
>>> token when it finds any % escapes, and then replace them.
>>> Care to try this as well?
>>
>>
>
>I'm not sure exactly what you mean. If I have "read%20me.html", do you
>mean there is a token "read", a token "me", a token "html", and a "url:
>has_escape" token?
>
Yes, exactly. But I originally thought only about the host name. There
is no reason to escape the host name; only spammers do it, to confuse Bayes.
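
Something along these lines is what I have in mind for the host name. This
is only a sketch: host_tokens is a made-up function, the token spelling
"url:has_escape" is an assumption, and it uses urllib.parse from current
Python rather than the urllib.unquote call in the quoted patch. It also
does not strip a user:password@ prefix or a :port suffix.

```python
from urllib.parse import unquote, urlsplit

def host_tokens(url):
    """Return tokens for the host part of url, unescaping % escapes and
    adding a marker token when any were present (sketch only)."""
    # netloc is the still-escaped text between "//" and the next "/".
    raw_host = urlsplit(url).netloc
    tokens = []
    if "%" in raw_host:
        tokens.append("url:has_escape")  # hypothetical token name
    host = unquote(raw_host).lower()
    tokens.extend("url:" + part for part in host.split(".") if part)
    return tokens

print(host_tokens("http://www.buy%74his.com/read%20me.html"))
# ['url:has_escape', 'url:www', 'url:buythis', 'url:com']
```

The escape in the path (read%20me.html) is deliberately ignored here, since
only an escaped host name is being treated as the clue.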
>>> Please send me the source code
>>
>>
>
>Note that this might not be (and for 2 *is* not) the fastest/best way to
>do these things. I was just going for a quick implementation to test
>the concepts.
>
Great, thanks a lot! :-)
>>>> > 1) Replace % escapes.
>>>
>>>
>
>I added this after line 985 of tokenizer.py.
>"""
> import urllib
> piece = urllib.unquote(piece)
>"""
>
>
>
>>>> > 2) Find server names for ip addresses.
>>>
>>>
>
>(Results are still coming). I added this after line 984 of tokenizer.py.
>"""
>    if '.' in piece:
>        import socket
>        try:
>            piece = socket.gethostbyaddr(piece)[0]
>        except:
>            pass
>"""
>
>
Do you have that many IP addresses instead of host names in URLs? This
should not take *so* long. Or are you resolving file names as well
(readme.htm)?
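
The test '.' in piece is also true for something like readme.htm, so every
such piece would go to the resolver. A guard like the following might cut
that down. It is a sketch under assumptions: the names is_ipv4_literal and
resolve_if_ip are invented here, only IPv4 dotted quads are handled, and it
uses the ipaddress module from current Python.

```python
import ipaddress
import socket

def is_ipv4_literal(piece):
    """True only for dotted-quad strings such as '192.0.2.1'."""
    try:
        ipaddress.IPv4Address(piece)
    except ValueError:
        return False
    return True

def resolve_if_ip(piece, lookup=socket.gethostbyaddr):
    """Return the host name for an IP literal, else piece unchanged.

    lookup is a parameter so the function can be tried without DNS.
    """
    if not is_ipv4_literal(piece):
        return piece  # e.g. 'readme.htm' never reaches the resolver
    try:
        return lookup(piece)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return piece
```

Caching the answers per address would probably help as well, since the same
address tends to show up in many messages.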
>>>> > 3) Remove numbers from the end of domain names (experimental).
>>>> > www.buythis123.com => url:buythis
>>>
>>>
>
>I added this after line 985 of tokenizer.py.
>"""
>    while chunk and chunk[-1] in '0123456789':
>        chunk = chunk[:-1]
>"""
>
>
>
>>>> > Or add a special token for domains ending with a number.
>>>
>>>
>
>I added this after line 985 of tokenizer.py.
>"""
>    if chunk and chunk[-1] in '0123456789':
>        pushclue("url: ends_in_number")
>"""
>
>=Tony Meyer
>
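
The two variants of 3) could also be combined, so that the digits are
stripped and the special token is still generated. A rough sketch, outside
of tokenizer.py: domain_clues and strip_trailing_digits are names invented
for this example, and a returned list stands in for the pushclue calls.

```python
def strip_trailing_digits(chunk):
    """Return (stem, had_digits) for one domain label."""
    stem = chunk.rstrip("0123456789")
    return stem, stem != chunk

def domain_clues(chunk):
    """Collect the clues for one label instead of calling pushclue."""
    stem, had_digits = strip_trailing_digits(chunk)
    clues = []
    if stem:  # a label made only of digits leaves nothing to emit
        clues.append("url:" + stem)
    if had_digits:
        clues.append("url: ends_in_number")
    return clues

print(domain_clues("buythis123"))
# ['url:buythis', 'url: ends_in_number']
```

Whether the combination does better than either change alone would of
course have to be tested.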