[Spambayes] URL parsing improvement ideas

Fri Aug 29 10:00:17 EDT 2003

spambayes-request at python.org wrote:

>From: "Meyer, Tony" <T.A.Meyer at massey.ac.nz>
>Subject: RE: [Spambayes] URL parsing improvement ideas
>
>>> I wonder why 1) was a loss.
>
>I suspect it is because the comment is right, and the presence of %
>escapes is a clue.
>
Yeah, but the idea behind these % escapes must be that the Bayes database 
has not seen that particular combination before. With an 8-letter domain, 
for example, there are a lot of possible spellings once you %-encode one 
or more of the letters.
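
To put a rough number on that, here is a minimal sketch. It assumes each 
character is either left alone or written as a single upper-case %XX 
escape; the host label "buythis1" and the helper name are made up for 
illustration, and it uses Python 3's urllib.parse.unquote rather than the 
urllib.unquote quoted later in this thread.

```python
from itertools import product
from urllib.parse import unquote


def escaped_variants(name):
    """Yield every spelling of name with each character literal or %XX-escaped."""
    # For each character there are two choices: itself, or its %XX escape.
    choices = [(ch, "%%%02X" % ord(ch)) for ch in name]
    for combo in product(*choices):
        yield "".join(combo)


# "buythis1" is a hypothetical 8-character host label.
variants = set(escaped_variants("buythis1"))
decoded = {unquote(v) for v in variants}

print(len(variants))  # 256, i.e. 2**8 distinct spellings
print(decoded)        # all of them decode to the same string
```

So a tokenizer that does not decode first could see any of those spellings 
as a separate, never-before-seen token.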

>>> Perhaps it should add a special 
>>> token when it finds any % escapes, and then replace them. 
>>> Care to try this as well? 
>
>I'm not sure exactly what you mean.  If I have "read%20me.html", do you
>mean there is a token "read", a token "me", a token "html", and a "url:
>has_escape" token?
>
Yes, exactly. But originally I was thinking only of the host name. There 
is no legitimate reason to escape the host name; only spammers do it, to 
confuse Bayes.
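
Roughly what I have in mind for the host name, as a sketch only. The 
function name host_tokens and the exact token spellings are my 
assumptions, not what tokenizer.py does today, and the code is written 
for Python 3 (urllib.parse.unquote).

```python
import re
from urllib.parse import unquote

# Matches one %XX escape with two hex digits.
ESCAPE_RE = re.compile(r"%[0-9A-Fa-f]{2}")


def host_tokens(host):
    """Return tokens for a URL host name, flagging % escapes (hypothetical helper)."""
    tokens = []
    if ESCAPE_RE.search(host):
        # The escape itself is the clue; record it, then decode.
        tokens.append("url: has_escape")
        host = unquote(host)
    # One token per dot-separated label, skipping empty labels.
    tokens.extend("url:" + part for part in host.lower().split(".") if part)
    return tokens


print(host_tokens("www.buy%74his.com"))
print(host_tokens("www.python.org"))
```

The escaped host then produces the same "url:buythis" token as the plain 
one, plus the extra clue that something was hidden.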

>>> Please send me the source code
>
>Note that this might not be (and for 2 *is* not) the fastest/best way to
>do these things.  I was just going for a quick implementation to test
>the concepts.
>
Great, thanks a lot! :-)

>>>> > 1) Replace % escapes.
>
>I added this after line 985 of tokenizer.py.
>"""
>            import urllib
>            piece = urllib.unquote(piece)
>"""
>
>>>> > 2) Find server names for ip addresses.
>
>(Results are still coming). I added this after line 984 of tokenizer.py.
>"""
>            if '.' in piece:
>                import socket
>                try:
>                    # Reverse DNS lookup; keep the original piece on failure.
>                    piece = socket.gethostbyaddr(piece)[0]
>                except socket.error:
>                    pass
>"""
>
Do you really have that many IP addresses instead of host names in URLs? 
This should not take *so* long. Or are you resolving file names as well 
(readme.htm)? If path pieces reach that code, the '.' test would match 
them too.
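
A stricter check might help here. The following is only a sketch: the 
helper names are hypothetical, it uses the ipaddress module from current 
Python 3 (not available in the Python this thread was written against), 
and the lookup function is passed in as a parameter so it can be swapped 
out when testing.

```python
import ipaddress
import socket


def looks_like_ip(piece):
    """Return True only if piece parses as a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(piece)
    except ValueError:
        return False
    return True


def resolve_if_ip(piece, lookup=socket.gethostbyaddr):
    """Reverse-resolve piece if it is an IP address; otherwise return it unchanged."""
    if not looks_like_ip(piece):
        # Things like "readme.htm" never reach the resolver.
        return piece
    try:
        return lookup(piece)[0]
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses in Python 3.
        return piece
```

With this, resolve_if_ip("readme.htm") returns immediately, and only 
pieces that are literal addresses cost a DNS round trip.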

>>>> > 3) Remove numbers from the end of domain names (experimental).
>>>> >    www.buythis123.com => url:buythis
>
>I added this after line 985 of tokenizer.py.
>"""
>                while chunk and chunk[-1] in '0123456789':
>                    chunk = chunk[:-1]
>"""
>
>>>> > Or add a special token for domains ending with a number.
>
>I added this after line 985 of tokenizer.py.
>"""
>                if chunk and chunk[-1] in '0123456789':
>                    pushclue("url: ends_in_number")
>"""
>
>=Tony Meyer
>
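
The two variants of 3) could also be combined, so the digits are removed 
and the clue is still kept. A minimal sketch, with hypothetical helper 
names and token spellings taken from the examples above:

```python
DIGITS = "0123456789"


def strip_trailing_digits(chunk):
    """Return (chunk without trailing digits, whether any digits were removed)."""
    stripped = chunk.rstrip(DIGITS)
    return stripped, stripped != chunk


def domain_tokens(chunk):
    """Tokens for one domain label: the stripped name plus an optional clue."""
    stripped, had_digits = strip_trailing_digits(chunk)
    tokens = []
    if stripped:
        tokens.append("url:" + stripped)
    if had_digits:
        tokens.append("url: ends_in_number")
    return tokens


print(domain_tokens("buythis123"))
print(domain_tokens("python"))
```

A label made up only of digits yields just the clue token here, since 
nothing is left after stripping.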