filter valid email addresses

Andrew Dalke adalke at mindspring.com
Sat Oct 11 22:33:52 EDT 2003


Hoang:
> anyone know of an algorithm to filter out real email addresses as opposed to
> computer generated email addresses?  I have been going through past email
> archives in order to find friends email address.  Unfortunately about 75% of
> them are junk addresses or spammer addresses.

Why look only at the email addresses?  Since you have the emails
themselves, try this.  Get SpamBayes or any other system that can
tell ham from spam.  Find the emails whose addresses appear more
than once in the archive.  These are much more likely to be from
your friends.  Use these emails as ham.  From the remaining emails,
identify some of the spam by hand.  Train SpamBayes on both sets and
use it to classify the rest.  The results can be sorted from most
ham-like to most spam-like, making it easier to identify valid emails
and hence valid email addresses.
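
Here is a rough sketch of that workflow in Python, in case it helps.
It does not use SpamBayes itself: the scorer is a tiny naive-Bayes
stand-in written only for illustration, and a real classifier will do
much better.  The message layout (dicts with "from" and "body" keys)
and all function names are assumptions of mine, as is the known_spam
set, which stands for the addresses you have marked as spam by hand.

```python
import math
import re
from collections import Counter
from email.utils import parseaddr


def sender(msg):
    """Bare, lower-cased address from a message's From: value."""
    return parseaddr(msg["from"])[1].lower()


def repeated_senders(messages, min_count=2):
    """Addresses seen at least min_count times; treated as likely friends."""
    counts = Counter(sender(m) for m in messages)
    return {addr for addr, n in counts.items() if addr and n >= min_count}


def tokens(text):
    """Distinct lower-cased word tokens in a message body."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def train(ham_bodies, spam_bodies):
    """Count, per token, how many ham and spam messages contain it."""
    ham = Counter(t for body in ham_bodies for t in tokens(body))
    spam = Counter(t for body in spam_bodies for t in tokens(body))
    return ham, len(ham_bodies), spam, len(spam_bodies)


def spam_score(body, model):
    """Sum of per-token log-odds; higher means more spam-like."""
    ham, n_ham, spam, n_spam = model
    score = 0.0
    for t in tokens(body):
        p_spam = (spam[t] + 1) / (n_spam + 2)  # add-one smoothing
        p_ham = (ham[t] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score


def rank_remaining(messages, known_spam):
    """Score unlabelled messages, sorted from most ham-like to most spam-like."""
    # Repeated senders count as ham unless they were hand-marked as spam.
    friends = repeated_senders(messages) - known_spam
    labelled = friends | known_spam
    ham_bodies = [m["body"] for m in messages if sender(m) in friends]
    spam_bodies = [m["body"] for m in messages if sender(m) in known_spam]
    model = train(ham_bodies, spam_bodies)
    rest = [m for m in messages if sender(m) not in labelled]
    return sorted((spam_score(m["body"], model), sender(m)) for m in rest)
```

Calling rank_remaining(messages, known_spam) returns (score, address)
pairs for everything that was neither a repeated sender nor hand-marked
spam.  Read the list from the top and stop when it turns to junk.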

                    Andrew
                    dalke at dalkescientific.com





