[Tutor] regex advise + email validate

Steven D'Aprano steve at pearwood.info
Fri Oct 1 15:20:51 CEST 2010


On Fri, 1 Oct 2010 09:34:01 pm Norman Khine wrote:
> hello, i have this code
>
> http://pastie.org/1193091
>
> i would like to extend this so that it validates TLD's such as
> .travel and .museum, i can do this by changing {2,4} to {2,7} but
> this sort of defeats the purpose of validating the correct email
> address.

The only good advice for using regular expressions to validate email 
addresses is... 

Don't.

Just don't even try.

The only way to validate an email address is to actually try to send 
email to it and see if it can be delivered. That is the ONLY way to 
know if an address is valid.
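
As a rough sketch of what "try to send email to it" might look like in 
Python, something along these lines would do. The function names, the 
noreply sender address, the confirmation-code wording and the 
assumption of a mail server listening on localhost are all mine, purely 
for illustration. Note too that a server accepting the message does not 
prove delivery: bounces can arrive later, which is why you ask the 
recipient to confirm.

```python
import smtplib
from email.message import EmailMessage


def build_confirmation(address, token, sender="noreply@example.com"):
    """Build a confirmation message (sender and wording are placeholders)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = address
    msg["Subject"] = "Please confirm your address"
    msg.set_content("Reply with this code to confirm: %s\n" % token)
    return msg


def send_confirmation(address, token, host="localhost"):
    """Hand the message to an SMTP server; return False if it is refused.

    Assumes a mail server is reachable at `host`. A True result only
    means the server accepted the message, not that it was delivered.
    """
    msg = build_confirmation(address, token)
    try:
        with smtplib.SMTP(host) as server:
            server.send_message(msg)
    except (smtplib.SMTPException, OSError):
        return False
    return True
```

Only treat the address as valid once the recipient sends the code back.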

First off, even if you could easily detect invalid addresses -- and you 
can't, but for the sake of the argument let's pretend you can -- then 
this doesn't help you at all. fred@example.com is syntactically valid, 
but I guarantee that it will *never* be deliverable.

asgfkagfkdgfkasdfg@hdsgfjdshgfjhsdfg.com is syntactically correct, and 
it *could* be a real address, but if you can actually deliver mail to 
it, I'll eat my hat.

If you absolutely must try to detect syntactically invalid addresses, 
the most you should bother is to check that the string isn't blank. If 
you don't care about local addresses, you can also check that it 
contains at least one @ sign. (A little-known fact is that email 
addresses can contain multiple @ signs.) Other than that, leave it up 
to the mail server to validate the address -- which it does by trying 
to deliver mail to it.
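
If you do want that bare-minimum check, a sketch could be as small as 
this. The name looks_plausible and the allow_local flag are made up for 
the example; everything beyond "not blank" and "has an @" is 
deliberately left to the mail server.

```python
def looks_plausible(address, allow_local=False):
    """Return False only for addresses that are obviously unusable.

    This is not validation: it rejects blank strings and, unless local
    addresses are allowed, strings with no @ sign. Nothing else.
    """
    if not address or not address.strip():
        return False
    if not allow_local and "@" not in address:
        return False
    return True
```

Note that it counts "at least one" @ sign rather than "exactly one", so 
an address with more than one is still passed on to the mail server.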

Somebody has created a Perl regex to validate *some* email addresses. 
Even this one doesn't accept all valid addresses, although it comes 
close:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Read it and weep.

See here for more info:

http://northernplanets.blogspot.com/2007/03/how-not-to-validate-email-addresses.html

This is exactly the sort of thing that you should avoid like the plague:

http://www.regular-expressions.info/email.html

This is full of bad advice. This clown arrogantly claims that his 
regex "matches any email address". It doesn't. He then goes on to 
admit "my claim only holds true when one accepts my definition of what 
a valid email address really is". Oh really? What about the RFC that 
*defines* what email addresses are? Shouldn't that count for more than 
the misinformed opinion of somebody who arrogantly dismisses bug 
reports for his regex because it "matches 99% of the email addresses in 
use today"?

99% sounds like a lot, but if 20,000 people use your software, the 
remaining 1% is 200 people whose valid email addresses will be 
misidentified as invalid.

He goes on to admit that his regex wrongly rejects .museum addresses, 
but he considers that acceptable. He seriously suggests that it would 
be a good idea for your program to list all the TLDs, and even all the 
country codes, even though "by the time you read this, the list might 
already be out of date".
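
To see the problem concretely, here is a small demonstration. The 
pattern below is my own assumption of the usual shape of these regexes 
(a character class, an @, a domain, and a 2-4 letter TLD); it is not 
copied from the pastie or from that web page.

```python
import re

# Assumed, typical "email regex" with a 2-4 letter TLD. Illustrative only.
NAIVE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)


def naive_is_valid(address):
    """Return True if the address matches the naive pattern."""
    return NAIVE.match(address) is not None


# Gibberish that will almost certainly never receive mail is accepted...
print(naive_is_valid("asgfkagfkdgfkasdfg@hdsgfjdshgfjhsdfg.com"))
# ...while addresses that are syntactically fine are rejected.
print(naive_is_valid("curator@example.museum"))
print(naive_is_valid("o'brien@example.com"))
```

Widening {2,4} to {2,7} lets .museum and .travel through, but it does 
nothing for the apostrophe, and nothing to tell you whether the mailbox 
exists.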

This is shonky programming. Avoid it like poison.


-- 
Steven D'Aprano

