[Tutor] regex advise + email validate
Steven D'Aprano
steve at pearwood.info
Fri Oct 1 15:20:51 CEST 2010
On Fri, 1 Oct 2010 09:34:01 pm Norman Khine wrote:
> hello, i have this code
>
> http://pastie.org/1193091
>
> i would like to extend this so that it validates TLD's such as
> .travel and .museum, i can do this by changing {2,4} to {2,7} but
> this sort of defeats the purpose of validating the correct email
> address.
The only good advice for using regular expressions to validate email
addresses is...
Don't.
Just don't even try.
The only way to validate an email address is to actually try to send
email to it and see if it can be delivered. That is the ONLY way to
know if an address is valid.
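As a rough sketch of what "try to send email to it" can look like in
Python: build a confirmation message and hand it to a mail server. The
function names, the token, and the SMTP server on localhost port 25 are
all assumptions for illustration, not anything taken from your code.
Bear in mind that the server accepting the message is not proof of
delivery, since a bounce can arrive later. The real test is whether the
recipient acts on the message.

```python
import smtplib
from email.message import EmailMessage


def build_confirmation(to_addr, from_addr, token):
    """Build a confirmation message carrying a one-time token.

    The token is hypothetical: something your application generates
    and later checks when the recipient reports it back.
    """
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = "Please confirm your email address"
    msg.set_content("Your confirmation code is: %s\n" % token)
    return msg


def send_confirmation(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (host and port are assumed).

    Raises an smtplib.SMTPException subclass if the server refuses
    the message outright; silence here does not guarantee delivery.
    """
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

You would call build_confirmation() with whatever the user typed, pass
the result to send_confirmation(), and treat the address as good only
once the code comes back to you.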
First off, even if you could easily detect syntactically invalid
addresses -- and you can't, but for the sake of the argument let's
pretend you can -- it wouldn't help you at all. fred at example.com is syntactically valid,
but I guarantee that it will *never* be deliverable.
asgfkagfkdgfkasdfg at hdsgfjdshgfjhsdfg.com is syntactically correct, and
it *could* be a real address, but if you can actually deliver mail to
it, I'll eat my hat.
If you absolutely must try to detect syntactically invalid addresses,
the most you should bother is to check that the string isn't blank. If
you don't care about local addresses, you can also check that it
contains at least one @ sign. (A little-known fact is that email
addresses can contain multiple @ signs.) Other than that, leave it up
to the mail server to validate the address -- which it does by trying
to deliver mail to it.
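For what it's worth, that minimal check is only a few lines of Python.
This is a sketch, not a validator: the function name and the
allow_local flag are made up for the example, and passing the check
says nothing about whether mail can be delivered.

```python
def looks_plausible(address, allow_local=False):
    """Return False only for input that is obviously unusable.

    Deliberately permissive: anything beyond "not blank" and
    "contains an @" is left to the mail server.
    """
    address = address.strip()
    if not address:
        # A blank (or whitespace-only) string can never be an address.
        return False
    if allow_local:
        # Local addresses such as "fred" have no @ sign at all.
        return True
    # Test for at least one @, not exactly one: quoted local parts
    # may legitimately contain further @ signs.
    return "@" in address
```

Note that it tests for the presence of an @ sign rather than counting
them, so addresses with more than one @ are not rejected.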
Somebody has created a Perl regex to validate *some* email addresses.
Even this one doesn't accept all valid addresses, although it comes
close:
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
Read it and weep.
See here for more info:
http://northernplanets.blogspot.com/2007/03/how-not-to-validate-email-addresses.html
This is exactly the sort of thing that you should avoid like the plague:
http://www.regular-expressions.info/email.html
This is full of bad advice. This clown arrogantly claims that his
regex "matches any email address". It doesn't. He then goes on to
admit "my claim only holds true when one accepts my definition of what
a valid email address really is". Oh really? What about the RFC that
*defines* what email addresses are? Shouldn't that count for more than
the misinformed opinion of somebody who arrogantly dismisses bug
reports for his regex because it "matches 99% of the email addresses in
use today"?
99% sounds like a lot, but if 20,000 people use your software, that
leaves 1%, or 200 people, whose valid email addresses will be
misidentified.
He goes on to admit that his regex wrongly rejects .museum addresses,
but he considers that acceptable. He seriously suggests that it would
be a good idea for your program to list all the TLDs, and even all the
country codes, even though "by the time you read this, the list might
already be out of date".
This is shonky programming. Avoid it like poison.
--
Steven D'Aprano